embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Versioning for AbsTasks #851

Open x-tabdeveloping opened 1 month ago

x-tabdeveloping commented 1 month ago

Currently, AbsTasks only differ by name. Whenever we want to update an abstask, there are two things we have to ensure:

  1. Most people and new tasks use the new version.
  2. Results on older tasks should preferably not change, especially after we start running the models.

Our solution for this with clustering was to introduce a completely new task and then open an issue for converting all tasks to the new version. There is currently a similar problem with PairClassification (#756): again, the new version is not backwards compatible and would produce different results from the previous one. @KennethEnevoldsen suggested that we could have a test that checks whether a new task uses an old version. This would have to be hard-coded, though, which is not too nice in my view. @Muennighoff also suggested that we rename the outdated abstask, but then we would need to rewrite every old file each time we update an AbsTask.

I have multiple suggestions for what we could do here:

  a) We could do the thing we already do with concrete tasks, and go down the superseeded_by road.
  b) We could introduce some versioning system for abstasks.
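To make option a) concrete, here is a minimal sketch of what a `superseeded_by` pointer on abstasks might look like. Everything in it is an assumption for illustration: the two classes are local stand-ins rather than the real mteb classes, and `REGISTRY` and `latest_abstask` are hypothetical names.

```python
class AbsTaskClustering:
    """Local stand-in for the older abstask (not the real mteb class)."""
    superseeded_by = "AbsTaskClusteringFast"  # name of the replacement


class AbsTaskClusteringFast:
    """Local stand-in for the newer abstask."""
    superseeded_by = None  # nothing newer exists


# Hypothetical lookup table from class name to class.
REGISTRY = {cls.__name__: cls for cls in (AbsTaskClustering, AbsTaskClusteringFast)}


def latest_abstask(cls):
    """Follow superseeded_by pointers until reaching the newest abstask."""
    seen = set()
    while cls.superseeded_by is not None:
        if cls.__name__ in seen:
            raise ValueError(f"cycle in superseeded_by chain at {cls.__name__}")
        seen.add(cls.__name__)
        cls = REGISTRY[cls.superseeded_by]
    return cls
```

With something like this, new tasks could be pointed at `latest_abstask(AbsTaskClustering)` while old tasks keep their original base class and therefore their original results.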

It has been suggested that we could add V1, V2, etc. to task names. I'm not sure I like this, because then we would have to rename every currently existing abstask.

It would be really nice to have a version attribute on AbsTasks, and then an abstask_version attribute on particular tasks. This would, however, be incredibly hacky to do with the current way we manage these things (inheritance).
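A rough sketch of the attribute idea, under loud assumptions: the class names, `version`, `abstask_version`, and `is_outdated` are all hypothetical here and are not existing mteb API.

```python
class AbsTaskPairClassification:
    """Hypothetical stand-in for an abstask (not the real mteb class)."""
    version = 2  # assumed: bumped whenever evaluation results would change


class LegacyPairTask(AbsTaskPairClassification):
    abstask_version = 1  # version this task's published results were produced with


class NewPairTask(AbsTaskPairClassification):
    abstask_version = 2


def is_outdated(task_cls):
    """Return True if the task is pinned to an older abstask version."""
    return task_cls.abstask_version < task_cls.version
```

Note that this only detects the mismatch. `LegacyPairTask` still inherits the current (version 2) implementation, so pinning `abstask_version = 1` does not by itself reproduce the old results, which is presumably where the hackiness with inheritance comes in.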

Do you guys have any good ideas on how to solve this? I'm currently trying to write a new AbsTask for PairClassification, and it would be really nice if we could figure some system out.

KennethEnevoldsen commented 1 month ago

I would just add a failing test stating "Please don't use AbstaskClustering but instead use AbstaskClusteringFast" with an exception list of all old tasks of the type.
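Such a test could look roughly like the sketch below. The classes are local stand-ins named after the ones mentioned above, and `ALLOWED_LEGACY_TASKS`, `ALL_TASKS`, and `find_violations` are hypothetical names; in the real repository the task list would come from wherever tasks are registered.

```python
class AbsTaskClustering:
    """Local stand-in for the old abstask."""


class AbsTaskClusteringFast:
    """Local stand-in for the new abstask."""


class OldClusteringTask(AbsTaskClustering):
    """An existing task whose results should not change."""


class NewClusteringTask(AbsTaskClusteringFast):
    """A task written against the new abstask."""


# Hard-coded exception list: tasks allowed to keep the old abstask.
ALLOWED_LEGACY_TASKS = {"OldClusteringTask"}

# Assumed to be collected from the task registry in practice.
ALL_TASKS = [OldClusteringTask, NewClusteringTask]


def find_violations(task_classes):
    """Return names of tasks that use the old abstask without being exempt."""
    return [
        cls.__name__
        for cls in task_classes
        if issubclass(cls, AbsTaskClustering)
        and cls.__name__ not in ALLOWED_LEGACY_TASKS
    ]


def test_no_new_task_uses_old_abstask():
    violations = find_violations(ALL_TASKS)
    assert not violations, (
        "Please don't use AbsTaskClustering but instead use "
        f"AbsTaskClusteringFast: {violations}"
    )
```

The exception list is the hard-coded part: it has to be written once per deprecated abstask, but it never needs to grow afterwards.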