Just the indexing code for now (will add the rest tomorrow), but opening the draft PR in case you wanted to take a look at this before the rest comes in!
Goal of the PR
Add support for ColBERT models, starting with Answer.AI's ColBERT-small, served through an API that Answer.AI will host (discussed with @okhat, who is also okay with this being the first ColBERT representative). The aim is to see how multi-vector models of various sizes fare on this benchmark. The querying mechanism within the API is very simple and lives at AnswerDotAI/mteb_arena_colbert_api.
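For context on what "multi-vector" means here: ColBERT-style models keep one embedding per token and score a document with late interaction (MaxSim), i.e. for each query token take the best-matching document token and sum those maxima. The snippet below is only a plain-Python illustration of that scoring rule; the function names are made up for this example and it is not the code used in this PR or in the hosted API.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vectors: Sequence[Vector], doc_vectors: Sequence[Vector]) -> float:
    """Late-interaction score: sum over query tokens of the max similarity to any doc token.

    Illustrative helper (hypothetical name), assuming embeddings are already computed.
    """
    if not doc_vectors:
        return 0.0
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)


# Toy example with 2-dimensional "token embeddings".
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.5, 0.5]]
print(maxsim_score(query, doc))  # 1.0 (first query token) + 0.5 (second) = 1.5
```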
Changes
The PR relies on an external API that hosts and queries the index and simply returns documents. It doesn't change the logic of any existing mechanisms.
It adds the ColBERT indexing code for full reproducibility.
TODO: Add the querying mechanism, using API calls to fetch the highest-scoring document for a given query.
TODO: Add utilities to download the pre-built Wikipedia indexes so they can be queried locally.
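To make the planned querying step a bit more concrete, here is a rough sketch of what a client-side call could look like. Everything specific in it is an assumption for illustration: the URL, the `query`/`k` request fields, and the `documents`/`score` response fields are placeholders, not the actual schema of AnswerDotAI/mteb_arena_colbert_api. The transport is injectable so the logic can be exercised without a network call.

```python
import json
from typing import Callable, Optional
from urllib import request

# Hypothetical endpoint: a placeholder, not the real API address.
API_URL = "https://example.invalid/query"


def http_post_json(url: str, payload: dict) -> dict:
    """POST a JSON payload and decode the JSON response (standard library only)."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def fetch_top_document(
    query: str,
    post: Callable[[str, dict], dict] = http_post_json,
    url: str = API_URL,
) -> Optional[dict]:
    """Return the highest-scoring document for `query`, or None if nothing comes back.

    Assumes (hypothetically) a response shaped like
    {"documents": [{"text": ..., "score": ...}, ...]}.
    """
    response = post(url, {"query": query, "k": 1})
    documents = response.get("documents", [])
    if not documents:
        return None
    # Pick the best document explicitly rather than relying on server-side ordering.
    return max(documents, key=lambda d: d["score"])
```

Passing a stub for `post` keeps this testable offline, which may also be handy once the pre-built indexes can be queried locally instead of through the API.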
Hey @Muennighoff!