embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Paper segment: Speeding up retrieval #836

Open KennethEnevoldsen opened 1 month ago

KennethEnevoldsen commented 1 month ago

The goal of this segment is to:

1. subsample retrieval datasets
2. examine if scores correlate in terms of model rank (probably related to #835)
3. show a meaningful speedup (might be worth also examining #793)

KennethEnevoldsen commented 1 month ago

Hi @orionw, it seems like this segment is next on the agenda - it might be nice to have it in before we start running the larger models

(e.g. #838 seems dependent on this)

imenelydiaker commented 1 month ago
> 3. show a meaningful speedup (might be worth also examining Speed up Reranking tasks #793)

I think this can be done easily: for reranking, it was just a matter of hashing the texts so that each one is encoded once instead of every time it is used. Can we create a task/issue about this?
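A minimal sketch of that hashing idea, assuming a sentence-transformers-style `model.encode` (an illustration, not the actual mteb implementation):

```python
import hashlib

import numpy as np


def encode_unique(model, texts):
    """Encode each distinct text once and reuse the cached embedding for duplicates."""
    keys, seen = [], {}
    for text in texts:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        keys.append(key)
        seen.setdefault(key, text)
    unique_keys = list(seen)
    # One encode call over the unique texts only
    embeddings = model.encode([seen[k] for k in unique_keys])
    cache = dict(zip(unique_keys, embeddings))
    # Reassemble the embeddings in the original order, repeats included
    return np.stack([cache[k] for k in keys])
```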

The rest of this issue is very experimental.

KennethEnevoldsen commented 1 month ago

@imenelydiaker it is not quite that trivial, since documents are encoded in chunks - so you would need a bit more handling, but otherwise you are correct.

orionw commented 1 month ago

@KennethEnevoldsen I was planning to work on it this week, starting with a medium-sized dataset. Will post the results here in this issue.

If this is too slow, someone else can feel free to start - apologies about that!

KennethEnevoldsen commented 1 month ago

I think it is fine. I am just making sure that the ball is not dropped.

isaac-chung commented 1 month ago

I figured there's enough overlap in methodology between this and #835 (clustering) that I'd listen in on this thread. Encoding sentences and paragraphs should not have the same issue as encoding documents, so hashing the text would be a good first step for clustering.

What I was not clear on was model rank. This is how I interpret it: We would use, say, 3 models (ones we have results for), run the "fast" version of the tasks within e.g. the English benchmark, and then somehow get the ranks of the models? I think this is where I'm lost.

There are a few observations from the ClusterFast results that I can share more in #835, just to keep the discussion here separate.

KennethEnevoldsen commented 1 month ago

So regarding comparing ranks (let us leave significance out for now):

We can obtain the model ranks in two ways:

1. rank them according to their score (v_measure) on task A (e.g. arxivClusteringP2P)
2. rank them according to their score on task A.v2 (e.g. arxivClusteringP2P.v2)

These two methods should then yield a Spearman rank correlation of ~1. If we just rank them, I believe it is equivalent to computing the Spearman rank correlation directly on the v_measure scores, though if we use a significance-aware ranking it is not. This can, however, only be done for clustering, as we don't bootstrap retrieval scores (though we could).
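For illustration, the check boils down to something like this (model names and scores below are made up):

```python
from scipy.stats import spearmanr

# Hypothetical v_measure scores for the same models on the full task (A)
# and on its subsampled counterpart (A.v2).
scores_full = {"model-small": 0.38, "model-base": 0.42, "model-large": 0.47}
scores_fast = {"model-small": 0.40, "model-base": 0.44, "model-large": 0.49}

models = list(scores_full)
rho, p = spearmanr(
    [scores_full[m] for m in models],
    [scores_fast[m] for m in models],
)
print(f"Spearman rank correlation: {rho:.2f} (p={p:.3f})")  # ~1.0 if the ranking is preserved
```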

isaac-chung commented 1 month ago

Got it, thanks. As for which models to use here, are there any criteria we should apply, e.g. 3 random models that sit next to one another on the LB (e.g. ranks 4, 5, 6)? Or could we just use e5 small/base and paraphrase-multilingual-MiniLM-L12-v2?

KennethEnevoldsen commented 1 month ago

I would probably go with a small, a medium, and a large model, plus a "close relative" of the large one (we want to be able to differentiate: if we compute a Spearman correlation matrix, we should see perfect correlation for small, medium, and large, and close to perfect for the large model and its relative)

orionw commented 4 weeks ago

Okay, some results on this.

Background

I wanted to analyze this along a few different axes (maybe overkill, let me know):

  1. Number of relevant documents per query. Some have a lot, some only a few. This will be heavily impacted if we reduce the collection size.
  2. Variation in model performance. We want to be sure rank is consistent even if scores are not (as the task will get easier when we remove more documents).
  3. Amount of hard negatives. Ideally, we'd have a pool of 3-5 of the best and medium models giving us negatives to keep the task hard.
  4. Since we are only doing this for large datasets, I am going to ignore smaller datasets and focus only on those with a collection of >100k documents.

Caveats: I wasn't able to do (3) due to GPU constraints, and I was only able to run a moderate number of models for (2).

Results

Results are in a commentable spreadsheet, as they were too big for GitHub.

I picked two datasets to check out for (1): TREC-Covid (~500 relevant docs per query) and NQ (~1 relevant doc per query). I then used E5-large-v2 to gather hard negatives, optionally adding all the true positives before pooling. Ideally, I would have done this with multiple models, but I was running into GPU/time constraints.
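Roughly, the pooling looks like the sketch below. BEIR-style `corpus`/`queries`/`qrels` dicts are assumed, and the `retriever.retrieve` call is a stand-in for running E5-large-v2 (or any other model); this is not the actual mteb-lite code:

```python
# corpus = {doc_id: {"title": ..., "text": ...}}, queries = {qid: text},
# qrels = {qid: {doc_id: relevance}}

def build_lite_corpus(corpus, queries, qrels, retriever, k=100, add_true_positives=True):
    """Keep only the top-k retrieved docs per query (hard negatives from one model),
    optionally unioned with all judged-relevant docs."""
    keep = set()
    for qid, query in queries.items():
        top_doc_ids = retriever.retrieve(query, top_k=k)  # hypothetical retriever API
        keep.update(top_doc_ids)
        if add_true_positives:
            keep.update(doc_id for doc_id, rel in qrels.get(qid, {}).items() if rel > 0)
    return {doc_id: corpus[doc_id] for doc_id in keep}
```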

Ranking Performance

The results show that for datasets with a large number of relevant documents, you quickly lose ranking ability. At 100 docs per query (compared to ~500 relevant docs per query), the system already loses some ranking ability. However, for NQ you can reduce it all the way down to 2 docs per query and keep the ranking stable -- pretty wild.

Score Deltas

If you look at the raw scores, you can see that removing documents makes the task much easier, especially for NQ (only moderately so for TREC-Covid). For example, at 2 documents per query for NQ, scores are in the mid 80s instead of the 50s. However, it does seem like you can reduce it to 50-100 docs per query and maintain score numbers similar to the original performance.

Adding true positives to the document pool

This seemed to make little difference, but it does keep scores from inflating as fast. Seems like we should do this.

Takeaway and Next Steps

Even with just one moderately performing model (E5-large-v2), we can see good results using hard negatives. If we want a "lite" MTEB, I would recommend we pick some number (perhaps 100-200 docs per query) and use that, in order to maintain ranking performance and similar score ranges (after all, if models are getting 0.90s on a task, it will quickly become saturated). However, we should make sure that the number of docs per query is equal to or larger than the average number of relevant documents per query.
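That last constraint could be sketched as follows (a hypothetical helper, with `qrels` assumed to map query IDs to `{doc_id: relevance}`):

```python
import math


def pick_docs_per_query(qrels, default_k=100):
    """Use the suggested default pool size (e.g. 100-200 docs per query),
    but never drop below the average number of judged-relevant docs per query."""
    avg_rel = sum(
        sum(1 for rel in judgments.values() if rel > 0) for judgments in qrels.values()
    ) / max(len(qrels), 1)
    return max(default_k, math.ceil(avg_rel))
```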

I currently don't have too many good machines at my disposal, so results on NQ for larger models are still running. It would also be great to verify that if we use 3-5 models for pooling these results stay the same or improve, but I'm pretty confident that will be true.

If @vaibhavad has access to more GPUs than me at the moment, perhaps we can team up. I have scripts that can do most of this automatically if we want to run other experiments or downsample the real datasets.

The experimental repo is here: https://github.com/orionw/mteb-lite

KennethEnevoldsen commented 4 weeks ago

As for compute, @Muennighoff or @vaibhavad is probably the best bet (alternatively, @mrshu might be an option as well)

orionw commented 3 weeks ago

> @orionw I would probably go a little higher (500 documents) assuming that there will also be an overlap in the documents

For some datasets, 500 would be the whole corpus (e.g. TREC-Covid, and I think MIRACL).

> How big is the reduction in documents for the different approaches? It might be worth doing e.g. 500 and a max on N queries

The docs are mostly distinct, so it's just the number of queries multiplied by the number of docs per query.
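Purely as an illustration: 2,800 queries at 250 docs per query would cap the reduced corpus at 2,800 × 250 = 700,000 documents (before any overlap).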

KennethEnevoldsen commented 3 weeks ago

Might be relevant to look into reducing the number of queries as well then?

mrshu commented 3 weeks ago

Happy to help with the compute!

orionw commented 3 weeks ago

Thanks @mrshu, let's coordinate on some of the next steps once we figure them out.

@KennethEnevoldsen Do we have some table where the new dataset sizes are listed? I think we don't have qrel information in those automatic numbers, but I can calculate them on the side. I think it would be helpful to know what we're working with here.

For example, TREC-Covid has only 50 queries, so reducing it would be pretty detrimental. However, NQ has ~2800 so we could perhaps reduce it by half or so.

I still think 500 docs per query is too conservative -- if we're going to be changing the dataset and making it incomparable to all previous results, I would opt for a more aggressive cut to make it worth it (at most 250 or max of the number of relevant documents).

Between lowering queries and lowering docs though we should at least be able to reduce it to a quarter of the time, which is significant.

KennethEnevoldsen commented 3 weeks ago

> @KennethEnevoldsen Do we have some table where the new dataset sizes are listed? I think we don't have qrel information in those automatic numbers, but I can calculate them on the side. I think it would be helpful to know what we're working with here.

We don't, but it might be ideal to reformat the sizes that we post for retrieval so that we actually know this.

> I still think 500 docs per query is too conservative -- if we're going to be changing the dataset and making it incomparable to all previous results, I would opt for a more aggressive cut to make it worth it (at most 250 or max of the number of relevant documents).

You are probably right. I am fine with this as well. (The results seem more stable than for clustering, which is probably because retrieval was better formulated from the start.)

> Between lowering queries and lowering docs though we should at least be able to reduce it to a quarter of the time, which is significant.

A quarter would be wonderful. I believe MTEB took >44 hours for LLM2Vec, so a 4x speedup would be great (naturally, not all of that is retrieval).

vaibhavad commented 2 weeks ago

@orionw - happy to help with the compute as well. Let's coordinate on this soon!