embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Thoughts on accelerating the evaluation speed: supporting DDP/FSDP model inference + multi-task evaluation in parallel #883

Closed ShengYun-Peng closed 4 months ago

ShengYun-Peng commented 4 months ago

Since models nowadays can easily reach several billion parameters, I'm curious whether MTEB will support DDP/FSDP model inference instead of just DP. DDP/FSDP is more intrusive into the code, requiring modifications on both the dataset side and the model side. Any suggestions on this?

KennethEnevoldsen commented 4 months ago

MTEB allows you to implement any model you want, requiring only that it minimally implements an encode interface. This e.g. allows you to use multiple GPUs (from the docs), as you note:

Using multiple GPUs in parallel can be done by just having a custom encode function that distributes the inputs to multiple GPUs like e.g. here or here.
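For example, with sentence-transformers this can be a thin wrapper (a hedged sketch; the model name and the wrapper class are illustrative, and the exact MTEB invocation may differ by version):

```python
import mteb
from sentence_transformers import SentenceTransformer


class MultiGPUModel:
    """Illustrative wrapper that spreads encoding over all visible GPUs."""

    def __init__(self, model_name: str):
        self.model = SentenceTransformer(model_name)
        # One worker process per visible CUDA device.
        self.pool = self.model.start_multi_process_pool()

    def encode(self, sentences, **kwargs):
        # encode_multi_process chunks the sentences, scatters them to the pool,
        # and concatenates the resulting embeddings in the original order.
        return self.model.encode_multi_process(
            sentences, self.pool, batch_size=kwargs.get("batch_size", 32)
        )


model = MultiGPUModel("sentence-transformers/all-MiniLM-L6-v2")
evaluation = mteb.MTEB(tasks=["Banking77Classification"])
evaluation.run(model)
```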

Is there a reason why such an approach would not work for DDP/FSDP?

ShengYun-Peng commented 4 months ago

Hi, thanks for the quick response! DP only requires a one-line change in the model, but this is not the case for DDP/FSDP. The latter requires a distributed sampler when loading the dataset so that each GPU knows which portion of the dataset to load. After model inference, the pipeline needs to gather the outputs from all GPUs and run the evaluation on the combined outputs. It's a bit more intrusive in code than DP. A simple DDP distributed dataset example is here.
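Roughly, the inference side looks like this (an illustrative sketch, not MTEB code; `corpus` and `encode_fn` are placeholders for a task's documents and a model's per-batch encoding function, and the script is assumed to be launched with torchrun, one process per GPU):

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Each rank sees a disjoint shard of the corpus; the sampler pads with a few
# repeated samples so every rank ends up with shards of equal length.
sampler = DistributedSampler(corpus, shuffle=False)
loader = DataLoader(corpus, batch_size=64, sampler=sampler)

local = []
with torch.no_grad():
    for batch in loader:
        local.append(encode_fn(batch).to(rank))
local = torch.cat(local)

# Gather every rank's shard so rank 0 can evaluate on the full set of embeddings.
gathered = [torch.empty_like(local) for _ in range(dist.get_world_size())]
dist.all_gather(gathered, local)
if rank == 0:
    embeddings = torch.cat(gathered)
```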

ShengYun-Peng commented 4 months ago

Correct me if I'm wrong: is the current evaluation pipeline evaluating all input tasks sequentially? The while loop here and the task deletion at the end seem to confirm my assumption. If so, I'm curious whether the evaluation pipeline could be run in parallel for multiple independent tasks.

KennethEnevoldsen commented 4 months ago

Correct me if I'm wrong: is the current evaluation pipeline evaluating all input tasks sequentially? The while loop here and the task deletion at the end seem to confirm my assumption. If so, I'm curious whether the evaluation pipeline could be run in parallel for multiple independent tasks.

There is no reason why the while loop couldn't be an async map operation, as far as I am aware. If you have an approach in mind that we could build in to accommodate faster processing, I would be very open to it.
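For illustration, a hedged sketch of what that could look like: mapping independent tasks over a small thread pool (`run_task` is a hypothetical helper standing in for the per-task evaluation step, not current MTEB API):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_task(task, model):
    """Hypothetical helper: encode the task's data with `model` and score it."""
    ...


def run_all_parallel(tasks, model, max_workers=4):
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_task, task, model): task for task in tasks}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

Note that with a single GPU the tasks would still contend for the same device during encoding, so this mainly helps when encoding is cheap and the time goes into data loading and scoring.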

For smaller models running through the benchmarks, the time is predominantly spent on downloading datasets.

ShengYun-Peng commented 4 months ago

Thanks so much, @KennethEnevoldsen! I don't have a specific method in mind, so I want to hear your thoughts on that. Also, I'm wondering if LLM2Vec has been incorporated into the testing pipeline. I'm new to MTEB, so I am ramping up on both the mteb repo and the mtebscripts repo.

KennethEnevoldsen commented 4 months ago

Hmm, my guess is that processing the documents pr. task is the real bottleneck for larger models, so I don't think you gain too much by parallelizing tasks (except for smaller models, which aren't bottlenecked by the encoding time).

Generally, MTEB doesn't implement the model (it just specifies an interface to which the model has to adhere). However, we have recently added a model registry to keep implementations of models for reproducibility. Future models will probably also be implemented. LLM2Vec is not implemented, but you can see an implementation on the model repo (e.g. this one).

In fact you can add a model to MTEB by just adding the scores to your model card.

ShengYun-Peng commented 4 months ago

Thanks so much! What is "documents pr. tasks" in "... documents pr. tasks is the real bottleneck for larger models ..." above?

Also, it looks like a model's encode function needs at least batch_size as one of the args, besides sentences, based on the evaluator code here. Is there a full list of args that model.encode should take, i.e., a function signature?

KennethEnevoldsen commented 4 months ago

Ahh, sorry. For each task you typically have to encode between 100 and >100,000 documents. This is typically the bottleneck in most cases. You can parallelize this process however you like (no need to use the batch_size argument in your encode function).

A function signature is available here: https://github.com/embeddings-benchmark/mteb/blob/437d6df49406a388abe6494eeaecd56c50c8dbe3/mteb/encoder_interface.py
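For a rough picture, a minimal custom model only needs an `encode` method, and extra keyword arguments such as `batch_size` can be honored or ignored (a sketch, assuming sentence-transformers as the backing model; consult the linked file for the exact signature and keyword arguments MTEB passes):

```python
from typing import Any

import numpy as np
from sentence_transformers import SentenceTransformer


class MyModel:
    """Minimal custom model: MTEB only requires an `encode` method."""

    def __init__(self) -> None:
        self._model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    def encode(self, sentences: list[str], **kwargs: Any) -> np.ndarray:
        # MTEB's evaluators pass extra kwargs such as `batch_size`; you can use
        # them or parallelize the encoding however you like.
        return self._model.encode(sentences, batch_size=kwargs.get("batch_size", 32))
```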

ShengYun-Peng commented 4 months ago

Hi @KennethEnevoldsen, thanks for answering all my questions above! I'm curious whether MTEB is considering migrating the evaluation pipeline onto PyTorch so that we can use the GPU to speed up the downstream classification and clustering tasks. Currently, everything is implemented in sklearn on CPU.

KennethEnevoldsen commented 4 months ago

@ShengYun-Peng I don't believe this is the time-consuming part of the process (if you find evidence that it is, do let us know). At the moment we are focusing on speeding up the tasks where we believe, and have estimates showing, that the most time is spent. This is notably retrieval tasks (see #836); previously we e.g. sped up clustering by 7x-100x (see #481 and #835) and download speeds (#651).

I don't believe fitting the classifier/clustering is the slowest part of the process, so I don't believe that will be a priority for us. However, we do encourage PRs.

A few pointers for such a PR: sklearn is starting to support PyTorch arrays for certain classifiers, and a potential PR might implement it such that we just use PyTorch arrays throughout the codebase, allowing for GPU compute. However, it should be noted that many of the sklearn models that we do use do not seem to be compatible (yet) with PyTorch arrays.
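For reference, scikit-learn's array API dispatch already lets a few estimators operate directly on PyTorch tensors (a hedged sketch; it requires the array-api-compat package, and which estimators are supported depends on the sklearn version):

```python
import torch
from sklearn import config_context
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Embeddings and labels as torch tensors; move to "cuda" if a GPU is available.
X = torch.randn(1000, 384)
y = torch.randint(0, 10, (1000,))

# With array_api_dispatch enabled, supported estimators keep the computation in
# the input array's namespace (and on its device) instead of converting to numpy.
with config_context(array_api_dispatch=True):
    clf = LinearDiscriminantAnalysis()
    clf.fit(X, y)
    preds = clf.predict(X)  # returns a torch tensor
```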

ShengYun-Peng commented 4 months ago

Thanks so much! I'll check that. Meanwhile, what is the ideal way/purpose of using mtebscripts?

KennethEnevoldsen commented 4 months ago

This seems like a new discussion so will close this one down. Feel free to create a discussion on it though (generally I think you can use MTEB as is without mtebscripts).

ShengYun-Peng commented 4 months ago

@ShengYun-Peng I don't believe this is the time-consuming part of the process (if you find evidence that it is, do let us know). At the moment we are focusing on speeding up the tasks where we believe, and have estimates showing, that the most time is spent. This is notably retrieval tasks (see #836); previously we e.g. sped up clustering by 7x-100x (see #481 and #835) and download speeds (#651).

I don't believe fitting the classifier/clustering is the slowest part of the process, so I don't believe that will be a priority for us. However, we do encourage PRs.

A few pointers for such a PR: sklearn is starting to support PyTorch arrays for certain classifiers, and a potential PR might implement it such that we just use PyTorch arrays throughout the codebase, allowing for GPU compute. However, it should be noted that many of the sklearn models that we do use do not seem to be compatible (yet) with PyTorch arrays.

FYI, LLM2Vec paper has stated that a full evaluation on MTEB took ~30-40h. I think it's a good example for time-consuming evaluation as LLM evaluation is becoming more and more popular on the benchmark now.

KennethEnevoldsen commented 4 months ago

FYI, LLM2Vec paper has stated that a full evaluation on MTEB took ~30-40h. I think it's a good example for time-consuming evaluation as LLM evaluation is becoming more and more popular on the benchmark now.

I totally agree with this; we are also heavily focusing on speed in the upcoming MMTEB release. However, since most of the time for these models is spent in the encode step, the primary speed-up comes from reducing dataset sizes to a reasonable level while maintaining high power. Again, restating:

At the moment we are focusing on speeding up the tasks where we believe, and have estimates showing, that the most time is spent. This is notably retrieval tasks (see https://github.com/embeddings-benchmark/mteb/issues/836); previously we e.g. sped up clustering by 7x-100x (see https://github.com/embeddings-benchmark/mteb/pull/481 and https://github.com/embeddings-benchmark/mteb/issues/835) and download speeds (https://github.com/embeddings-benchmark/mteb/issues/651).