embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

[MIEB] Too many workers? #1331

Open Muennighoff opened 2 weeks ago

Muennighoff commented 2 weeks ago
Task: STS17MultilingualVisualSTS, split: test, subset: ko-ko. Running...
/env/lib/conda/gritkto4/lib/python3.10/site-packages/torch/utils/data/dataloader.py:557: UserWarning: This DataLoader will create 104 worker processes in total. Our suggested max number of worker in current system is 23, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.

The problem is with num_workers=math.floor(os.cpu_count() / 2), I think. The run froze for me shortly after.
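
For reference, a minimal sketch of how that expression scales with the machine (the 208-CPU figure is an assumption inferred from the 104-worker warning above, not something from the mteb code):

```python
import math
import os

# Rough illustration: on a machine where os.cpu_count() returns 208,
# half the CPU count gives 104 workers, far above the ~23 that the
# DataLoader warning suggests for this system.
num_workers = math.floor(os.cpu_count() / 2)
print(num_workers)
```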

Muennighoff commented 2 weeks ago

Same here

INFO:mteb.evaluation.evaluators.Image.Any2AnyRetrievalEvaluator:Encoding Queries.
/env/lib/conda/gritkto4/lib/python3.10/site-packages/torch/utils/data/dataloader.py:557: UserWarning: This DataLoader will create 104 worker processes in total. Our suggested max number of worker in current system is 23, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
gowitheflow-1998 commented 2 weeks ago

Thanks for raising this. I haven't found a number or approach that works well across machines since converting everything to DataLoaders. We could perhaps do something like num_workers=min(math.floor(os.cpu_count() / 2), 16) so that it doesn't freeze on machines with a massive number of CPUs?
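
A minimal sketch of that cap, assuming the worker count is computed in one helper before the DataLoader is built (the function name and dummy dataset here are illustrative, not the actual mteb code):

```python
import math
import os

import torch
from torch.utils.data import DataLoader, TensorDataset


def default_num_workers(cap: int = 16) -> int:
    # Use half the visible CPUs, but never more than `cap`, so machines
    # with very high core counts don't spawn an excessive number of workers.
    return min(math.floor(os.cpu_count() / 2), cap)


# Illustrative usage with a dummy dataset; the real evaluators would pass
# their image/text datasets here instead.
dataset = TensorDataset(torch.arange(100).float())
loader = DataLoader(dataset, batch_size=8, num_workers=default_num_workers())
```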

Muennighoff commented 1 week ago

I still get occasional freezes when running mieb tasks despite the change to never use more than 16 workers. 🤔 It is mentioned here that multiple workers may not help if the data is already loaded: https://discuss.pytorch.org/t/dataloader-with-num-workers-1-hangs-every-epoch/20323/16 ; I think the data (https://github.com/embeddings-benchmark/mteb/blob/a449b244ed964ba277ef83047d5f53fa588045c0/mteb/evaluation/evaluators/Image/Any2AnyRetrievalEvaluator.py#L44) is already loaded, so it gets copied num_workers times, which may lead to freezes as memory runs out.

Maybe it is worth checking that.
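
If the dataset really is already materialized in memory, a sketch of what single-process loading would look like; with num_workers=0 nothing is forked, so the in-memory data is not duplicated per worker (the `dataset` variable and batch size are assumptions for illustration, not the evaluator's actual values):

```python
from torch.utils.data import DataLoader

# Assumption: `dataset` already holds the decoded images/text in memory,
# as in Any2AnyRetrievalEvaluator. With the data resident in the parent
# process, extra worker processes mostly duplicate it rather than speed
# up loading.
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=0,    # load in the main process; no per-worker copies
    pin_memory=True,  # still overlap host-to-GPU copies if a GPU is used
)
```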