Open Muennighoff opened 2 weeks ago
Same here
INFO:mteb.evaluation.evaluators.Image.Any2AnyRetrievalEvaluator:Encoding Queries.
/env/lib/conda/gritkto4/lib/python3.10/site-packages/torch/utils/data/dataloader.py:557: UserWarning: This DataLoader will create 104 worker processes in total. Our suggested max number of worker in current system is 23, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
Thanks for raising this. I haven't found a number or approach that works across machines since turning everything into dataloaders.
We could perhaps do something like `num_workers=min(math.floor(os.cpu_count() / 2), 16)`
so that it doesn't freeze on machines with a massive number of CPUs?
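A minimal sketch of that heuristic (the function name and `cap` parameter are illustrative, not mteb's actual code): use at most half the CPUs, cap at 16, and guard against `os.cpu_count()` returning `None`.

```python
import math
import os

def suggested_num_workers(cap: int = 16) -> int:
    """Hedged sketch: half the CPUs, capped, never below 1."""
    cpus = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(math.floor(cpus / 2), cap))
```

On the 23-core machine from the warning above this would give `min(11, 16) = 11` workers instead of 104.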
I still get occasional freezes when running mieb tasks despite the change to never use more than 16 workers. 🤔 It is mentioned here that multiple workers may not help if the data is already loaded: https://discuss.pytorch.org/t/dataloader-with-num-workers-1-hangs-every-epoch/20323/16 ; I think the data (https://github.com/embeddings-benchmark/mteb/blob/a449b244ed964ba277ef83047d5f53fa588045c0/mteb/evaluation/evaluators/Image/Any2AnyRetrievalEvaluator.py#L44) is already loaded, so it gets copied `num_workers` times, which may lead to freezes as the machine runs out of memory?
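If the dataset really is already in memory, one hedged workaround is `num_workers=0`, which keeps loading in the main process so nothing is copied into worker processes. A small sketch, assuming a plain `torch.utils.data.DataLoader` (the `InMemoryDataset` class here is illustrative, not mteb's actual evaluator dataset):

```python
import torch
from torch.utils.data import DataLoader, Dataset

class InMemoryDataset(Dataset):
    """Illustrative dataset whose items already live in RAM."""

    def __init__(self, tensors):
        self.tensors = tensors  # already loaded, nothing to read from disk

    def __len__(self):
        return len(self.tensors)

    def __getitem__(self, idx):
        return self.tensors[idx]

data = [torch.zeros(3) for _ in range(8)]
# With the data already in memory, each worker process would get its own
# copy of the dataset; num_workers=0 avoids that by loading in-process.
loader = DataLoader(InMemoryDataset(data), batch_size=4, num_workers=0)
batches = list(loader)
```

Whether that is acceptable depends on whether `__getitem__` does any real work (decoding, transforms); if it is just indexing into loaded tensors, extra workers mostly add memory pressure.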
Maybe it is worth checking that
I think the problem is with
`num_workers=math.floor(os.cpu_count() / 2)`.
The run froze for me shortly after.