bourdoiscatie opened this issue 1 day ago
Hi @bourdoiscatie!
This looks like a `TimeoutError` that occurs when the download of a dataset takes too long. MLSUM is a few GB. Without knowing what hardware and network you ran this on, I can only suggest checking your internet connectivity and rerunning the task.
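If it helps narrow things down, something along these lines can test the download outside of mteb (the `mlsum` path and `fr` config are assumptions on my part; check the task's metadata for the exact Hub path it pulls):

```python
from datasets import DownloadConfig, load_dataset

# Repro sketch: pull MLSUM (fr) directly, with retries, to see whether the
# timeout comes from the download itself rather than from mteb.
# The Hub path/config below are assumptions; see the task metadata.
config = DownloadConfig(max_retries=5)
dataset = load_dataset("mlsum", "fr", download_config=config, trust_remote_code=True)
print(dataset)
```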
Here's my successful run on a Linux machine:
Maybe @imenelydiaker or @KennethEnevoldsen have seen this error before?
Never seen this error before. It looks like an internet issue, but it could be anything related to the network you're using. As @isaac-chung mentioned, MLSUM is quite a big dataset that takes some time to load.
Thank you for your feedback @isaac-chung @imenelydiaker! For my evaluation I'm using an A100 on a remote server to which I was given access for this purpose. Unfortunately, I don't have control over the server's internet connection, so I'll probably download this dataset on my side and then upload it to the server. Is it enough to put it in HF's cache, or do I need to put it in a particular place so that the MTEB library can find it later?
@bourdoiscatie for reference, I'm using an A10 on a remote server, and the dataset was downloaded into the default HF cache location.
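If you do pre-download it elsewhere, a common pattern looks roughly like this (the path is just the `datasets` default, and the env var is the standard `HF_DATASETS_CACHE` override):

```python
# On a machine with a reliable connection: populate the local HF cache.
from datasets import load_dataset

load_dataset("mlsum", "fr", trust_remote_code=True)  # path/config assumed

# Then copy the cache directory (by default ~/.cache/huggingface/datasets)
# to the server and point the library at it before running mteb:
#   export HF_DATASETS_CACHE=/path/on/server/huggingface/datasets
```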
Thanks for the information, I should be able to manage with all that 🤗 I'm closing the issue.
For those who have the same problem: it seems to be due to the `datasets` library since version 3.x, see https://github.com/huggingface/datasets/issues/7175. Downgrading the library (e.g. `pip install "datasets<3.0"`) seems to be a temporary workaround.
Hi! I'm Quentin from HF :)
Unfortunately we had to limit our support of script-based datasets for obvious security reasons, and apparently that made some issues related to relying on bad hosts resurface :/ Have you considered uploading the data to HF instead (ideally in Parquet, to avoid using a dataset script)?
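For reference, a minimal sketch of what that could look like (`your-org/mlsum-fr` is a placeholder repo id, and the initial load still needs a `datasets` version that runs the script):

```python
from datasets import load_dataset

# Load the script-based dataset once, then push it to the Hub.
# push_to_hub writes Parquet shards, so later loads need no script.
ds = load_dataset("mlsum", "fr", trust_remote_code=True)  # path/config assumed
ds.push_to_hub("your-org/mlsum-fr")  # placeholder repo id
```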
Looking at the code, I realize that the massive train split is not even used in practice: https://github.com/embeddings-benchmark/mteb/blob/bac8bd7212a90fb814d5c92e4d39ee12e92e5fe7/mteb/tasks/Clustering/multilingual/MLSUMClusteringP2P.py#L80 Wouldn't it be more appropriate to load only the validation and test splits to speed things up? And, as Quentin points out, possibly host these two splits on the Hub.
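Split selection in `datasets` would look something like this (a sketch; note that with a script-based builder the train files may still be downloaded during preparation, which is another argument for re-hosting just these two splits):

```python
from datasets import load_dataset

# Load only the splits the task actually evaluates on.
validation, test = load_dataset(
    "mlsum", "fr",  # path/config assumed
    split=["validation", "test"],
    trust_remote_code=True,
)
```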
Thanks @lhoestq and @bourdoiscatie for pointing this out.
The best solution (imo) is to re-upload the dataset to HF using Parquet; the `validation` and `test` splits are also generated using a script, so if we want to avoid this error again, we'd better re-upload to a supported format.
We're working on it and will let you know when it's fixed, thank you. 🙏
Hi!
I've just trained an embedding model in French and would like to test it on MTEB_FR. I used code along these lines (a reconstructed sketch; the model name is a placeholder for my checkpoint):
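```python
# Reconstructed sketch: the exact original snippet is not preserved.
# "my-org/my-french-model" is a placeholder for my own checkpoint.
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("my-org/my-french-model")
tasks = mteb.get_tasks(languages=["fra"])  # French tasks, incl. MLSUMClusteringP2P
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results")
```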
Everything ran fine until `MLSUMClusteringP2P`, where I got a `TimeoutError`. I then ran the code on each individual task and everything passed, with the exception of `MLSUMClusteringP2P` but also `MLSUMClusteringS2S`, where I received the same error. This suggests to me that there may be a problem with these two datasets, but I can't say what it is. I haven't found any other issue mentioning this problem. Note that I'm using version 1.16.1 of the library.
If you can enlighten me on this point, I'd be very grateful 🙏