allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0
306 stars, 40 forks

Lock for writing to the cache files #228

Open eugene-yang opened 1 year ago

eugene-yang commented 1 year ago

I sometimes have multiple processes, or multiple machines, accessing the same storage cluster that hosts the ir_datasets cache directory. If several processes decide to download the same dataset at the same time, they all write to the same file and eventually crash. It would be nice to have a locking mechanism that lets only one process write to a given file and makes the other processes wait.
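A minimal sketch of the kind of lock described above, using only the Python standard library. This is not ir_datasets code: `file_lock`, `download_once`, the `.lock` and `.tmp` suffixes, and the `fetch` callback are all hypothetical names chosen for illustration. It assumes the shared storage honours exclusive file creation (`O_CREAT | O_EXCL`), which depends on the file system and how it is mounted, and it does not clean up stale locks left behind by a crashed process.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def file_lock(target_path, poll_interval=1.0, timeout=None):
    """Hold an exclusive lock on target_path via a sibling '.lock' file (hypothetical helper)."""
    lock_path = f"{target_path}.lock"
    start = time.monotonic()
    while True:
        try:
            # O_CREAT | O_EXCL fails if the lock file already exists,
            # so at most one process succeeds in creating it.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            # Someone else holds the lock: wait, or give up after `timeout` seconds.
            if timeout is not None and time.monotonic() - start > timeout:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll_interval)
    try:
        os.close(fd)
        yield
    finally:
        # Release the lock so that waiting processes can proceed.
        os.remove(lock_path)


def download_once(target_path, fetch):
    """Run fetch(tmp_path) only if target_path is missing; return True if this call fetched it."""
    with file_lock(target_path):
        if os.path.exists(target_path):
            # Another process finished the download while we were waiting.
            return False
        tmp_path = f"{target_path}.tmp"
        fetch(tmp_path)
        # Rename into place so readers never see a partially written file.
        os.replace(tmp_path, target_path)
        return True
```

Processes that lose the race block in `file_lock`, then find the finished file and skip the download. Writing to a temporary path and renaming it means an interrupted download never leaves a half-written file at the final path.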

seanmacavaney commented 1 year ago

Thanks for reporting! I’ll look into it.

bpiwowar commented 1 year ago

Yes, I have the same issue with downloading, but also with other processes such as building the docstore.