embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Option to remove cached dataset files on large runs #984

Open isaac-chung opened 6 days ago

isaac-chung commented 6 days ago

When running all retrieval tasks, a machine can easily run out of disk space, because loading a dataset stores its files in a cache directory (usually ~/.cache/huggingface/datasets). For example:

import mteb

# Loading every retrieval task writes each dataset's files to the HF cache,
# which can fill the disk on large runs.
all_retrieval_tasks = mteb.get_tasks(task_types=["Retrieval"])
for task in all_retrieval_tasks:
    task.load_data()
...

Suggestion

  1. Add an option within evaluate that calls the dataset's cleanup_cache_files method, or
  2. implement __exit__() (calling cleanup_cache_files) on AbsTask so that a task can be used as a context manager; see the sketch below.
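
A minimal sketch of suggestion 2, assuming the loaded data ends up on a dataset attribute holding a datasets object (the attribute name is an assumption; retrieval tasks may store corpus/queries separately):

class AbsTask:
    ...

    def __enter__(self):
        self.load_data()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Remove cache files in the dataset's cache directory
        # (except the one currently in use), per datasets' cleanup_cache_files.
        if getattr(self, "dataset", None) is not None:
            self.dataset.cleanup_cache_files()
        return False  # do not suppress exceptions

which would allow:

for task in all_retrieval_tasks:
    with task:
        ...  # evaluate; cache files are cleaned up on exit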

CC @imenelydiaker (related to the script we have)

KennethEnevoldsen commented 6 days ago

An option in the CLI might simply be to do:

mteb run ... --disable-datasets-caching

Using the following:

from datasets import disable_caching
disable_caching()
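
A standalone sketch of how such a flag could be wired up (this is not mteb's actual CLI code; the flag name is taken from the command above):

import argparse
from datasets import disable_caching

parser = argparse.ArgumentParser(prog="mteb")
parser.add_argument(
    "--disable-datasets-caching",
    action="store_true",
    help="Call datasets.disable_caching() before running tasks.",
)
args = parser.parse_args()

if args.disable_datasets_caching:
    # Transformed datasets are then written to temporary files
    # instead of persisting in the cache directory.
    disable_caching()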

We might additionally add an argument:

eval = mteb.MTEB(...)

eval.run(..., automatically_clean_up_cache=True) # on or off by default? On would be more stable but also more invasive

This would automatically clean up the cache when there is not enough disk space.
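
A rough sketch of the free-space check such an option could perform before deciding to clean up (the helper name, path, and threshold are all hypothetical):

import os
import shutil

def cache_needs_cleanup(cache_dir="~/.cache/huggingface/datasets", min_free_gb=10.0):
    # Report whether free space on the cache's filesystem is below the threshold.
    usage = shutil.disk_usage(os.path.expanduser(cache_dir))
    return usage.free < min_free_gb * 1024**3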

imenelydiaker commented 5 days ago

Would go for an option in the CLI also!