embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Option to remove cached dataset files on large runs #984

Open isaac-chung opened 6 days ago

isaac-chung commented 6 days ago

When running all retrieval tasks, a machine can easily run out of disk space, because loading a dataset stores its files in a cache directory (usually ~/.cache/huggingface/datasets). For example:

import mteb

# Loading every retrieval task writes each dataset's files to the HF cache,
# which can fill the disk on large runs.
all_retrieval_tasks = mteb.get_tasks(task_types=["Retrieval"])
for task in all_retrieval_tasks:
    task.load_data()
...

Suggestion

  1. Add an option within evaluate that calls the dataset's cleanup_cache_files method, or
  2. implement __exit__() (calling cleanup_cache_files) on AbsTask so that a task can be used as a context manager; see the sketch below.
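
A minimal sketch of suggestion 2, assuming the loaded data ends up on a dataset attribute holding a datasets object (the attribute name is an assumption; retrieval tasks may store corpus/queries separately):

class AbsTask:
    ...

    def __enter__(self):
        self.load_data()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Remove cache files in the dataset's cache directory
        # (except the one currently in use), per datasets' cleanup_cache_files.
        if getattr(self, "dataset", None) is not None:
            self.dataset.cleanup_cache_files()
        return False  # do not suppress exceptions

which would allow:

for task in all_retrieval_tasks:
    with task:
        ...  # evaluate; cache files are cleaned up on exit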

CC @imenelydiaker (related to the script we have)

KennethEnevoldsen commented 6 days ago

An option in the CLI might simply be to do:

mteb run ... --disable-datasets-caching

Using the following:

from datasets import disable_caching
disable_caching()
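
A standalone sketch of how such a flag could be wired up (this is not mteb's actual CLI code; the flag name is taken from the command above):

import argparse
from datasets import disable_caching

parser = argparse.ArgumentParser(prog="mteb")
parser.add_argument(
    "--disable-datasets-caching",
    action="store_true",
    help="Call datasets.disable_caching() before running tasks.",
)
args = parser.parse_args()

if args.disable_datasets_caching:
    # Transformed datasets are then written to temporary files
    # instead of persisting in the cache directory.
    disable_caching()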

We might additionally add an argument:

eval = mteb.MTEB(...)

eval.run(..., automatically_clean_up_cache=True) # on or off by default? On would be more stable but also more invasive

This would automatically clean up the cache when there is not enough disk space.
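
A rough sketch of the free-space check such an option could perform before deciding to clean up (the helper name, path, and threshold are all hypothetical):

import os
import shutil

def cache_needs_cleanup(cache_dir="~/.cache/huggingface/datasets", min_free_gb=10.0):
    # Report whether free space on the cache's filesystem is below the threshold.
    usage = shutil.disk_usage(os.path.expanduser(cache_dir))
    return usage.free < min_free_gb * 1024**3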

imenelydiaker commented 5 days ago

Would go for an option in the CLI also!