EvolvingLMMs-Lab / lmms-eval

Accelerating the development of large multimodal models (LMMs) with lmms-eval
https://lmms-lab.github.io/
1.33k stars 91 forks

How to use a local dataset (avoid downloading from Hugging Face) #179

Open xsgldhy opened 1 month ago

xsgldhy commented 1 month ago

Thanks for your contribution. I have already downloaded the VideoChatGPT dataset to a directory with huggingface-cli download lmms-lab/VideoChatGPT --repo-type dataset --local-dir ., exported HF_HOME to point at that directory, and changed the dataset_path in _default_template_yaml under the videochatgpt task dir to that directory as well. But the program still raises an error when building the task object (return TASK_REGISTRY[task_name](model_name=model_name)). Here is my logging info:

[screenshots of the error traceback]
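For reference, the offline setup described above boils down to something like the following; the local path is a placeholder, and the environment variables must be set before the datasets library is imported:

```python
# Sketch of the offline setup from the comment above; the path is an
# assumption, and the variables must be set before `datasets` is imported.
import os

local_dir = "/data/VideoChatGPT"  # where huggingface-cli placed the snapshot
os.environ["HF_HOME"] = local_dir
os.environ["HF_DATASETS_OFFLINE"] = "1"  # forbid network access in `datasets`

# _default_template_yaml in the videochatgpt task dir would then carry:
# dataset_path: /data/VideoChatGPT
```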

xsgldhy commented 1 month ago

I also set token=False inside the _default_template_yaml

Luodian commented 1 month ago

I think you can explore some HF settings, like export HF_DATASETS_OFFLINE=1, which forces the datasets library to load from the local cache instead of the Hub.

Here's the actual download & load process. If you can mock the offline load in your local environment with a few lines of code, you should also be able to load it successfully with lmms-eval.

Hope you can resolve this and provide your insights to us!

https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/3d4884ae16ff3189a5c1dd6bac44265d05ef6a97/lmms_eval/api/task.py#L853
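One way to mock that load path locally is a small helper that short-circuits to a local folder when one exists. This helper is illustrative only and not part of lmms-eval:

```python
import os


def resolve_dataset_path(dataset_path: str) -> str:
    """Return a usable dataset location without touching the network.

    If dataset_path is an existing local directory (e.g. a snapshot
    fetched earlier with huggingface-cli), use it directly; otherwise
    treat it as a Hub repo id that downstream code may download.
    Illustrative helper, not part of lmms-eval.
    """
    if os.path.isdir(dataset_path):
        return os.path.abspath(dataset_path)
    return dataset_path  # treated as a Hub repo id downstream
```

With a check like this in front of the download step, passing a local path would be enough to skip snapshot_download entirely.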

xsgldhy commented 1 month ago

Thanks for your response! After commenting out this line of code, https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/3d4884ae16ff3189a5c1dd6bac44265d05ef6a97/lmms_eval/api/task.py#L779 and manually managing the cache_path, I can successfully load the local dataset. It appears the code always executes snapshot_download and attempts to download the dataset. I would suggest adding another option (via an environment variable or a CLI argument) to load the data from a local folder.
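The option being suggested could look roughly like this; the variable name LMMS_EVAL_USE_LOCAL_DATASET and the helper are hypothetical, not an existing lmms-eval feature:

```python
import os


def get_cache_path(dataset_path: str) -> str:
    """Decide where task.py should read the dataset from.

    LMMS_EVAL_USE_LOCAL_DATASET is a hypothetical flag: when set,
    dataset_path is taken as a ready-to-use local folder and
    snapshot_download is never called; otherwise the snapshot is
    downloaded from the Hub as before.
    """
    if os.environ.get("LMMS_EVAL_USE_LOCAL_DATASET"):
        return dataset_path
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=dataset_path, repo_type="dataset")
```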

ecoli-hit commented 1 day ago

Thanks for your response! After commenting out this line of code,

https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/3d4884ae16ff3189a5c1dd6bac44265d05ef6a97/lmms_eval/api/task.py#L779

and manually managing the cache_path, I can successfully load the local dataset. It appears the code always executes snapshot_download and attempts to download the dataset. I would suggest adding another option (via an environment variable or a CLI argument) to load the data from a local folder.

Hi, I am facing the same problem, but I am not clear on how to point the cache_path at the local directory. Could you share your change? Thanks a lot!
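For what it's worth, the manual change described above is likely of this shape; the exact surrounding code in task.py may differ, and the path is a placeholder:

```python
# Around lmms_eval/api/task.py:779, the download call is replaced by a
# hard-coded local folder (the path below is an assumption):
# cache_path = snapshot_download(repo_id=self.DATASET_PATH, repo_type="dataset")
cache_path = "/data/VideoChatGPT"  # points at the pre-downloaded snapshot
```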