Open ankush13r opened 2 months ago
Theoretically, it should be possible to use HF_HUB_OFFLINE=1 and load from the local cache or a local path (if it matches the dataset checkpoint dir), since the base class uses datasets.load_dataset() here: https://github.com/bigcode-project/bigcode-evaluation-harness/blob/f0b81a9d079289881bd42f509811d42fe73e58cf/bigcode_eval/base.py#L28
But I couldn't find any way to add a path for the dataset. As you can see here https://github.com/search?q=repo%3Abigcode-project%2Fbigcode-evaluation-harness%20DATASET_PATH&type=code, the dataset path is a constant defined directly in the code.
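For context, each task pins its dataset identifier as a class attribute. A simplified sketch of the pattern (class bodies abbreviated; not the actual harness code):

```python
class Task:
    """Simplified sketch of the harness's base-class pattern."""
    DATASET_PATH = None  # hub identifier, fixed per task subclass

    def __init__(self):
        # the real base class calls datasets.load_dataset(self.DATASET_PATH)
        # here; there is no CLI flag that overrides the constant
        self.dataset_name = self.DATASET_PATH

class MBPP(Task):
    DATASET_PATH = "mbpp"  # constant defined directly in the code
```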
Those are the checkpoint dirs from the Hugging Face Hub, so clone the dataset repo to that exact path locally and the load_dataset function will try the local copy first.
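As a rough sketch of the resolution order being relied on here (illustrative only, not the actual datasets internals):

```python
import os

def resolve_dataset(dataset_path):
    """Illustrative sketch: load_dataset() prefers a matching local
    directory before contacting the Hub, which is why cloning the
    dataset repo to ./<DATASET_PATH> works offline."""
    local = os.path.join(os.getcwd(), dataset_path)
    if os.path.isdir(local):
        return local  # local clone found; no network needed
    if os.environ.get("HF_HUB_OFFLINE") == "1":
        raise ConnectionError(
            f"offline mode is enabled and no local copy of {dataset_path!r} exists"
        )
    return dataset_path  # would be fetched from the Hub
```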
Hello, thanks for your response. I tried what you said, but it hasn't worked for me. Here is an example of the command I used to run the evaluation. I have also downloaded the dataset to /home/user/dataset.
```shell
export HF_DATASETS_CACHE=/home/user/dataset
export HF_HUB_OFFLINE=1

accelerate launch main.py \
  --model /path/to/the/model \
  --tasks mbpp \
  --max_length_generation 1500 \
  --temperature 1.2 \
  --do_sample True \
  --n_samples 100 \
  --batch_size 10 \
  --allow_code_execution \
  --save_generations
```
The error I'm getting is:

```
AttributeError: 'MBPP' object has no attribute 'dataset'
/gpfs/home/bsc/bigcode-evaluation-harness/bigcode_eval/base.py:30: UserWarning: Loading the dataset failed with Couldn't reach the Hugging Face Hub for dataset 'mbpp': Offline mode is enabled.. This task will use a locally downloaded dataset, not from the HF hub. This is expected behavior for the DS-1000 benchmark but not for other benchmarks!
```
That seems to be an issue with the actual task in this case. MBPP used to have a vanity dataset name on the Hub, so there is no org prefix. So maybe it works if you put the /mbpp/ dataset folder at the same level as main.py.
The error is actually misleading, since it doesn't do anything afterwards; it is just a warning aimed at the specific DS-1000 benchmark and simply means the dataset couldn't be loaded. It sort of suppresses the real error message, which would be more helpful.
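The suppression pattern described here looks roughly like this (a schematic reconstruction, not the exact harness code):

```python
import warnings

def load_with_fallback(loader):
    """Schematic: the original exception is converted into a warning,
    so the root cause (e.g. offline mode) is easy to miss; the later
    AttributeError comes from the dataset never being set."""
    try:
        return loader()
    except Exception as e:
        warnings.warn(
            f"Loading the dataset failed with {e}. This task will use "
            "a locally downloaded dataset, not from the HF hub."
        )
        return None  # attribute access on the task then fails later

def offline_loader():
    # stands in for load_dataset() failing in offline mode
    raise ConnectionError("Couldn't reach the Hugging Face Hub for dataset 'mbpp'")
```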
Thanks, it worked. I think it will work for all kinds of tasks with the datasets on the local machine. I would like to know if there is a way to change the path for these datasets, since we need to save them in a different folder.
Maybe symlinks? But I am not too familiar with how the load_dataset() function resolves these. Perhaps there is a way to use the HF Hub cache instead, as that can be pointed anywhere.
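A symlink-based workaround might look like this (a hedged sketch; the storage path is an assumption):

```python
import os

def expose_dataset(storage_dir, expected_name):
    """Make a dataset stored elsewhere visible under the repo-relative
    name the harness expects (e.g. ./mbpp), via a symlink."""
    if not os.path.lexists(expected_name):
        os.symlink(storage_dir, expected_name, target_is_directory=True)

# e.g. expose_dataset("/home/user/dataset/mbpp", "mbpp")
```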
Perfect, I'll figure it out. Thanks again!
Hello, is there currently a way to evaluate a model using a dataset from a local path, instead of fetching it directly from Hugging Face? We're working in a cluster environment without internet access, and we need to evaluate the model locally.
If this feature isn't available yet, it would be a great enhancement to consider. A solution that accepts a local dataset would allow evaluations to run offline. A potential approach could be adding a new script argument, such as --datasets-path, so the dataset can be loaded directly from the specified location.
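A hypothetical sketch of what the proposed flag could look like (everything here, including --datasets-path and the helper function, is an assumption, not existing harness code):

```python
import argparse
import os
from typing import Optional

def dataset_location(dataset_path: str, local_root: Optional[str]) -> str:
    """Prefer <local_root>/<dataset_path> when --datasets-path is given;
    otherwise fall back to the hub identifier."""
    if local_root is not None:
        return os.path.join(local_root, dataset_path)
    return dataset_path

parser = argparse.ArgumentParser()
parser.add_argument(
    "--datasets-path",
    default=None,
    help="local directory containing pre-downloaded dataset repos",
)
args = parser.parse_args(["--datasets-path", "/home/user/dataset"])

# a task with DATASET_PATH == "mbpp" would then load from
# /home/user/dataset/mbpp instead of the hub
```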