@westonli-thu I think that this issue is due to the new release of the datasets library, v2.20.0, released on 13th June 2024 (2 days ago).
> Remove default trust_remote_code=True by @lhoestq in https://github.com/huggingface/datasets/pull/6954: datasets with a python loading script now require passing trust_remote_code=True to be used

This makes it mandatory to pass the flag when loading any dataset that has a custom loading script.
Release changelog: https://github.com/huggingface/datasets/releases/tag/2.20.0 PR: https://github.com/huggingface/datasets/pull/6954
Thanks for the issue, I will be opening a bug fix PR for this soon.
@westonli-thu in case you need a temporary workaround while waiting for @henilp105's PR, what has worked for me is to set the HF_DATASETS_TRUST_REMOTE_CODE environment variable to 1 (ref).
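For example (a minimal sketch of that workaround; the variable has to be set before the datasets library is imported, since it reads its config at import time):

```python
import os

# restore the old trust-by-default behaviour process-wide;
# must happen before `datasets` is imported anywhere
os.environ["HF_DATASETS_TRUST_REMOTE_CODE"] = "1"

import mteb  # subsequent dataset loads will trust remote code
```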
We should probably allow users to specify trust_remote_code=True within the CLI and for MTEB(...). Wondering whether it should default to False (@Muennighoff?). In terms of safety, that is the best choice, but most users would set the flag.
> We should probably allow users to specify trust_remote_code=True within the CLI and for MTEB(...). Wondering whether it should default to False (@Muennighoff?). In terms of safety, that is the best choice, but most users would set the flag.
I'd not make it a kwarg accessible to users but just set it to True wherever we load datasets, as we review every dataset that gets added here, which should be our safety check? (Due to the pinned revision it also can't be made unsafe behind the scenes, afaict.) 🤔
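A minimal sketch of that option (the _load_dataset helper is hypothetical, not an existing mteb function):

```python
import datasets

def _load_dataset(*args, **kwargs):
    # single internal entry point for all dataset loading in the
    # library; users never see the flag, and every dataset added
    # to mteb is reviewed, which acts as the safety check
    kwargs.setdefault("trust_remote_code", True)
    return datasets.load_dataset(*args, **kwargs)
```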
For the datasets library nobody manually reviews datasets & their scripts uploaded to Hugging Face, hence the True default is more risky there.
Ahh, yes, that is indeed correct. So simply reviewing the dataset when specifying True should solve it.
Hi! I'm Quentin from HF Datasets.
I'd suggest simply converting your datasets to a format that doesn't require trust_remote_code, like Parquet. Most features on HF are disabled for datasets with trust_remote_code for security reasons anyway.
e.g. I have opened a PR https://huggingface.co/datasets/mteb/amazon_counterfactual/discussions/2 to convert mteb/amazon_counterfactual to Parquet using the command

datasets-cli convert_to_parquet --trust_remote_code mteb/amazon_counterfactual
Thanks @lhoestq. I think that is probably the best solution. However, we don't control all of the datasets, and some we can't redistribute, so for those we will still need the exception (but that will be a small subset).
> e.g. I have opened a PR https://huggingface.co/datasets/mteb/amazon_counterfactual/discussions/2 to convert mteb/amazon_counterfactual to Parquet using the command
I have merged this in
@KennethEnevoldsen We have about 232 tasks (about 109 unique datasets) out of 546 which need trust_remote_code (found by checking for the file naming that requires remote code execution). What would be an appropriate method to patch this?
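For reference, a sketch of such a check (assuming a repo needs remote code exactly when it ships a Python loading script; the use of huggingface_hub here is an assumption, not necessarily how the count above was produced):

```python
from huggingface_hub import HfApi

api = HfApi()

def needs_trust_remote_code(repo_id: str) -> bool:
    # a dataset repo requires trust_remote_code when it contains a
    # python loading script instead of plain data files
    files = api.list_repo_files(repo_id, repo_type="dataset")
    return any(f.endswith(".py") for f in files)
```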
@henilp105 I just tried:
datasets-cli convert_to_parquet --trust_remote_code strombergnlp/danfever
This creates a branch which you can then use in the future to:
>>> from datasets import load_dataset
>>> ds = load_dataset("strombergnlp/danfever", trust_remote_code=False, download_mode='force_redownload')
Downloading data: 100%|███████████████████████| 722k/722k [00:01<00:00, 701kB/s]
Generating train split: 100%|████| 6407/6407 [00:00<00:00, 194381.88 examples/s]
So we actually don't even have to accept the branches. The only thing this requires is downloading all the files and converting them.
Hmm, I believe the second time it ran without error because you've already trusted this dataset script once using the CLI command. If you clear your cache at ~/.cache/huggingface/modules it will ask you again to trust_remote_code.
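E.g. (a one-off sketch; the path is the default cache location, assuming HF_HOME has not been customized):

```python
import os
import shutil

# drop the cached dataset scripts so datasets asks about
# trust_remote_code again on the next load
shutil.rmtree(os.path.expanduser("~/.cache/huggingface/modules"))
```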
Ahh damn, I thought it would just default to the parquet branch if available. Is there any reason why we wouldn't want that?
edit: In our case, of course, it also does not guarantee the revision, so a merge is required. edit: A solution to that seems to be:

ds = load_dataset("strombergnlp/danfever", trust_remote_code=False, download_mode='force_redownload', revision="d478a3c6e40b497e1f7d2bedef54825658bc7de6")
# the revision is the newly created branch
edit: It seems that for retrieval datasets (e.g. RAR-b/alphanli) this approach fails due to multiple configs. edit: The automatic conversion script seems far from solving many of the datasets above.
From the comments above, it might be best to add "trust_remote_code": True for all datasets that are not easily converted, but discourage it for future additions, e.g., using a test. We can then come back and fix/re-upload the older sources.
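Such a test could look roughly like this (a sketch: get_tasks and the metadata.dataset dict follow the mteb codebase, but the exact attribute names and the grandfathered set are assumptions):

```python
import mteb

# tasks that already rely on remote code and are tolerated for now
GRANDFATHERED = {"DanFever"}  # plus the other existing tasks

def test_no_new_trust_remote_code():
    for task in mteb.get_tasks():
        if task.metadata.name in GRANDFATHERED:
            continue
        assert not task.metadata.dataset.get("trust_remote_code", False), (
            f"{task.metadata.name}: new datasets should be uploaded in a "
            "format (e.g. Parquet) that does not require remote code"
        )
```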
> edit: It seems that for retrieval datasets (e.g. RAR-b/alphanli) this approach fails due to multiple configs.
It looks like it only converted one config smh :/ Did you get an error message?
On my side I haven't had issues with the CLI converting a dataset with multiple configs to Parquet; maybe @albertvillanova knows more?
> From the comments above, it might be best to add "trust_remote_code": True for all datasets that are not easily converted, but discourage it for future additions, e.g., using a test.
Sounds good to me!
Hello, I just ran the CLI convert_to_parquet for "RAR-b/alphanli" (with multiple configs) with success: https://huggingface.co/datasets/RAR-b/alphanli/discussions/2
huggingface-cli login
datasets-cli convert_to_parquet RAR-b/alphanli --trust_remote_code
You have all the information about the command in the docs: https://huggingface.co/docs/datasets/cli#convert-to-parquet
For security reasons, the best solution is to convert the datasets to Parquet (then there is no need to pass trust_remote_code because no code needs to be executed locally).
If the datasets are third-party repositories, you should not blindly trust them. I would recommend passing trust_remote_code=True only if you also pass a revision parameter to load_dataset pointing to a specific commit: the repo may not contain malicious code now, but it could in the future.
Alternatively, you could open a pull request to convert to Parquet in the third-party repository, and pass the PR reference as the revision parameter. For example, you can reference the PR I opened in "RAR-b/alphanli" even if it is not merged: https://huggingface.co/datasets/RAR-b/alphanli/discussions/2
ds = load_dataset("RAR-b/alphanli", revision="refs/pr/2")
ds = load_dataset("RAR-b/alphanli", revision="c7b0f6cd")
The fact that we can use the PR revision is great; it makes everything more stable on our end without requiring re-uploads or actions from the maintainers.
> It looks like it only converted one config smh :/ Did you get an error message?
Yup, I got an error (for some datasets it did push, though).
> Hello, I just ran the CLI convert_to_parquet for "RAR-b/alphanli" (with multiple configs) with success
Hmm, odd. It might have been an issue with the version.
I will add a fix for this in PR #974 due to the failing tests.
Hi, I just ran the mteb eval today and ran into this issue:
It did not occur in the past few days. Is there anything I should modify?