EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Security features from the Hugging Face datasets library #1135

Closed lhoestq closed 6 months ago

lhoestq commented 9 months ago

In the next version of datasets we are adding a new argument trust_remote_code to let users decide or not to trust the dataset scripts in python on Hugging Face. The default in the next release will still be trust_remote_code=True, but in the coming months we plan to do a major release and set it to False by default.

Most datasets in lm-evaluation-harness are defined on HF using dataset scripts, and may require passing trust_remote_code=True.

Those datasets could also be converted to e.g. JSONL or Parquet datasets on HF (e.g. like Muennighoff/babi).

Let me know if you have questions on this change and if I can help make a smooth transition. Ideally this should make lm-evaluation-harness safer for people using datasets defined with Python code.

cc @haileyschoelkopf

haileyschoelkopf commented 9 months ago

Thanks, @lhoestq ! Will see about the smoothest way to make this non-invasive for users while still having them acknowledge that they may be running untrusted code.

veekaybee commented 6 months ago

I'm happy to help on this issue if it's something you'd still like, so that lm-evaluation-harness is able to bump its datasets dependency.

I see we include it as a parameter when initializing the constructor for HFLM.

The steps I had in mind:

  1. Set the flag to True
  2. If running a Hugging Face model, prompt for a Y/n confirmation at the CLI after the user triggers a model run.

We mentioned this may be invasive for users and introduces an extra level of friction, since they don't usually need to respond to CLI prompts to continue when using lm-evaluation-harness - perhaps a log line entry is enough here?
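For reference, a minimal sketch of what such a Y/n confirmation could look like; the helper name and its defaults are assumptions for illustration, not existing harness code:

```python
def confirm_trust_remote_code(response: str) -> bool:
    """Hypothetical Y/n prompt handler: empty input defaults to yes,
    following the 'Y/n' convention; only y/yes (any case) accepts,
    anything else declines."""
    return response.strip().lower() in ("y", "yes", "")
```

A log-line alternative would simply emit a warning and continue, avoiding the prompt entirely.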

Let me know what you think @haileyschoelkopf

haileyschoelkopf commented 6 months ago

Hi @veekaybee , this'd be awesome! I agree that since the current behavior is to run datasets from HF datasets, remote code or no, manually setting it to True so we can update datasets wouldn't be a regression in safety. But it'd be nice if there were a way to opt in to trust_remote_code once per env install, and subsequently override to trust_remote_code=True as you're suggesting.

Maybe @lhoestq has suggestions for how to non-invasively incorporate this while not making it too much friction for users?

veekaybee commented 6 months ago

I'm taking a look at how some other libraries handle this:

Perhaps we could do some variation of the first one, where we indicate in the docs that users should set it, and then check the environment variable when a simple evaluation for HFLM is instantiated?

haileyschoelkopf commented 6 months ago

Hmmm... I think for models, allowing it to be passed through --model_args as we do now for vllm and HFLM doesn't seem like a problem for users (though if there's something better that can switch its default to True, like an env var, I'm open to it for sure!). But it seems non-ideal to have a standalone --dataset_args or --trust_remote_code switch for Datasets' trust_remote_code kwarg that people will possibly need to always have on for certain tasks.

lhoestq commented 6 months ago

It would be great to make the list of tasks that require trust_remote_code=True, maybe using code like this?

from datasets import get_dataset_config_names

tasks_that_require_trust_remote_code = []
for task in tasks:
    try:
        # Probing the configs with trust_remote_code=False raises if the
        # dataset is backed by a loading script.
        get_dataset_config_names(task.DATASET_PATH, trust_remote_code=False)
    except ValueError as e:
        if "trust_remote_code=True" in str(e):
            tasks_that_require_trust_remote_code.append(task)

Maybe for all the tasks that we can already trust we can add trust_remote_code: true in their YAML (e.g. the ones maintained by EAI or HF), and require --trust_remote_code for the rest ?

Note that there is also an environment variable HF_DATASETS_TRUST_REMOTE_CODE that you can set to "1" to always trust remote code by default, if it can help. It can also be dynamically set using datasets.config.HF_DATASETS_TRUST_REMOTE_CODE = True
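Putting the explicit kwarg and the env var together, the resolution order could look something like this sketch; the function is hypothetical, and only the HF_DATASETS_TRUST_REMOTE_CODE variable name comes from datasets:

```python
import os

def resolve_trust_remote_code(explicit=None, env=None):
    # Hypothetical precedence: an explicit kwarg wins, then the
    # HF_DATASETS_TRUST_REMOTE_CODE env var ("1" means trust),
    # else fall back to not trusting remote code.
    if explicit is not None:
        return explicit
    if env is None:
        env = os.environ.get("HF_DATASETS_TRUST_REMOTE_CODE")
    return env == "1"
```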

veekaybee commented 6 months ago

I want to summarize so far for my own understanding, let me know if I'm missing or mischaracterizing anything:

Datasets is HuggingFace's library for loading datasets for model training/tuning. Some of these datasets require running extra scripts to fully populate the data.

Datasets

The current behavior for load_dataset, which loads files plus any scripts needed to populate them, is that trust_remote_code is initialized in the object constructor and currently set to True.

There is a change in the new version that allows users to pass this explicitly when calling load_dataset. It still defaults to True, as before; however, in an upcoming major release it will be set to False by default.

LM-Harness

Within datasets used for evaluation, such as squadv2, we call the datasets load_dataset function, example here.

This means when we run evaluation, regardless of what model type we are evaluating, the task is run and the script for the task is run. For example:

lm_eval --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks squadv2 \
    --batch_size 16

will trigger this issue or warning (if squadv2 is a dataset with a script that needs to be executed).

In order to understand when this would be a user issue, we'd like to understand which datasets actually require True, which we can get from dataset metadata, and based on that, we could potentially pass it here, like this:

lm_eval --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6B,trust_remote_code=True \
    --tasks squadv2 \
    --batch_size 16

which would mean it gets picked up from the configuration file.

haileyschoelkopf commented 6 months ago

@veekaybee This is correct!

@lhoestq thank you, that environment variable is quite helpful--exactly what I was hoping might exist!

veekaybee commented 6 months ago

In this case, then, would it make sense to do the following:

In the config:

  1. Add a new field, trust_remote_code, to the TaskConfig dict in the task YAML files, based on the inventory suggested above

Within the Model constructor:

  1. Set the parameter in the constructor to None
  2. Set trust_remote_code here, to be something like:
datasets.config.HF_DATASETS_TRUST_REMOTE_CODE = True
self.trust_remote_code = trust_remote_code if trust_remote_code is not None else os.environ.get("HF_DATASETS_TRUST_REMOTE_CODE")
  3. Add a log warning, "you are trusting remote code for this task/model run", if the code in step 2 gets called

A couple of things I'm not clear on: should we set the environment variable from the TaskConfig directly, or once we instantiate the class? And how does the TaskConfig dict get passed to the HFLM class - can we call it directly?

haileyschoelkopf commented 6 months ago

So to clarify: both transformers and datasets now have a trust_remote_code argument, but that doesn't necessarily mean we have to link the behaviors. We currently support trust_remote_code (and default to False) for HF transformers models, but don't support specifying this option for datasets. Related to this: LM classes are never passed or provided arguments based on TaskConfig classes, as TaskConfigs are only used for task/dataset-related arguments, not model-related arguments.

In the config:

  1. We can add trust_remote_code to dataset_kwargs, which will get passed to load_dataset(), so we don't need a whole extra field!
    dataset_kwargs:
      trust_remote_code: true

    This should behave as anticipated, I think, even in the absence of trust_remote_code in the config--it should then fall back to the user's HF_DATASETS_TRUST_REMOTE_CODE setting.
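The merge of dataset_kwargs into the load_dataset() call could be sketched roughly like this; the helper and config keys are illustrative, not the harness's actual internals:

```python
def build_load_dataset_kwargs(task_config: dict) -> dict:
    # Hypothetical: forward a task's dataset_kwargs (from its YAML)
    # into the arguments passed to datasets.load_dataset(). If the YAML
    # omits trust_remote_code, nothing is passed and datasets falls back
    # to its own default (including HF_DATASETS_TRUST_REMOTE_CODE).
    kwargs = {
        "path": task_config["dataset_path"],
        "name": task_config.get("dataset_name"),
    }
    kwargs.update(task_config.get("dataset_kwargs") or {})
    return kwargs
```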

In the Model: I think it's okay to preserve the current behavior, at least as one option (allowing --model_args trust_remote_code=True/False). I think if we do add a --trust_remote_code CLI flag for HF datasets, it'd make sense for the CLI flag turning on trust_remote_code to also override and pass trust_remote_code=True to the model constructors though.
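As a sketch of that override behavior - the helper is hypothetical, while model_args is the comma-separated string the CLI already accepts:

```python
def apply_trust_remote_code(flag: bool, model_args: str) -> str:
    # Hypothetical: when a global --trust_remote_code CLI flag is set,
    # also append trust_remote_code=True to the model constructor args,
    # unless the user already specified it explicitly in --model_args.
    if flag and "trust_remote_code" not in model_args:
        sep = "," if model_args else ""
        return f"{model_args}{sep}trust_remote_code=True"
    return model_args
```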

lhoestq commented 3 months ago

Hi! FYI, the datasets release that requires explicit trust_remote_code=True for datasets with a Python loading script will be out today. Feel free to ping me if you encounter any issues with your CI or from users!