Closed Vipitis closed 5 months ago
if I'm not mistaken, the only dataset that has arbitrary code in the harness now and could need this flag in the future is MultiPL-E?
It's a bit different from models where users can pass any model they want, whereas we have a limited number of carefully selected tasks so I'm not sure we need the flag for all tasks (we can add it just for MultiPL-E when it's a requirement if the authors don't update their loading script by then).
I haven't checked myself. And moving all datasets to parquet export and no builder script is the better idea anyway. Perhaps adding a notice to the new task template might be an easier "fix".
If there is just a single case it can by bypassed using the env flag much easier.
I additionally came across this with the task I am developing. But moving to parquet is on my to-do list either way. Feel free to close this issue and reject the draft PR.
sounds good, I also talked to MultiPL-E authors and they will remove the python code in the dataset so it wouldn't require the flag.
Following https://github.com/huggingface/datasets/issues/6400 and the 2.16 release of
datasets
you now requiretrust_remote_code=True
for datasets with a custom builder config.if I read the code correctly, the
--trust_remote_code
flag gets passed to model and tokenizer already. It does not reach theload_dataset
method here.That seems an easy fix and I can open a PR next week. Addressing it for the finetuning examples would be more edits.