Open KennethEnevoldsen opened 1 month ago
Adding a bit more here:
The current default format for the retrieval task (assumed by the load_data function) does not allow us to load the dataset using datasets.load_dataset(..., trust_remote_code=False), which introduces a safety concern and causes bugs (e.g. #1363).
In #1090 we introduced an alternative format, which is both more efficient and can be loaded using trust_remote_code=False.
(See #1308 for an example of a dataset PR that originally used the old format and then moved to the new one.)
Currently, we have to reimplement the load_data() function (e.g. here) whenever we add a retrieval dataset using the recommended format (#1090). We should change the default to match the recommended format.
For older datasets, we can update them on Hugging Face and/or add a custom load_data function.
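
To make the duplication concrete, here is a rough sketch of the kind of logic each custom load_data() ends up repeating. It is not the actual mteb implementation: the function name build_retrieval_split and the column names (_id, title, text, query-id, corpus-id, score) are assumptions for illustration, and the rows would in practice come from datasets.load_dataset(..., trust_remote_code=False) rather than from in-memory lists.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Tuple

Corpus = Dict[str, Dict[str, str]]
Queries = Dict[str, str]
Qrels = Dict[str, Dict[str, int]]


def build_retrieval_split(
    corpus_rows: Iterable[Mapping],
    query_rows: Iterable[Mapping],
    qrels_rows: Iterable[Mapping],
) -> Tuple[Corpus, Queries, Qrels]:
    """Hypothetical helper: turn plain rows into corpus/queries/qrels dicts.

    Column names are assumed, not taken from #1090.
    """
    # Corpus: document id -> {"title", "text"}; title is optional.
    corpus = {
        str(row["_id"]): {"title": row.get("title", ""), "text": row["text"]}
        for row in corpus_rows
    }
    # Queries: query id -> query text.
    queries = {str(row["_id"]): row["text"] for row in query_rows}
    # Relevance judgements: query id -> {document id -> score}.
    relevant_docs: Dict[str, Dict[str, int]] = defaultdict(dict)
    for row in qrels_rows:
        relevant_docs[str(row["query-id"])][str(row["corpus-id"])] = int(row["score"])
    return corpus, queries, dict(relevant_docs)


# Toy rows standing in for what a dataset split would yield.
corpus, queries, relevant_docs = build_retrieval_split(
    corpus_rows=[{"_id": "d1", "title": "T", "text": "some document"}],
    query_rows=[{"_id": "q1", "text": "some query"}],
    qrels_rows=[{"query-id": "q1", "corpus-id": "d1", "score": 1}],
)
```

If the default load_data handled this mapping itself, new retrieval datasets in the recommended format would not need their own override.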