THUDM / LongBench

[ACL 2024] LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
MIT License

Loading local datasets with split='test' #65

Open yichen0104 opened 3 months ago

yichen0104 commented 3 months ago

I’m trying to evaluate a new model with LongBench and would like to load the datasets stored locally (downloaded and unzipped directly from HuggingFace). But whenever I read the data with split="test" in pred.py (say we are reading xxx.jsonl within the loop, and the line is modified to data = load_dataset("json", data_files="/some/dir/xxx.jsonl", split="test")), it raises ValueError: Unknown split "test". Should be one of ['train']. Is there any pre-processing I should perform on the downloaded data? Thanks in advance.

bys0318 commented 3 months ago

If you have downloaded the dataset files locally, you can load them via:

import json
data = [json.loads(line) for line in open("/some/dir/xxx.jsonl", "r", encoding="utf-8")]
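
For reference, the same idea can be wrapped in a small helper that closes the file and skips blank lines. This is only a sketch: the name load_jsonl and the sample records are hypothetical and not part of pred.py, and the demo writes a temporary file instead of reading a real LongBench file.

```python
import json
import os
import tempfile


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Hypothetical sample records, written to a temporary file for illustration only.
sample = [
    {"input": "q1", "answers": ["a1"]},
    {"input": "q2", "answers": ["a2"]},
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    for rec in sample:
        tmp.write(json.dumps(rec, ensure_ascii=False) + "\n")
    tmp_path = tmp.name

data = load_jsonl(tmp_path)  # in pred.py this would be the real /some/dir/xxx.jsonl path
os.remove(tmp_path)
```

If you would rather keep load_dataset, passing a dict such as data_files={"test": "/some/dir/xxx.jsonl"} together with split="test" should also avoid the error, since a bare path is assigned to the default train split.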