JonasGeiping closed this 7 months ago
Is this still an issue?
The problem might be related to differences between versions of the datasets package. The fix from the PR is only necessary for newer releases, and it causes a problem for older ones.
I originally hit the same error message as @euclaise, and after applying the proposed change I instead get this error message.
My workaround for this problem is to change the line from
tokenized_dataset = datasets.load_dataset(cfg_data.hf_location, "train", streaming=cfg_data.streaming, cache_dir=data_path)["train"]
to
tokenized_dataset = datasets.load_dataset(cfg_data.hf_location, "default", streaming=cfg_data.streaming, cache_dir=data_path)["train"]
This solves the problem; however, I'm not familiar with the datasets package, so I'm not sure it is the right fix.
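Since the config name that works seems to depend on the installed datasets version, one option is to try "default" first and fall back to "train". The following is only a rough sketch, not the fix that ended up in the repository: `load_train_split` is a hypothetical helper, the loader is passed in as an argument so the snippet runs without the datasets package, and it assumes that an unknown config name is reported as a `ValueError`.

```python
def load_train_split(load_dataset, location, config_names=("default", "train"), **kwargs):
    """Return the "train" split using the first config name that loads.

    `load_dataset` is expected to behave like `datasets.load_dataset`.
    Assumption: an unknown config name is reported as a ValueError.
    """
    last_error = None
    for name in config_names:
        try:
            # Same call shape as the line quoted above, with the config name varied.
            return load_dataset(location, name, **kwargs)["train"]
        except ValueError as err:
            # Remember the failure and try the next candidate name.
            last_error = err
    raise ValueError(
        f"none of the config names {config_names!r} worked for {location!r}"
    ) from last_error


# Hypothetical usage with the names from the line quoted above:
# tokenized_dataset = load_train_split(
#     datasets.load_dataset,
#     cfg_data.hf_location,
#     streaming=cfg_data.streaming,
#     cache_dir=data_path,
# )
```

Trying "default" first matches the workaround above; swapping the order in `config_names` would prefer the older behaviour instead.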
This worked for me! For anyone interested, I am running Python 3.10, datasets 2.18.0, Ubuntu 22.0.
This fix is now included in commit 2875a3b3ad0dbd5fab0438d12ccba61b512873e5, thanks!
Have you encountered similar issues?
Thank you
Originally posted by @shiwenqin in https://github.com/JonasGeiping/cramming/issues/43#issuecomment-1966722702