The dataset loading code takes too long. It downloads entire huge datasets (70G wiki, etc.) just to use a handful of examples. Setting split="train[0:2000]" does not help, since slicing happens only after the full download.
Suggestions:
Download just the first files of the datasets.
Replace c4 with allenai/c4: load_dataset("allenai/c4", "allenai--c4", data_files={"train": "en/c4-train.00000-of-01024.json.gz"}, split="train")
Replace wiki with wikitext2: load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
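
The two replacements could be wrapped roughly as follows. This is only a sketch: the helper names small_dataset_kwargs and load_small and the n_examples parameter are made up for illustration and are not part of the existing loading code. The load_dataset arguments are copied from the suggestions above; the "allenai--c4" config name may not be accepted by every version of the datasets library.

```python
def small_dataset_kwargs(name):
    """Return (args, kwargs) for datasets.load_dataset that avoid a full download.

    Hypothetical helper; the dataset keys "c4" and "wiki" are illustrative.
    """
    if name == "c4":
        # Request only the first training shard of allenai/c4.
        # Assumption: the "allenai--c4" config name is valid for the
        # installed datasets version (it is taken from the suggestion above).
        return ("allenai/c4", "allenai--c4"), {
            "data_files": {"train": "en/c4-train.00000-of-01024.json.gz"},
            "split": "train",
        }
    if name == "wiki":
        # wikitext-2 is small, so the whole train split is cheap to fetch.
        return ("wikitext", "wikitext-2-raw-v1"), {"split": "train"}
    raise ValueError(f"unknown dataset: {name!r}")


def load_small(name, n_examples=2000):
    """Load the reduced dataset and keep at most n_examples rows."""
    # Imported here so the argument helper above works without the library.
    from datasets import load_dataset

    args, kwargs = small_dataset_kwargs(name)
    ds = load_dataset(*args, **kwargs)
    # Slicing after loading is cheap now, because only a small file was fetched.
    return ds.select(range(min(n_examples, len(ds))))
```

Calling load_small("c4") or load_small("wiki") would then stand in for the current full-dataset calls, assuming a handful of examples from the first shard is representative enough for the use case.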