data loading timing and disk use

The dataset loading code is taking too long. It downloads whole huge datasets (70 GB wiki, etc.) to use just a handful of examples. Setting split="train[0:2000]" does not help, since slicing happens only after the full download.

Suggestions:

Download just the first files of the datasets.
Replace c4 with allenai/c4: load_dataset("allenai/c4", "allenai--c4", data_files={"train": "en/c4-train.00000-of-01024.json.gz"}, split="train")
Replace wiki with wikitext2: load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
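
A rough sketch of how the c4 suggestion could be wrapped so that only the first shard(s) are fetched and then trimmed to a handful of examples. The helper names (c4_train_shards, load_c4_head) and their parameters are hypothetical and not part of the repo; the shard naming pattern is taken from the file name quoted above. The config name "allenai--c4" used above is left optional here, on the assumption that passing data_files alone may be enough on some versions of the datasets library.

```python
def c4_train_shards(num_shards=1, total=1024):
    """Return relative paths of the first `num_shards` English C4 train shards.

    Hypothetical helper: the pattern follows the file name quoted in this
    issue, en/c4-train.00000-of-01024.json.gz.
    """
    if not 1 <= num_shards <= total:
        raise ValueError(f"num_shards must be between 1 and {total}")
    return [
        f"en/c4-train.{i:05d}-of-{total:05d}.json.gz"
        for i in range(num_shards)
    ]


def load_c4_head(num_examples=2000, num_shards=1, config_name=None):
    """Load only the first shard(s) of allenai/c4 and keep the first examples.

    Hypothetical helper. `config_name` defaults to None (an assumption); the
    call quoted in this issue passes "allenai--c4" in that position.
    """
    # Imported lazily so the path helper above works without `datasets`.
    from datasets import load_dataset

    ds = load_dataset(
        "allenai/c4",
        config_name,
        data_files={"train": c4_train_shards(num_shards)},
        split="train",
    )
    # Only the listed shard files are downloaded; trim to what is needed.
    return ds.select(range(min(num_examples, len(ds))))
```

Calling load_c4_head(2000) should then fetch a single shard rather than the whole corpus. The same select-after-load trimming could be applied to the wikitext call, which is already small.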

Infini-AI-Lab / Sequoia

data loading timing and disk use #4