Open TTTTao725 opened 5 months ago
"/work/raw_datasets/lexdk-raw.tar.gz"
"Usage: hplt_to_dolma_format.py f1.jsonl.zst f2.jsonl.zst ... output_directory"
/directory-with-scandi-wiki.tar.gz-inside/
Cannot access gated repo for url https://huggingface.co/api/datasets/NbAiLab/NCC. Access to dataset NbAiLab/NCC is restricted and you are not in the authorized list.
@TTTTao725 The raw datasets are on a separate mount. We should probably standardise how the scripts read the raw data. Some of the scripts download directly from HuggingFace; in that case I don't think we should store the raw data as well.
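As a rough illustration of what standardising could look like, here is a minimal sketch of a shared helper that every conversion script could call to locate its raw input. The environment variable `RAW_DATASETS_DIR` and the function name `resolve_raw_path` are hypothetical and not part of the existing scripts; the default root is only taken from the path quoted in this issue.

```python
import os
from pathlib import Path
from typing import Optional

# Hypothetical environment variable pointing at the raw-data mount.
RAW_ROOT_ENV = "RAW_DATASETS_DIR"
# Assumed default, based on the path quoted in this issue.
DEFAULT_RAW_ROOT = "/work/raw_datasets"


def resolve_raw_path(name: str, root: Optional[str] = None) -> Path:
    """Return the path of a raw dataset file under the shared raw-data root.

    The root is taken from the `root` argument, then from the environment
    variable, then from the assumed default. Raises FileNotFoundError with a
    hint if the file is not there, instead of failing deep inside a script.
    """
    base = Path(root or os.environ.get(RAW_ROOT_ENV, DEFAULT_RAW_ROOT))
    path = base / name
    if not path.exists():
        raise FileNotFoundError(
            f"Raw dataset not found: {path} "
            f"(set {RAW_ROOT_ENV} to the raw data mount)"
        )
    return path
```

A script would then call something like `resolve_raw_path("lexdk-raw.tar.gz")` rather than hard-coding the full path, so a missing mount shows up as one clear error. Scripts that download from HuggingFace would not need this helper at all.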
I'm trying to run the conversion scripts to correct all the timestamps, but I couldn't find the following raw datasets.