JonasGeiping / cramming

Cramming the training of a (BERT-type) language model into limited compute.
MIT License

From PR 43 #44

Closed JonasGeiping closed 7 months ago

JonasGeiping commented 8 months ago

Thanks for the fix; however, when I run the pretraining script with the updated command, the following error is raised:

Resolving data files: 100%|███████████████████| 88/88 [00:02<00:00, 43.91it/s]
Error executing job with overrides: ['name=cram_24h', 'arch=crammed-bert', 'train=bert-o4', 'data=pile-readymade', 'budget=24']
Traceback (most recent call last):
  File "/localdisk/home/Work/Repositories/cramming/pretrain.py", line 196, in launch
    cramming.utils.main_launcher(cfg, main_training_process, job_name="pretraining")
  File "/localdisk/home/Work/Repositories/cramming/cramming/utils.py", line 54, in main_launcher
    metrics = main_fn(cfg, setup)
  File "/localdisk/home/Work/Repositories/cramming/pretrain.py", line 21, in main_training_process
    dataset, tokenizer = cramming.load_pretraining_corpus(cfg.data, cfg.impl)
  File "/localdisk/home/Work/Repositories/cramming/cramming/data/pretraining_preparation.py", line 40, in load_pretraining_corpus
    return _load_from_hub(cfg_data, data_path)
  File "/localdisk/home/Work/Repositories/cramming/cramming/data/pretraining_preparation.py", line 461, in _load_from_hub
    tokenized_dataset = datasets.load_dataset(cfg_data.hf_location, split="train", streaming=cfg_data.streaming, cache_dir=data_path)["train"]
  File "/home/.local/lib/python3.10/site-packages/torch/utils/data/dataset.py", line 60, in __getitem__
    raise NotImplementedError("Subclasses of Dataset should implement __getitem__.")
NotImplementedError: Subclasses of Dataset should implement __getitem__.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Have you encountered similar issues?

Thank you

Originally posted by @shiwenqin in https://github.com/JonasGeiping/cramming/issues/43#issuecomment-1966722702

JonasGeiping commented 8 months ago

Is this still an issue?

JonasGeiping commented 8 months ago

The problem might be related to version differences in the datasets package. The fix from the PR is only necessary for newer releases, and breaks older ones.
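Since the behavior depends on the installed release, a first step is to check which datasets version is actually in the environment. A minimal sketch (the exact cutoff version between the two behaviors is not established in this thread):

```python
import importlib.metadata

# Report the installed `datasets` version so the two call styles can be
# matched to the release actually present in the environment.
try:
    print(importlib.metadata.version("datasets"))
except importlib.metadata.PackageNotFoundError:
    print("datasets is not installed")
```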

shiwenqin commented 8 months ago

I originally faced the same error message as @euclaise, and after applying the proposed change, I see this error message instead.

The work-around I made for this problem was to change the line from

tokenized_dataset = datasets.load_dataset(cfg_data.hf_location, "train", streaming=cfg_data.streaming, cache_dir=data_path)["train"]

to

tokenized_dataset = datasets.load_dataset(cfg_data.hf_location, "default", streaming=cfg_data.streaming, cache_dir=data_path)["train"]

This solves the problem; however, I'm not familiar with the datasets package, so I'm not sure it is the right fix.
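One way to make the loading call tolerant of both behaviors is to index into the result only when it is actually a dict-like object keyed by split. This is a sketch, not the repo's actual code; `load_fn` is a hypothetical stand-in for `datasets.load_dataset` so the pattern can be demonstrated without network access:

```python
def load_train_split(load_fn, hf_location, streaming=False, cache_dir=None):
    """Return the 'train' split whether `load_fn` returns a dict-like
    object keyed by split name or a single dataset object."""
    result = load_fn(hf_location, streaming=streaming, cache_dir=cache_dir)
    # Index into the result only when it actually exposes a 'train' key.
    if hasattr(result, "keys") and "train" in result.keys():
        return result["train"]
    return result

# Demo with stand-in loaders (no network or `datasets` install needed):
dict_style = lambda loc, streaming, cache_dir: {"train": ["doc a", "doc b"]}
flat_style = lambda loc, streaming, cache_dir: ["doc a", "doc b"]
print(load_train_split(dict_style, "some/hub-dataset"))  # ['doc a', 'doc b']
print(load_train_split(flat_style, "some/hub-dataset"))  # ['doc a', 'doc b']
```

A wrapper like this avoids having to branch on the datasets version string directly, at the cost of hiding which behavior the installed release actually has.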

keeeeenw commented 8 months ago

> I originally faced the same error message as @euclaise, and after applying the proposed change, I see this error message instead.
>
> The work-around I made for this problem was to change the line from
>
> tokenized_dataset = datasets.load_dataset(cfg_data.hf_location, "train", streaming=cfg_data.streaming, cache_dir=data_path)["train"]
>
> to
>
> tokenized_dataset = datasets.load_dataset(cfg_data.hf_location, "default", streaming=cfg_data.streaming, cache_dir=data_path)["train"]
>
> This solves the problem; however, I'm not familiar with the datasets package, so I'm not sure it is the right fix.

This worked for me! For anyone interested, I am running Python 3.10, datasets 2.18.0, Ubuntu 22.0.

JonasGeiping commented 7 months ago

This fix is now included in commit 2875a3b3ad0dbd5fab0438d12ccba61b512873e5, thanks!