Closed by euclaise 7 months ago
Thanks!
Hi, thanks for the fix. However, when I run the pretraining script with the updated command, the following error is raised:
Resolving data files: 100%|███████████████████| 88/88 [00:02<00:00, 43.91it/s]
Error executing job with overrides: ['name=cram_24h', 'arch=crammed-bert', 'train=bert-o4', 'data=pile-readymade', 'budget=24']
Traceback (most recent call last):
  File "/localdisk/home/Work/Repositories/cramming/pretrain.py", line 196, in launch
    cramming.utils.main_launcher(cfg, main_training_process, job_name="pretraining")
  File "/localdisk/home/Work/Repositories/cramming/cramming/utils.py", line 54, in main_launcher
    metrics = main_fn(cfg, setup)
  File "/localdisk/home/Work/Repositories/cramming/pretrain.py", line 21, in main_training_process
    dataset, tokenizer = cramming.load_pretraining_corpus(cfg.data, cfg.impl)
  File "/localdisk/home/Work/Repositories/cramming/cramming/data/pretraining_preparation.py", line 40, in load_pretraining_corpus
    return _load_from_hub(cfg_data, data_path)
  File "/localdisk/home/Work/Repositories/cramming/cramming/data/pretraining_preparation.py", line 461, in _load_from_hub
    tokenized_dataset = datasets.load_dataset(cfg_data.hf_location, split="train", streaming=cfg_data.streaming, cache_dir=data_path)["train"]
  File "/home/.local/lib/python3.10/site-packages/torch/utils/data/dataset.py", line 60, in __getitem__
    raise NotImplementedError("Subclasses of Dataset should implement __getitem__.")
NotImplementedError: Subclasses of Dataset should implement __getitem__.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Have you encountered similar issues?
Thank you
You attempt to load the train configuration of the Hugging Face dataset, but train is a split rather than a configuration for pile-readymade, so it complains. I've changed it to request the train split instead of a train configuration.
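To make the configuration/split distinction concrete, here is a minimal sketch of how the call could be assembled. The helper names build_load_kwargs and load_train_split are hypothetical and not part of cramming; only the datasets.load_dataset keyword arguments (path, name, split, streaming, cache_dir) come from the library itself. Note also that once split="train" is passed, load_dataset already returns that split, so indexing the result with ["train"] again, as the call in the traceback above appears to do, is likely what ends up in __getitem__ when streaming is enabled.

```python
def build_load_kwargs(hf_location, split="train", config_name=None,
                      streaming=False, cache_dir=None):
    """Assemble keyword arguments for datasets.load_dataset.

    Hypothetical helper for illustration. A configuration goes in `name`,
    a split goes in `split`; the two are not interchangeable.
    """
    kwargs = {
        "path": hf_location,
        "split": split,          # e.g. "train": selects a split
        "streaming": streaming,
        "cache_dir": cache_dir,
    }
    if config_name is not None:
        kwargs["name"] = config_name  # only set when the dataset defines configurations
    return kwargs


def load_train_split(hf_location, **options):
    """Load the requested split directly (hypothetical wrapper)."""
    import datasets  # imported lazily so build_load_kwargs works without the library

    # With `split` given, the return value is the split itself,
    # so there is no further ["train"] lookup on the result.
    return datasets.load_dataset(**build_load_kwargs(hf_location, **options))
```

Usage would look like load_train_split(cfg_data.hf_location, streaming=cfg_data.streaming, cache_dir=data_path), assuming the config fields shown in the traceback.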