JonasGeiping / cramming

Cramming the training of a (BERT-type) language model into limited compute.
MIT License

TypeError: _new_shared() got an unexpected keyword argument 'device' #8

Closed wccccp closed 1 year ago

wccccp commented 1 year ago

Error executing job with overrides: []
Traceback (most recent call last):
  File "/tmp/pycharm_project_41/cramming-main/pretrain.py", line 153, in launch
    cramming.utils.main_launcher(cfg, main_training_process, job_name="pretraining")
  File "/tmp/pycharm_project_41/cramming-main/cramming/utils.py", line 64, in main_launcher
    main_fn(cfg, setup)
  File "/tmp/pycharm_project_41/cramming-main/pretrain.py", line 45, in main_training_process
    for step, batch in iterable_data:
  File "/tmp/pycharm_project_41/cramming-main/cramming/backend/utils.py", line 263, in __next__
    batch = next(self.dataset_iterator)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.8/dist-packages/torch/_utils.py", line 457, in reraise
    raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    return self.collate_fn(data)
  File "/usr/local/lib/python3.8/dist-packages/transformers/data/data_collator.py", line 42, in __call__
    return self.torch_call(features)
  File "/tmp/pycharm_project_41/cramming-main/cramming/backend/utils.py", line 221, in torch_call
    storage = elem._storage()._new_shared(len(examples) * 8 * elem.shape[0], device=elem.device)  # 8 for byte->long
TypeError: _new_shared() got an unexpected keyword argument 'device'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Process finished with exit code 1

JonasGeiping commented 1 year ago

oof that's a good one, can you tell me your pytorch+huggingface transformers package versions and python version?

This error comes from an older version of https://github.com/pytorch/pytorch/blob/523d4f2562580a6cd9888cfbc9b9ae8ed2a61ed1/torch/storage.py#L215 in the pytorch source, where `_new_shared` does not yet accept a `device` argument.
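
For anyone answering the version question above, here is a minimal sketch of one way to collect the requested versions using only the standard library. The helper name `collect_versions` is hypothetical and not part of cramming. It assumes the packages were installed under the distribution names `torch` and `transformers`.

```python
import platform
from importlib import metadata


def collect_versions(packages=("torch", "transformers")):
    """Return a dict mapping 'python' and each package name to a version string."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            # Reads the installed distribution's metadata without importing the package.
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```

Pasting the printed lines into the issue should be enough to tell which torch release is in use.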

JonasGeiping commented 1 year ago

It appears you are on a torch version less than 1.12. I've clarified this in the installation instructions, thanks for bringing it up!

Feel free to re-open this issue if this does not solve your problem.
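
The 1.12 threshold mentioned above can be checked before launching training. The following is a sketch under assumptions, not code from the cramming repository: the names `parse_release`, `torch_is_new_enough`, and `check_installed_torch` are hypothetical, and the check only looks at the leading major and minor numbers of the version string. Comparing integer tuples avoids the string-comparison trap where "1.9" would sort after "1.12".

```python
import re
from importlib import metadata

MIN_TORCH = (1, 12)  # threshold named in this thread


def parse_release(version):
    """Extract (major, minor) from a version string such as '1.11.0+cu113'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def torch_is_new_enough(version, minimum=MIN_TORCH):
    """Return True if the given version string is at least `minimum`."""
    return parse_release(version) >= minimum


def check_installed_torch():
    """Describe whether the installed torch meets the threshold."""
    try:
        installed = metadata.version("torch")
    except metadata.PackageNotFoundError:
        return "torch is not installed"
    if torch_is_new_enough(installed):
        return f"torch {installed} meets the 1.12 threshold"
    return f"torch {installed} is older than 1.12; consider upgrading"


if __name__ == "__main__":
    print(check_installed_torch())
```

If the check reports an older release, upgrading torch in the same environment is the fix suggested in this thread.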