autonomousvision / carla_garage

[ICCV'23] Hidden Biases of End-to-End Driving Models

RuntimeError: Trying to resize storage that is not resizable #12

Closed · HiPatil closed this issue 10 months ago

HiPatil commented 10 months ago

Hi, I downloaded the dataset using the provided script and was trying to reproduce the results by training the model. During training, the error below occurred, and I am not sure what is causing it. I would really appreciate any help.

Root Cause (first observed failure):
[0]:
  time      : 2023-10-12_15:09:48
  host      : scc-204.scc.bu.edu
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 1566171)
  error_file: /scratch/1977161.1.academic-gpu/torchelastic_fzaatik1/42353467_ccys4nyt/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/projectnb/rlvn/students/hipatil/miniconda3/envs/garage/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "train.py", line 625, in main
      trainer.train(epoch)
    File "train.py", line 884, in train
      for i, data in enumerate(tqdm(self.dataloader_train, disable=self.rank != 0, ascii=True, desc=f"Epoch: {epoch}")):
    File "/projectnb/rlvn/students/hipatil/miniconda3/envs/garage/lib/python3.7/site-packages/tqdm/std.py", line 1183, in __iter__
      for obj in iterable:
    File "/projectnb/rlvn/students/hipatil/miniconda3/envs/garage/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
      data = self._next_data()
    File "/projectnb/rlvn/students/hipatil/miniconda3/envs/garage/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1356, in _next_data
      return self._process_data(data)
    File "/projectnb/rlvn/students/hipatil/miniconda3/envs/garage/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
      data.reraise()
    File "/projectnb/rlvn/students/hipatil/miniconda3/envs/garage/lib/python3.7/site-packages/torch/_utils.py", line 461, in reraise
      raise exception
  RuntimeError: Caught RuntimeError in DataLoader worker process 5.
  Original Traceback (most recent call last):
    File "/projectnb/rlvn/students/hipatil/miniconda3/envs/garage/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
      data = fetcher.fetch(index)
    File "/projectnb/rlvn/students/hipatil/miniconda3/envs/garage/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
      return self.collate_fn(data)
    File "/projectnb/rlvn/students/hipatil/miniconda3/envs/garage/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 160, in default_collate
      return elem_type({key: default_collate([d[key] for d in batch]) for key in elem})
    File "/projectnb/rlvn/students/hipatil/miniconda3/envs/garage/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 160, in <dictcomp>
      return elem_type({key: default_collate([d[key] for d in batch]) for key in elem})
    File "/projectnb/rlvn/students/hipatil/miniconda3/envs/garage/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 149, in default_collate
      return default_collate([torch.as_tensor(b) for b in batch])
    File "/projectnb/rlvn/students/hipatil/miniconda3/envs/garage/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 140, in default_collate
      out = elem.new(storage).resize_(len(batch), *list(elem.size()))
  RuntimeError: Trying to resize storage that is not resizable
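For context, this RuntimeError is raised inside `default_collate` when the samples in a batch do not all have the same tensor shape and the dataloader runs with worker processes: the shared-memory storage is sized from the batch total and the subsequent `resize_()` cannot grow it. A minimal, self-contained sketch that reproduces the same error (the dataset and the `"lidar"` key are hypothetical, not carla_garage code):

```python
import torch
from torch.utils.data import Dataset, DataLoader


class MismatchedDataset(Dataset):
    """Hypothetical dataset whose samples do not all share the same shape,
    e.g. because one file on disk is truncated or corrupted."""

    def __len__(self):
        return 4

    def __getitem__(self, idx):
        # The first sample is larger than the rest, so the shared-memory
        # storage that default_collate allocates from the batch total is
        # too small and the subsequent resize_() call fails.
        size = 10 if idx == 0 else 5
        return {"lidar": torch.zeros(size)}


if __name__ == "__main__":
    # num_workers > 0 is what triggers the shared-memory code path.
    loader = DataLoader(MismatchedDataset(), batch_size=2, num_workers=2, shuffle=False)
    for batch in loader:  # RuntimeError is raised in the worker and re-raised here
        pass
```

If that is what is happening here, it would point at a truncated or corrupted file in the downloaded dataset rather than at the training code itself.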
Kait0 commented 10 months ago

Hm, this seems to be related to the PyTorch dataloader. I have never seen this before. Perhaps you can set --cpu_cores 0 to turn off threading and see if that is the problem. Does it happen immediately or only after a while of training?
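For concreteness, a sketch of what that could look like (the torchrun launcher and process count are assumptions based on the torchelastic paths in the log; only --cpu_cores 0 is the suggested change):

```bash
# Hypothetical launch command: keep your existing flags and GPU count,
# and only add --cpu_cores 0 to disable the dataloader threading.
torchrun --nproc_per_node=4 train.py --cpu_cores 0 <your existing flags>
```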

HiPatil commented 10 months ago

It happens after a while of training. I will set --cpu_cores 0 and let you know.

Kait0 commented 10 months ago

If it happens randomly after multiple epochs or so, you can also just try to resume the training with --continue_epoch 1 and --load_file /path/to/latest_model.pth. I do observe that trainings can sometimes crash for random reasons and need to be resumed; compute clusters can be unstable.
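For example, a resume could look roughly like this (the launcher and the checkpoint path are placeholders; --continue_epoch 1 and --load_file are the flags mentioned above):

```bash
# Hypothetical resume command: point --load_file at the latest checkpoint
# written by the crashed run and keep all other flags identical.
torchrun --nproc_per_node=4 train.py \
  --continue_epoch 1 \
  --load_file /path/to/latest_model.pth \
  <your existing flags>
```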