PattanaikL / GeoMol

RuntimeError: Cannot re-initialize CUDA in forked subprocess (solved) #8

ycremar closed this issue 2 years ago

ycremar commented 2 years ago

Just in case anyone else has the same issue: I received the following error during training.

Starting training...
  0%|                                                                         | 0/625 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/grads/e/ethanycx/workspace/GeoMol/train.py", line 73, in <module>
    train_loss = train(model, train_loader, optimizer, device, scheduler, logger if args.verbose else None, epoch, writer)
  File "/home/grads/e/ethanycx/workspace/GeoMol/model/training.py", line 18, in train
    for i, data in tqdm(enumerate(loader), total=len(loader)):
  File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
    return self._process_data(data)
  File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
    data.reraise()
  File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch_geometric/data/dataset.py", line 187, in __getitem__
    data = self.get(self.indices()[idx])
  File "/home/grads/e/ethanycx/workspace/GeoMol/model/featurization.py", line 74, in get
    data.edge_index_dihedral_pairs = get_dihedral_pairs(data.edge_index, data=data)
  File "/home/grads/e/ethanycx/workspace/GeoMol/model/utils.py", line 122, in get_dihedral_pairs
    keep = [t.to(device) for t in keep]
  File "/home/grads/e/ethanycx/workspace/GeoMol/model/utils.py", line 122, in <listcomp>
    keep = [t.to(device) for t in keep]
  File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/cuda/__init__.py", line 163, in _lazy_init
    raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

Versions: torch==1.7.1, torch_geometric==1.7.0

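This seems to be a PyTorch issue with the DataLoader: per the traceback, get_dihedral_pairs calls t.to(device) inside a DataLoader worker, and a forked worker process cannot re-initialize CUDA. I fixed the issue by inserting the following lines at line 18 of train.py (and indenting the later lines accordingly):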

if __name__ == '__main__':
    torch.multiprocessing.set_start_method('spawn', force=True)

and changing the num_workers argument on line 240 of featurization.py to num_workers=1.
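
For anyone who wants to apply the same idea outside train.py, here is a minimal sketch of the start-method switch on its own. It uses the standard library multiprocessing module so that it runs without a GPU. As far as I understand, torch.multiprocessing wraps that module and shares its global start method, so swapping the import should behave the same way, but treat that as an assumption to verify on your setup. The helper name ensure_spawn_start_method is hypothetical and not part of GeoMol or PyTorch.

```python
import multiprocessing as mp  # assumption: torch.multiprocessing can be swapped in here


def ensure_spawn_start_method():
    """Switch the global start method to 'spawn' and return the active method.

    Hypothetical helper, not part of GeoMol or PyTorch.
    """
    # allow_none=True avoids implicitly fixing the platform default
    # before we have had a chance to choose 'spawn' ourselves.
    if mp.get_start_method(allow_none=True) != "spawn":
        # force=True overrides a start method that was already set.
        mp.set_start_method("spawn", force=True)
    return mp.get_start_method()


if __name__ == "__main__":
    # With 'spawn', each child process re-imports the main module, so the
    # call has to sit behind this guard or it would run again in every child.
    print(ensure_spawn_start_method())
```

Worker processes created after this call should start fresh instead of being forked, assuming the DataLoader is not given an explicit multiprocessing_context. Another option that might avoid the error without changing the start method is to keep tensors on the CPU inside the dataset and move each batch to the GPU in the training loop, though I have not tested that with GeoMol.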