Just in case anyone else has the same issue, I received the following error during training.
Starting training...
0%| | 0/625 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/grads/e/ethanycx/workspace/GeoMol/train.py", line 73, in <module>
train_loss = train(model, train_loader, optimizer, device, scheduler, logger if args.verbose else None, epoch, writer)
File "/home/grads/e/ethanycx/workspace/GeoMol/model/training.py", line 18, in train
for i, data in tqdm(enumerate(loader), total=len(loader)):
File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/tqdm/std.py", line 1178, in __iter__
for obj in iterable:
File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
data = self._next_data()
File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
return self._process_data(data)
File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
data.reraise()
File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/_utils.py", line 428, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
data = fetcher.fetch(index)
File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch_geometric/data/dataset.py", line 187, in __getitem__
data = self.get(self.indices()[idx])
File "/home/grads/e/ethanycx/workspace/GeoMol/model/featurization.py", line 74, in get
data.edge_index_dihedral_pairs = get_dihedral_pairs(data.edge_index, data=data)
File "/home/grads/e/ethanycx/workspace/GeoMol/model/utils.py", line 122, in get_dihedral_pairs
keep = [t.to(device) for t in keep]
File "/home/grads/e/ethanycx/workspace/GeoMol/model/utils.py", line 122, in <listcomp>
keep = [t.to(device) for t in keep]
File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/cuda/__init__.py", line 163, in _lazy_init
raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
Versions: torch==1.7.1, torch_geometric==1.7.0
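This seems to be a PyTorch issue with the DataLoader: the traceback shows get_dihedral_pairs in model/utils.py moving tensors to the CUDA device inside a forked worker process. I fixed the issue by inserting the following lines at line 18 of train.py (and indenting the later lines accordingly):
if __name__ == '__main__':
    torch.multiprocessing.set_start_method('spawn', force=True)
and changing the num_workers argument on line 240 of featurization.py to num_workers=1.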
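
For anyone who wants to see where the guard sits, here is a minimal sketch of the start-method part of the fix. It is not the actual train.py: use_spawn and main are hypothetical names, main is only a stand-in for the script body, and the fallback to the standard multiprocessing module is an assumption added so the snippet runs without PyTorch installed.

```python
try:
    # torch.multiprocessing wraps the standard multiprocessing module
    import torch.multiprocessing as mp
except ImportError:
    # Assumption: fall back to the stdlib so the sketch runs without PyTorch
    import multiprocessing as mp


def use_spawn():
    """Select the 'spawn' start method and return the method now in effect."""
    # force=True overrides a start method that was already set earlier
    mp.set_start_method("spawn", force=True)
    return mp.get_start_method()


def main():
    # Hypothetical stand-in for the body of train.py
    print("start method:", use_spawn())


if __name__ == "__main__":
    # The guard matters with 'spawn': child processes re-import this module,
    # and the guard keeps them from re-running the script body
    main()
```

With 'spawn', DataLoader workers start as fresh interpreters rather than forked copies of the parent, which should be why the CUDA re-initialization error goes away.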