munib94 closed this issue 3 years ago.
Can you try the unmodified version first? It already supports single-GPU training. Let me know if you get the same error.
I've pinpointed the error to the input depth. With an input depth of 5 or less, the network trains as expected; if I change it to 8, I get the error. When I trained with an input depth of 8 and multiple GPUs enabled, I got a different error:
>auto::operator()(int)->auto: block: [1367,0,0], thread: [31,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered (insert_events at /opt/conda/conda-bld/pytorch_1556653183467/work/c10/cuda/CUDACachingAllocator.cpp:564)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f1ad3ebcdc5 in /home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet++/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14792 (0x7f1ad0d14792 in /home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet++/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x50 (0x7f1ad3eac640 in /home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet++/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x3067fb (0x7f1ad14337fb in /home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet++/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #4: <unknown function> + 0x13ff1b (0x7f1af9c9cf1b in /home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet++/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x3bf384 (0x7f1af9f1c384 in /home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet++/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x3bf3d1 (0x7f1af9f1c3d1 in /home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet++/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #23: __libc_start_main + 0xf5 (0x7f1b0eec4445 in /lib64/libc.so.6)
Aborted (core dumped)
When I tried training again with an input depth of 8 and a single GPU, I got a THCudaCheck FAIL error message in addition to the error above:
>auto::operator()(int)->auto: block: [1387,0,0], thread: [94,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1556653183467/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [1387,0,0], thread: [95,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1556653183467/work/aten/src/THC/THCCachingHostAllocator.cpp line=265 error=59 : device-side assert triggered
Traceback (most recent call last):
File "./train.py", line 115, in <module>
trainer.train()
File "../../tasks/semantic/modules/trainer.py", line 238, in train
show_scans=self.ARCH["train"]["show_scans"])
File "../../tasks/semantic/modules/trainer.py", line 319, in train_epoch
output = model(in_vol, proj_mask)
File "/home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet++/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "../../tasks/semantic/modules/segmentator.py", line 149, in forward
y, skips = self.backbone(x)
File "/home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet++/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "../..//backbones/darknet.py", line 171, in forward
x, skips, os = self.run_layer(x, self.conv1, skips, os)
File "../..//backbones/darknet.py", line 154, in run_layer
y = layer(x)
File "/home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet++/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet++/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 338, in forward
self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
Hi, depth 5 is a constant for now. It means using range, x, y, z, and remission as inputs. Leave the trainer code untouched, as it already supports single- and multi-GPU training.
I added three more input channels to account for the extra data I am providing. Does this mean the network does not support input depths greater than 5?
No, it should be fine if you modify the code properly. Where did you modify it? If you share your modifications with me, maybe I can point out where you're missing something or doing something wrong.
Sorry for the delay, I was gathering my edits into a single file. I sent you an email at your uni-bonn.de email address listed on your GitHub profile.
Not sure if you reviewed my code, but do you think it is a CUDA/cuDNN compatibility issue? Which versions did you use?
Hi, I have not looked at the email yet; I will do so now and let you know. If there is anything to be learned for others, we will post it here.
You're on the right track. You need to modify this line
From:
proj = torch.cat([proj_range.unsqueeze(0).clone(),
proj_xyz.clone().permute(2, 0, 1),
proj_remission.unsqueeze(0).clone()])
to:
proj = torch.cat([proj_range.unsqueeze(0).clone(),
proj_xyz.clone().permute(2, 0, 1),
proj_remission.unsqueeze(0).clone(),
proj_extra_param.clone().permute(2,0,1)])
This includes the extra parameters in the volume that goes into the CNN.
You will also need to modify the normalization parameters so that there are 8 of them.
You should add print(proj.shape) after that to check that the shape is [8, H, W].
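For reference, here is a minimal, self-contained sketch of that assembly with dummy tensors (the H, W values and the [H, W, 3] layout of proj_extra_param are assumptions for illustration, not the repo's actual data):

import torch

# Dummy projections standing in for the real ones built in parser.py
H, W = 64, 1024
proj_range = torch.rand(H, W)            # [H, W]
proj_xyz = torch.rand(H, W, 3)           # [H, W, 3]
proj_remission = torch.rand(H, W)        # [H, W]
proj_extra_param = torch.rand(H, W, 3)   # [H, W, 3], the 3 extra channels

proj = torch.cat([proj_range.unsqueeze(0).clone(),             # 1 channel
                  proj_xyz.clone().permute(2, 0, 1),           # 3 channels
                  proj_remission.unsqueeze(0).clone(),         # 1 channel
                  proj_extra_param.clone().permute(2, 0, 1)])  # 3 channels

print(proj.shape)  # torch.Size([8, 64, 1024])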
Thanks! For the normalization parameters, are you referring to the batchnorm.py file? I noticed that the SyncBatchNorm3D class applies batch normalization over a 5d input. Is this what I need to modify to work with an 8d input?
Hi,
No, I mean this line.
It uses the means and stds of the input data to normalize the inputs. The 5 values are at the end of each config yaml file; you will need to append your 3 values to each of them, otherwise that line will fail. Beyond that, your code should run fine, but I haven't tested it myself; it just looks properly modified in the right places.
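To make that concrete, here is a small sketch of the normalization step once the lists hold 8 values instead of 5. The names sensor_img_means/sensor_img_stds are assumed to match the parser's convention, and the values are obvious placeholders; compute the real statistics from your own data and put them in the yaml. Note that with only 5 values, the per-channel subtraction in this sketch cannot broadcast against an 8-channel volume, which is why the 3 extra entries are needed.

import torch

sensor_img_means = torch.zeros(8)  # replace with the 8 real per-channel means
sensor_img_stds = torch.ones(8)    # replace with the 8 real per-channel stds

proj = torch.rand(8, 64, 1024)     # [C, H, W] volume from the previous step
proj = (proj - sensor_img_means[:, None, None]) / sensor_img_stds[:, None, None]
print(proj.shape)                  # still torch.Size([8, 64, 1024])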
It works! Thank you! Just a final question, I noticed that the means and stds only go up to two decimal places. Is there a significant difference in accuracy/performance if more significant figures are used?
Update: I ran into an error after the first epoch:
Best mean iou in training set so far, save model!
********************************************************************************
Traceback (most recent call last):
File "./train.py", line 115, in <module>
trainer.train()
File "../../tasks/semantic/modules/trainer.py", line 261, in train
save_scans=self.ARCH["train"]["save_scans"])
File "../../tasks/semantic/modules/trainer.py", line 408, in validate
for i, (in_vol, proj_mask, proj_labels, _, path_seq, path_name, _, _, _, _, _, _, _, _, _, _, _) in enumerate(val_loader):
File "/home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
return self._process_next_batch(batch)
File "/home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
AttributeError: Traceback (most recent call last):
File "/home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
samples = collate_fn([dataset[i] for i in batch_indices])
File "../..//tasks/semantic/dataset/kitti/parser.py", line 224, in __getitem__
proj_extra_param.clone().permute(2, 0, 1)])
AttributeError: 'list' object has no attribute 'clone'
Not sure why this happened. Any ideas?
The message is pretty clear there: you're trying to run the PyTorch clone method, which only applies to torch tensors, on a list object. Make sure that your data is in the proper format (a torch tensor) and not a Python list.
I'm confused because I already had the following code to convert from a NumPy array to a Torch tensor in the parser.py file:
if self.extra_param:
    proj_extra_param = torch.from_numpy(scan.proj_extra_param).clone()
else:
    proj_extra_param = []
The network has no problem training for the first epoch, but when it tries to train the second epoch, the error appears.
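For anyone hitting the same thing: the traceback above comes from the validation loader, so one likely cause is that the validation dataset ends up in the else branch and hands a plain Python list to torch.cat. Below is a minimal sketch of one way to keep the types consistent; only scan.proj_extra_param and self.extra_param come from the snippet above, everything else (names, shapes, dummy data) is illustrative. It is also worth checking that the validation split is constructed with the same extra_param flag as the training split.

import numpy as np
import torch

def make_extra_channels(proj_extra_param_np, use_extra_param):
    # Return the extra channels as a [3, H, W] tensor, or None when they are
    # disabled, so torch.cat never sees a plain Python list.
    if not use_extra_param:
        return None
    return torch.from_numpy(proj_extra_param_np).clone().permute(2, 0, 1)

# Illustrative usage with dummy data
H, W = 64, 1024
extra = make_extra_channels(np.random.rand(H, W, 3).astype(np.float32), True)

parts = [torch.rand(1, H, W),   # stands in for proj_range.unsqueeze(0)
         torch.rand(3, H, W),   # stands in for proj_xyz.permute(2, 0, 1)
         torch.rand(1, H, W)]   # stands in for proj_remission.unsqueeze(0)
if extra is not None:
    parts.append(extra)
proj = torch.cat(parts)
print(proj.shape)  # torch.Size([8, 64, 1024]) when the extra channels are on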
I am closing this issue since there does not seem to be much activity here, or the problem was resolved. If you still have problems, please re-open the issue.
I'm having this error when training. My system configurations are listed below.
OS: CentOS Linux 7
PyTorch version: 1.1.0 (installed using conda)
TensorFlow-gpu version: 1.9.0
Python version: 3.6.8
CUDA/cuDNN version: 9.0/7.0.5
GPU: Nvidia GeForce GTX 1080
I modified the network so that it has an input depth of 8 instead of 5 and noticed that this issue only appears when the depth is greater than 5. I can't figure out how to resolve the error, though.
Any ideas on how to fix this?
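Since the cuDNN error only shows up after changing the input depth, a quick CPU-only sanity check like the sketch below can confirm that a first convolution really accepts 8 channels before digging into CUDA-side problems. The layer sizes here are illustrative, not taken from the repo's darknet backbone.

import torch
import torch.nn as nn

# The first convolution must use in_channels equal to the chosen input depth.
first_conv = nn.Conv2d(in_channels=8, out_channels=32, kernel_size=3,
                       stride=1, padding=1, bias=False)

dummy = torch.rand(1, 8, 64, 1024)  # [batch, channels, H, W] dummy range image
out = first_conv(dummy)
print(out.shape)                    # torch.Size([1, 32, 64, 1024])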