Hi! I think the problem is that PyTorch 1.2 is not guaranteed to work with CUDA 11. We tested the code with CUDA 10.0 and CUDA 10.1. Have you tried a lower CUDA version?
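For comparing setups, a small check like the one below may help. It is only a sketch: `report_environment` is a hypothetical helper name, not part of the DeCo repository, and it simply skips the GPU details when PyTorch is not installed. The CUDA version a PyTorch binary was built against can differ from the CUDA version installed system-wide, so it is worth printing both sides before drawing conclusions.

```python
import importlib.util


def report_environment():
    """Collect the PyTorch/CUDA details that matter when comparing setups."""
    info = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not info["torch_installed"]:
        return info

    import torch

    info["torch_version"] = torch.__version__
    # CUDA version the PyTorch binary was built against (None for CPU-only builds).
    info["built_with_cuda"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    info["gpus"] = []
    if info["cuda_available"]:
        for i in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(i)
            # Store (device name, total memory in GiB) for each visible GPU.
            info["gpus"].append((props.name, round(props.total_memory / 2**30, 2)))
    return info


if __name__ == "__main__":
    for key, value in report_environment().items():
        print(f"{key}: {value}")
```

Posting this output together with the traceback makes it easier to tell a version mismatch apart from a plain memory limit.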
Hi @antoalli, since PyTorch 1.9 works well with CUDA 11.4, I updated PyTorch, and the training program now gives me this error:
Traceback (most recent call last):
  File "train_deco.py", line 437, in <module>
    main_worker()  # train DeCo
  File "train_deco.py", line 309, in main_worker
    feat = gl_encoder(partials)
  File "/home/$USER/miniconda3/envs/deco_env2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/$USER/miniconda3/envs/deco_env2/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/$USER/miniconda3/envs/deco_env2/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/$USER/miniconda3/envs/deco_env2/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/$USER/miniconda3/envs/deco_env2/lib/python3.7/site-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/$USER/miniconda3/envs/deco_env2/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/$USER/miniconda3/envs/deco_env2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/share/home/$USER/Deco/models/model_deco.py", line 95, in forward
    g_feat = self.global_encoder(points)  # [B, global_emb]
  File "/home/$USER/miniconda3/envs/deco_env2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/share/home/$USER/Deco/model_utils.py", line 184, in forward
    x = get_graph_feature(x2, k=self.k)
  File "/share/home/$USER/Deco/model_utils.py", line 61, in get_graph_feature
    feature = torch.cat((feature - x, x), dim=3).permute(0, 3, 1, 2)
RuntimeError: CUDA out of memory. Tried to allocate 144.00 MiB (GPU 0; 9.78 GiB total capacity; 8.13 GiB already allocated; 26.56 MiB free; 8.33 GiB reserved in total by PyTorch)
I am using recent GPU cards and should have enough memory, so I am curious whether you have run into any issues like this. Any hints on how to proceed?
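To read the numbers in that last line more easily, the message can be split into its parts. The snippet below is a rough sketch using only the standard library; `parse_oom` and the regular expressions are illustrative and assume the message wording shown in the traceback above, which may differ in other PyTorch versions.

```python
import re

# Error text copied from the traceback above.
MESSAGE = (
    "CUDA out of memory. Tried to allocate 144.00 MiB (GPU 0; 9.78 GiB total "
    "capacity; 8.13 GiB already allocated; 26.56 MiB free; 8.33 GiB reserved "
    "in total by PyTorch)"
)

UNIT_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}


def parse_oom(message):
    """Return the sizes named in the OOM message, converted to MiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (KiB|MiB|GiB)",
        "total": r"([\d.]+) (KiB|MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (KiB|MiB|GiB) already allocated",
        "free": r"([\d.]+) (KiB|MiB|GiB) free",
        "reserved": r"([\d.]+) (KiB|MiB|GiB) reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            sizes[name] = float(match.group(1)) * UNIT_TO_MIB[match.group(2)]
    return sizes


sizes = parse_oom(MESSAGE)
for name, mib in sizes.items():
    print(f"{name:>9}: {mib:8.2f} MiB")
# The allocation fails because the request is larger than what is still free.
print("request fits in free memory:", sizes["requested"] <= sizes["free"])
```

Read this way, the message says that GPU 0 has 9.78 GiB in total, 8.13 GiB of it is already allocated, and the 144 MiB request does not fit into the 26.56 MiB still free.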
Hi! Unfortunately not; we used the described environment (CUDA 10 + PyTorch 1.2). From the PyTorch page (https://pytorch.org/get-started/locally/) I see that PyTorch 1.9 supports CUDA 11.1 and not 11.4, which is part of your problem. Have you tried a conda environment? In a conda environment you should be able to install a different CUDA toolkit version and easily replicate our setup. Alternatively, you can use Docker.

More importantly, regarding the error message you attached: this is a standard CUDA out-of-memory error. How many GPUs are you training on? Which batch size are you using? Note that (as explained in the README) our code ran on 2 x 24 GB Titan RTX GPUs with batch size 30, with parallelism implemented through PyTorch DataParallel.

For any questions, feel free to reach me here or via email.
Antonio
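As a rough illustration of why the number of GPUs matters here: DataParallel splits each batch along the first dimension across the available devices. The sketch below only mimics that split with plain arithmetic, assuming it behaves like `torch.chunk`; `per_gpu_batch` is a hypothetical helper and not part of PyTorch or the DeCo code.

```python
import math


def per_gpu_batch(batch_size, num_gpus):
    """Approximate how many samples each GPU receives under DataParallel.

    Assumes a torch.chunk-style split: equal chunks of ceil(B / n) samples,
    with a smaller final chunk when B is not divisible by n.
    """
    chunk = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        sizes.append(min(chunk, remaining))
        remaining -= chunk
    return sizes


print(per_gpu_batch(30, 2))  # the README setup: [15, 15]
print(per_gpu_batch(20, 2))  # [10, 10]
print(per_gpu_batch(20, 1))  # [20]
```

With batch size 30 on 2 GPUs, each 24 GB card sees 15 samples per step. A smaller card, or a single GPU taking the whole batch, has far less headroom per sample.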
Hi! Since it was a standard memory problem, I tried a lower batch size of 20 and it worked. Do you think this will affect the results significantly?
I don't think the batch size will affect the results. Still, I suggest you use the same PyTorch and CUDA versions as specified in the README.
Definitely, thanks a lot @antoalli!
Hi antoalli,
I get the following error when I try to train the DeCo model using
python train_deco.py --data_root data/shapenetcore_partanno_segmentation_benchmark_v0 --exp_name deco1280_512_wFrame512_sn13 --config configs/deco_config.json --parallel -P1 1280 -P2 512 --raw_weight 1
Environment:
System: Ubuntu 20.04.3
Python: 3.7.9
PyTorch: 1.2.0
CUDA Version: 11.4
gcc version 9.3.0
I would highly appreciate your help with this error. Let me know if you need any other information.