DeCo: Denoise and Contrast for Category Agnostic Shape Completion

CUDNN_STATUS_MAPPING_ERROR while training the DeCo model #4

Closed: nama1arpit closed this issue 2 years ago

nama1arpit commented 3 years ago

Hi antoalli,

I get the following error when trying to train the DeCo model with `python train_deco.py --data_root data/shapenetcore_partanno_segmentation_benchmark_v0 --exp_name deco1280_512_wFrame512_sn13 --config configs/deco_config.json --parallel -P1 1280 -P2 512 --raw_weight 1`:

Arguments: Namespace(batch_size=30, checkpoints_dir='experiments_deco', class_choice='Airplane,Bag,Cap,Car,Chair,Guitar,Lamp,Laptop,Motorbike,Mug,Pistol,Skateboard,Table', config='configs/deco_config.json', context_point_num=512, crop_point_num=512, data_root='data/shapenetcore_partanno_segmentation_benchmark_v0', epochs=241, exp_name='deco1280_512_wFrame512_sn13_0', fps_centroids=False, it_test=10, manualSeed=7926, models_dir='experiments_deco/deco1280_512_wFrame512_sn13_0/models', num_holes=1, parallel=True, pool1_points=1280, pool2_points=512, raw_weight=1.0, restart_from='', save_dir='experiments_deco/deco1280_512_wFrame512_sn13_0', vis_dir='experiments_deco/deco1280_512_wFrame512_sn13_0/train_visz', workers=12)
Configuration: {'global_encoder': {'nearest_neighboors': 24, 'latent_dim': 1024}, 'generator': {'latent_dim': 256, 'nearest_neighboors': 16, 'pool1_nn': 16, 'pool2_nn': 6, 'scoring_fun': 'tanh'}, 'completion_trainer': {'checkpoint_global_enco': 'pretext_weights/global_contrastive/cropRotScaleJitt_4pos_1602780721.pth', 'checkpoint_local_enco': 'pretext_weights/gpd_local_denoising/gpd_residual_nn8_mse_ep100.pth', 'data_root': '/home/antonioa/data/shapenetcore_partanno_segmentation_benchmark_v0', 'num_points': 2048, 'enco_lr': 0.001, 'enco_step': 25, 'gen_lr': 0.001, 'gen_step': 40}, 'GPD_local': {'pre_Nfeat': [3, 33, 66, 99], 'conv_n_layers': 3, 'conv_layer': {'in_feat': 99, 'fnet_feat': 99, 'out_feat': 99, 'rank_theta': 11, 'stride_th1': 33, 'stride_th2': 33, 'min_nn': 8}}}
Class choice list: ['Airplane', 'Bag', 'Cap', 'Car', 'Chair', 'Guitar', 'Lamp', 'Laptop', 'Motorbike', 'Mug', 'Pistol', 'Skateboard', 'Table']
Training Completion Task...
Local FE pretrained weights - loading res: _IncompatibleKeys(missing_keys=[], unexpected_keys=['l_conv.sl_W', 'l_conv.sl_b', 'l_conv.gconv.W_flayer_th0', 'l_conv.gconv.b_flayer_th0', 'l_conv.gconv.W_flayer_th1', 'l_conv.gconv.b_flayer_th1', 'l_conv.gconv.W_flayer_th2', 'l_conv.gconv.b_flayer_th2', 'l_conv.gconv.W_flayer_thl', 'l_conv.gconv.b_flayer_thl'])
Global FE pretrained weights - loading res: <All keys matched successfully>
Num GPUs: 4, Parallelism: True
Centroids: [[ 1  0  0]
 [ 0  0  1]
 [ 1  0  1]
 [-1  0  0]
 [-1  1  0]]
Training.. 
Traceback (most recent call last):
  File "train_deco.py", line 437, in <module>
    main_worker()  # train DeCo
  File "train_deco.py", line 309, in main_worker
    feat = gl_encoder(partials)
  File "/home/$USER/miniconda3/envs/deco_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/$USER/miniconda3/envs/deco_env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/$USER/miniconda3/envs/deco_env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/$USER/miniconda3/envs/deco_env/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/$USER/miniconda3/envs/deco_env/lib/python3.7/site-packages/torch/_utils.py", line 369, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/$USER/miniconda3/envs/deco_env/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/$USER/miniconda3/envs/deco_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/$USER/Deco/models/model_deco.py", line 94, in forward
    l_feat = self.local_encoder(points)  # [B, 99, N]
  File "/home/$USER/miniconda3/envs/deco_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/$USER/Deco/models/model_deco.py", line 49, in forward
    h = self.pre_nn_layers(points)  # [B, in_feat, N]
  File "/home/$USER/miniconda3/envs/deco_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/$USER/miniconda3/envs/deco_env/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/$USER/miniconda3/envs/deco_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/$USER/miniconda3/envs/deco_env/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/$USER/miniconda3/envs/deco_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/$USER/miniconda3/envs/deco_env/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 200, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR

Environment: Ubuntu 20.04.3, Python 3.7.9, PyTorch 1.2.0, CUDA 11.4, gcc 9.3.0

I would highly appreciate your help with this error. Let me know if you need any other information.

antoalli commented 3 years ago

Hi! I think the problem is that PyTorch 1.2 is not guaranteed to work with CUDA 11. In our experiments we tested the code with CUDA 10.0 and CUDA 10.1. Have you tried a lower CUDA version?
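
As a quick sanity check (a minimal sketch, not part of the DeCo code), you can print the CUDA and cuDNN versions your PyTorch build actually ships with, and run a tiny convolution to exercise cuDNN outside the full model:

```python
import torch

# Print the versions this PyTorch build actually ships with; a mismatch
# with the system driver/toolkit is a common cause of
# CUDNN_STATUS_MAPPING_ERROR.
print(torch.__version__)               # e.g. 1.2.0
print(torch.version.cuda)              # CUDA version PyTorch was built with
print(torch.backends.cudnn.version())  # bundled cuDNN version
print(torch.cuda.is_available())

# Tiny conv forward on the GPU: exercises cuDNN outside the full model.
conv = torch.nn.Conv1d(3, 8, kernel_size=1).cuda()
out = conv(torch.randn(2, 3, 16).cuda())
print(out.shape)  # torch.Size([2, 8, 16]) if cuDNN works
```

If even this tiny forward pass fails, the problem is the PyTorch/CUDA pairing rather than anything in DeCo.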

nama1arpit commented 2 years ago

Hi @antoalli, since CUDA 11.4 is supposed to work well with PyTorch 1.9, I updated PyTorch, and the training program now gives me this error:

Traceback (most recent call last):
  File "train_deco.py", line 437, in <module>
    main_worker()  # train DeCo
  File "train_deco.py", line 309, in main_worker
    feat = gl_encoder(partials)
  File "/home/$USER/miniconda3/envs/deco_env2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/$USER/miniconda3/envs/deco_env2/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/$USER/miniconda3/envs/deco_env2/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/$USER/miniconda3/envs/deco_env2/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/$USER/miniconda3/envs/deco_env2/lib/python3.7/site-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/$USER/miniconda3/envs/deco_env2/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/$USER/miniconda3/envs/deco_env2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/share/home/$USER/Deco/models/model_deco.py", line 95, in forward
    g_feat = self.global_encoder(points)  # [B, global_emb]
  File "/home/$USER/miniconda3/envs/deco_env2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/share/home/$USER/Deco/model_utils.py", line 184, in forward
    x = get_graph_feature(x2, k=self.k)
  File "/share/home/$USER/Deco/model_utils.py", line 61, in get_graph_feature
    feature = torch.cat((feature - x, x), dim=3).permute(0, 3, 1, 2)
RuntimeError: CUDA out of memory. Tried to allocate 144.00 MiB (GPU 0; 9.78 GiB total capacity; 8.13 GiB already allocated; 26.56 MiB free; 8.33 GiB reserved in total by PyTorch)

As I am using recent GPU cards that should have enough memory, I am curious whether you have seen issues like this. Any hints on how to proceed?

antoalli commented 2 years ago

Hi! Unfortunately not, we used the described environment (CUDA 10 + torch 1.2). From the PyTorch page (https://pytorch.org/get-started/locally/) I see that PyTorch 1.9 supports CUDA 11.1, not 11.4, which is likely part of your problem. Have you tried a conda environment? Within conda you can install a different CUDA toolkit version and easily replicate our environment; alternatively, you can use Docker.

More importantly, regarding the error message you attached: this is a standard CUDA out-of-memory error. How many GPUs are you training on, and with which batch size? Note that (as explained in the README) our code ran on 2 x 24GB Titan RTX with batch size 30 and parallelism implemented through PyTorch DataParallel.

For any questions feel free to reach me here or via email,
Antonio
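
To see how close each card is to its limit, here is a minimal generic sketch (not DeCo-specific) that reports per-GPU capacity and usage; remember that DataParallel splits each batch across the visible GPUs, so the per-GPU load scales roughly with batch_size / num_gpus:

```python
import torch

# Report capacity and current usage for every visible GPU. DataParallel
# splits each input batch across these devices, so the per-GPU memory
# load scales roughly with batch_size / num_gpus.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i} ({props.name}): "
          f"{props.total_memory / 1024**3:.1f} GiB total, "
          f"{torch.cuda.memory_allocated(i) / 1024**3:.2f} GiB allocated, "
          f"{torch.cuda.memory_reserved(i) / 1024**3:.2f} GiB reserved")
```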

nama1arpit commented 2 years ago

Hi! As it was a standard memory problem, I tried a lower batch size of 20 and it worked. Do you think it will affect the results significantly?

antoalli commented 2 years ago

I don't think the batch size will affect the results. Anyway, I suggest you use the same PyTorch and CUDA versions specified in the README.
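
That said, if you ever need to keep the original effective batch size of 30 on smaller GPUs, generic gradient accumulation is one option. The sketch below uses a dummy model and dummy data just to make the loop runnable; it is not part of train_deco.py:

```python
import torch
import torch.nn as nn

# Gradient-accumulation sketch: micro-batches of 10 with an optimizer
# step every 3 micro-batches give an effective batch size of 30.
# The model and data below are dummies just to make the loop runnable.
model = nn.Linear(8, 1).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
batches = [(torch.randn(10, 8), torch.randn(10, 1)) for _ in range(6)]

accum_steps = 3
optimizer.zero_grad()
for step, (x, y) in enumerate(batches):
    loss = loss_fn(model(x.cuda()), y.cuda())
    (loss / accum_steps).backward()  # scale so accumulated grads average
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```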

nama1arpit commented 2 years ago

Definitely, thanks a lot @antoalli!