anujinho / trident

Official repository for the paper TRIDENT: Transductive Decoupled Variational Inference for Few Shot Classification
https://openreview.net/forum?id=bomdTc9HyL
MIT License
39 stars 5 forks

CUDA out of memory on an 11GB NVIDIA 2080Ti GPU. #6

Open The-Hulk-007 opened 6 months ago

The-Hulk-007 commented 6 months ago

Hi @anujinho, I am trying to reproduce your TRIDENT CCVAE model, but I cannot solve this problem: CUDA out of memory on an 11GB NVIDIA 2080Ti GPU. Even if I lower the meta_batch_size (e.g. 10, 4, 1), I can't get it to work. Below are my running configuration, hyperparameters, and a screenshot:

e.g. mini-5,1/train_conf.json

1. Running configuration:

python -m src.trident_train --cnfg /home/zzh/projectLists/trident/configs/mini-5,1/train_conf.json

2. train_conf.json hyperparameters:

{
"dataset": "miniimagenet",
"root": "./dataset/mini_imagenet",
"n_ways": 5,
"k_shots": 1,
"q_shots": 10,
"inner_adapt_steps_train": 5,
"inner_adapt_steps_test": 5,
"inner_lr": 0.001,
"meta_lr": 0.0001,
"meta_batch_size": 20,
"iterations": 100000,
"reconstr": "std",
"wt_ce": 100,
"klwt": "False",
"rec_wt": 0.01,
"beta_l": 1,
"beta_s": 1,
"zl": 64,
"zs": 64,
"task_adapt": "True",
"experiment": "exp1",
"order": "False",
"device": "cuda:3"
}

3. Screenshot of results:

  0%|  | 0/100000 [00:00<?, ?it/s]/home/zzh/anaconda3/envs/tip/lib/python3.9/site-packages/torch/_tensor.py:1013: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at  /opt/conda/conda-bld/pytorch_1639180549130/work/build/aten/src/ATen/core/TensorBody.h:417.)
  return self._grad
  0%|                                                                                 | 4/100000 [00:27<193:24:54,  6.96s/it]
Traceback (most recent call last):
  File "/home/zzh/anaconda3/envs/tip/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/zzh/anaconda3/envs/tip/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/zzh/projectLists/trident/src/trident_train.py", line 103, in <module>
    evaluation_loss, evaluation_accuracy = inner_adapt_trident(
  File "/home/zzh/projectLists/trident/src/zoo/trident_utils.py", line 125, in inner_adapt_trident
    reconst_image, logits, mu_l, log_var_l, mu_s, log_var_s = learner(
  File "/home/zzh/anaconda3/envs/tip/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zzh/anaconda3/envs/tip/lib/python3.9/site-packages/learn2learn/algorithms/maml.py", line 107, in forward
    return self.module(*args, **kwargs)
  File "/home/zzh/anaconda3/envs/tip/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zzh/projectLists/trident/src/zoo/archs.py", line 814, in forward
    logits, mu_l, log_var_l, z_l = self.classifier_vae(x, z_s, update)
  File "/home/zzh/anaconda3/envs/tip/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zzh/projectLists/trident/src/zoo/archs.py", line 752, in forward
    x = self.encoder(x, update)
  File "/home/zzh/anaconda3/envs/tip/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zzh/projectLists/trident/src/zoo/archs.py", line 533, in forward
    x = self.net(x)
  File "/home/zzh/anaconda3/envs/tip/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zzh/anaconda3/envs/tip/lib/python3.9/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home/zzh/anaconda3/envs/tip/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zzh/anaconda3/envs/tip/lib/python3.9/site-packages/torch/nn/modules/activation.py", line 738, in forward
    return F.leaky_relu(input, self.negative_slope, self.inplace)
  File "/home/zzh/anaconda3/envs/tip/lib/python3.9/site-packages/torch/nn/functional.py", line 1475, in leaky_relu
    result = torch._C._nn.leaky_relu(input, negative_slope)

RuntimeError: CUDA out of memory. Tried to allocate 48.00 MiB (GPU 3; 10.76 GiB total capacity; 9.63 GiB already allocated; 49.12 MiB free; 9.64 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

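For completeness, I also tried the allocator setting that the error message itself points at. A minimal sketch of what I mean (128 MiB is an arbitrary value I picked, not a tuned one; the variable has to be set before the first CUDA allocation, i.e. before torch is imported):

```python
import os

# Cap the caching allocator's split size to reduce fragmentation, as the
# OOM message suggests. This must happen before the first CUDA allocation,
# so it goes at the very top of the entry point, before `import torch`.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```

Equivalently, it can be set from the shell for a single run: PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python -m src.trident_train --cnfg /home/zzh/projectLists/trident/configs/mini-5,1/train_conf.json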

I also learned from the other issues in this repository that the code for this paper does not run in a distributed manner. I have tried many approaches but could not solve this. Could you please help resolve this issue or offer some suggestions? I look forward to hearing from you soon. Thanks!
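For reference, here is my back-of-envelope arithmetic on why lowering meta_batch_size alone may not be enough (assuming the standard N-way K-shot episode layout; I have not traced the repo's exact batching):

```python
# Each episode holds n_ways * (k_shots + q_shots) images at once,
# using the values from my train_conf.json above.
n_ways, k_shots, q_shots = 5, 1, 10

images_per_task = n_ways * (k_shots + q_shots)
print(images_per_task)  # 55

# So even with meta_batch_size = 1, one task's 55 images (plus the
# activations retained across the 5 inner adaptation steps for the
# meta-gradient) must fit on the GPU at the same time.
```

If that per-task footprint is what fills the 11GB, then reducing q_shots or using a first-order approximation would matter more than the batch size.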