Open lfxx opened 4 years ago
It depends on your batch size and image size. As for our training, we use 512x512 images with a batch size of 10 for each 2080Ti (11G) GPU.
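If you want to check roughly what fits on your own card, a small probe like the sketch below can grow the batch size of 512x512 crops until a forward/backward pass hits CUDA OOM. The tiny CNN is only a stand-in for the real encoder-decoder, so its numbers are not GCA-Matting's numbers; it is just a sanity check for your GPU.

```python
# Rough probe (not from this repo): grow the batch size of 512x512 crops
# until a forward/backward pass runs out of CUDA memory. The tiny CNN is a
# stand-in for the real model, so the result is only an upper-bound check.
import torch
import torch.nn as nn

def max_batch_that_fits(model, crop_size=512, limit=32, device="cuda"):
    model = model.to(device)
    best = 0
    for bs in range(1, limit + 1):
        try:
            x = torch.randn(bs, 3, crop_size, crop_size, device=device)
            model(x).mean().backward()      # backward pass dominates memory use
            best = bs
        except RuntimeError as e:           # CUDA OOM surfaces as RuntimeError
            if "out of memory" not in str(e):
                raise
            break
        finally:
            model.zero_grad()
            torch.cuda.empty_cache()
    return best

if __name__ == "__main__":
    toy = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(64, 3, 3, padding=1))
    print("largest batch that fits:", max_batch_that_fits(toy))
```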
Using two 1080Ti (11G) GPUs, I also get this out-of-memory error. Here is the error info:
Which PYTHON: /data/.conda/envs/gca/bin/python
True
Torch Version: 1.1.0
True
Torch Version: 1.1.0
CONFIG:
{'data': {'augmentation': True,
'crop_size': 512,
'random_interp': False,
'test_alpha': '/data/gca_datasets/val_new_alpha',
'test_merged': '/data/gca_datasets/val_merged',
'test_trimap': '/data/gca_datasets/val_trimap',
'train_alpha': '/data/gca_datasets/train_alpha',
'train_bg': '/data/gca_datasets/train2017',
'train_fg': '/data/gca_datasets/train_foreground',
'workers': 4},
'dist': True,
'gpu': 0,
'is_default': False,
'local_rank': 0,
'log': {'checkpoint_path': '/data/gca_datasets/gca_checkpoints/gca-dist',
'checkpoint_step': 2000,
'logging_level': 'INFO',
'logging_path': './logs/stdout/gca-dist',
'logging_step': 10,
'tensorboard_image_step': 2000,
'tensorboard_path': './logs/tensorboard/gca-dist',
'tensorboard_step': 100},
'model': {'arch': {'decoder': 'res_gca_decoder_22',
'discriminator': None,
'encoder': 'resnet_gca_encoder_29'},
'batch_size': 10,
'imagenet_pretrain': False,
'imagenet_pretrain_path': 'pretrain/model_best_resnet34_En_nomixup.pth',
'trimap_channel': 3},
'phase': 'train',
'test': {'alpha': '/data/gca_datasets/test_new_alpha',
'alpha_path': 'prediction/gca-dist',
'batch_size': 1,
'checkpoint': 'gca-dist',
'cpu': False,
'fast_eval': True,
'merged': '/data/gca_datasets/test_merged',
'scale': 'origin',
'trimap': '/data/gca_datasets/test_trimap'},
'train': {'G_lr': 0.0004,
'beta1': 0.5,
'beta2': 0.999,
'clip_grad': True,
'comp_weight': 0,
'gabor_weight': 0,
'grad_weight': 0,
'rec_weight': 1,
'reset_lr': False,
'resume_checkpoint': None,
'smooth_l1_weight': 0,
'total_step': 200000,
'val_step': 2000,
'warmup_step': 5000},
'version': 'gca-dist',
'world_size': 1}
[04-03 18:46:54] INFO: TRAIN: 20031 foreground/images are valid
[04-03 18:46:54] INFO: TEST: 3000 foreground/images are valid
[04-03 18:46:55] INFO: Using pytorch synced BN
[04-03 18:46:55] INFO: DistributedDataParallel(
[04-03 18:50:25] INFO: gca-dist
[04-03 18:50:25] INFO: Number of parameters: 25269144
[04-03 18:50:25] INFO: Load Imagenet Pretrained: pretrain/model_best_resnet34_En_nomixup.pth
[04-03 18:50:38] INFO: [0/200000], REC: 0.5313, lr: 0.000000
Traceback (most recent call last):
File "main.py", line 131, in <module>
main()
File "main.py", line 73, in main
trainer.train()
File "/home/lminc/workspace/GCA-Matting/trainer.py", line 297, in train
alpha_pred, info_dict = self.G(image, trimap)
File "/data/.conda/envs/gca/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/data/.conda/envs/gca/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 376, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/data/.conda/envs/gca/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/lminc/workspace/GCA-Matting/networks/generators.py", line 23, in forward
embedding, mid_fea = self.encoder(inp)
File "/data/.conda/envs/gca/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/lminc/workspace/GCA-Matting/networks/encoders/res_gca_enc.py", line 79, in forward
x4 = self.layer3(x3) # N x 256 x 32 x 32
File "/data/.conda/envs/gca/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/data/.conda/envs/gca/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/data/.conda/envs/gca/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/lminc/workspace/GCA-Matting/networks/encoders/resnet_enc.py", line 41, in forward
out = self.conv2(out)
File "/data/.conda/envs/gca/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/lminc/workspace/GCA-Matting/networks/ops.py", line 80, in forward
return self.module.forward(*args)
File "/data/.conda/envs/gca/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 338, in forward
self.padding, self.dilation, self.groups)
RuntimeError: CUDA out of memory. Tried to allocate 8.16 GiB (GPU 0; 10.91 GiB total capacity; 873.42 MiB already allocated; 8.17 GiB free; 956.58 MiB cached)
What should I do now? So confused :( @Yaoyi-Li
Can you train the model with a batch size of 9 or 8? Or could you provide some nvidia-smi
information when you are training with a smaller batch size? We have tested the training on 1080Ti and it works fine.
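If capturing nvidia-smi by hand at the right moment is awkward, a hedged alternative is to log PyTorch's own memory counters every few steps from inside the training loop; the torch.cuda calls below are standard, but where exactly to call the helper in trainer.py is left as an assumption.

```python
# Sketch: print PyTorch's view of GPU memory, e.g. every logging_step
# iterations inside the training loop. The exact call site in trainer.py is
# an assumption; the torch.cuda functions themselves are standard.
import torch

def log_gpu_memory(step, device=0):
    mib = 1024 ** 2
    alloc  = torch.cuda.memory_allocated(device) / mib
    peak   = torch.cuda.max_memory_allocated(device) / mib
    cached = torch.cuda.memory_cached(device) / mib  # memory_reserved() in newer PyTorch
    print("[step %d] GPU%d: allocated %.0f MiB, peak %.0f MiB, cached %.0f MiB"
          % (step, device, alloc, peak, cached))
```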
Using batch_size=8, I get the out-of-memory error below:
Which PYTHON: /data/.conda/envs/gca/bin/python
True
Torch Version: 1.1.0
True
Torch Version: 1.1.0
CONFIG:
{'data': {'augmentation': True,
'crop_size': 512,
'random_interp': False,
'test_alpha': '/data/gca_datasets/val_new_alpha',
'test_merged': '/data/gca_datasets/val_merged',
'test_trimap': '/data/gca_datasets/val_trimap',
'train_alpha': '/data/gca_datasets/train_alpha',
'train_bg': '/data/gca_datasets/train2017',
'train_fg': '/data/gca_datasets/train_foreground',
'workers': 4},
'dist': True,
'gpu': 0,
'is_default': False,
'local_rank': 0,
'log': {'checkpoint_path': '/data/gca_datasets/gca_checkpoints/gca-dist',
'checkpoint_step': 2000,
'logging_level': 'INFO',
'logging_path': './logs/stdout/gca-dist',
'logging_step': 10,
'tensorboard_image_step': 2000,
'tensorboard_path': './logs/tensorboard/gca-dist',
'tensorboard_step': 100},
'model': {'arch': {'decoder': 'res_gca_decoder_22',
'discriminator': None,
'encoder': 'resnet_gca_encoder_29'},
'batch_size': 8,
'imagenet_pretrain': True,
'imagenet_pretrain_path': 'pretrain/model_best_resnet34_En_nomixup.pth',
'trimap_channel': 3},
'phase': 'train',
'test': {'alpha': '/data/gca_datasets/test_new_alpha',
'alpha_path': 'prediction/gca-dist',
'batch_size': 1,
'checkpoint': 'gca-dist',
'cpu': False,
'fast_eval': True,
'merged': '/data/gca_datasets/test_merged',
'scale': 'origin',
'trimap': '/data/gca_datasets/test_trimap'},
'train': {'G_lr': 0.0004,
'beta1': 0.5,
'beta2': 0.999,
'clip_grad': True,
'comp_weight': 0,
'gabor_weight': 0,
'grad_weight': 0,
'rec_weight': 1,
'reset_lr': False,
'resume_checkpoint': None,
'smooth_l1_weight': 0,
'total_step': 200000,
'val_step': 2000,
'warmup_step': 5000},
'version': 'gca-dist',
'world_size': 1}
[04-03 21:09:01] INFO: TRAIN: 20031 foreground/images are valid
[04-03 21:09:01] INFO: TEST: 3000 foreground/images are valid
[04-03 21:09:03] INFO: Using pytorch synced BN
[04-03 21:09:03] INFO: DistributedDataParallel(
(module): Generator(
When the error above appears, the nvidia-smi info is:
Fri Apr 3 21:12:40 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.36 Driver Version: 440.36 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:01:00.0 On | N/A |
| 33% 52C P8 16W / 250W | 9350MiB / 11175MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:02:00.0 Off | N/A |
| 0% 46C P8 12W / 300W | 20MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1228 G /usr/lib/xorg/Xorg 18MiB |
| 0 1266 G /usr/bin/gnome-shell 49MiB |
| 0 1574 G /usr/lib/xorg/Xorg 144MiB |
| 0 1725 G /usr/bin/gnome-shell 92MiB |
| 0 2232 G /opt/teamviewer/tv_bin/TeamViewer 11MiB |
| 0 2809 G ...downloads/pycharm-2019.3.4/jbr/bin/java 3MiB |
| 0 4376 C /data/.conda/envs/gca/bin/python 9013MiB |
+-----------------------------------------------------------------------------+
What's wrong with this? Please get me out of this :( @Yaoyi-Li
I have no idea what's happening here, but it looks like you are using only one GPU.
But maybe I found something else:
| 0 4376 C /data/.conda/envs/gca/bin/python 9013MiB |
This training is not out of memory.
The time when you started the training is
[04-03 21:09:03] INFO: DistributedDataParallel(
and the time of this nvidia-smi info is
Fri Apr 3 21:12:40 2020
If the training had hit an OOM issue, the process would have been killed by the system. So I guess there may be another training process running on your GPU. You can try to kill that process and try again.
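To double-check whether a stale process is still holding memory when the training starts, nvidia-smi is enough; if you prefer to check from Python, something like the sketch below with the nvidia-ml-py (pynvml) package should work, although that dependency is my assumption and not part of this repo.

```python
# Sketch (assumes the nvidia-ml-py package, i.e. `pip install nvidia-ml-py`):
# list every compute process currently holding memory on each GPU, so a
# leftover training run can be spotted and killed before restarting.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
            used = p.usedGpuMemory // (1024 ** 2) if p.usedGpuMemory else 0
            print("GPU %d: pid %d uses %d MiB" % (i, p.pid, used))
finally:
    pynvml.nvmlShutdown()
```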
Found the problem: when I train the model, there is a test procedure to determine when to save the model, but if a validation image is too large, it hits an OOM error. Can you fix this bug? @Yaoyi-Li
Hi, it's me again. I have now finished training with your code, but the model I trained is really poor. I suspect my dataset is not good; I don't know how to prepare the dataset without a template. I asked the DIM authors to give me their dataset but they refused. Could you please email me a few demo images from DIM, including training and testing ones, to help me debug the code? Just a few images is enough :) @Yaoyi-Li
I'm sorry, but I can't. According to Adobe's license, I'm not at liberty to distribute images in this dataset to anyone else. If they refused to give you the dataset, I think it means you are trying to use them for commercial purposes. I am sorry about it.
Hi! I got the same problem as you. Can you share how you dealt with the OOM? Did you train on 2 GPUs?
Hi, could you please provide your PyTorch and CUDA versions? I found that some other people are also facing this problem, but I have no idea what happened.
In your trainer code, at line 291 (the test part), this code leads to OOM:
image, alpha, trimap = image_dict['image'], image_dict['alpha'], image_dict['trimap']
because some images in my own dataset are very large. Resizing these images smaller before feeding them into the test resolves the issue, but I would like to get an official fix from you.
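Until there is an official fix, one workaround is to cap the longer side of each validation image before the forward pass. The helper below is only a sketch (the 1920-pixel cap and the multiple-of-32 rounding are my assumptions, not values taken from this repo), using bilinear interpolation for the image and nearest-neighbour for the trimap.

```python
# Hypothetical workaround: downscale oversized validation tensors before the
# forward pass in the test step. The max_side value and the rounding to a
# multiple of 32 are assumptions, not taken from this repository.
import torch
import torch.nn.functional as F

def cap_test_size(image, trimap, max_side=1920, multiple=32):
    """image: N x 3 x H x W float tensor, trimap: N x C x H x W tensor."""
    h, w = image.shape[-2:]
    long_side = max(h, w)
    if long_side <= max_side:
        return image, trimap
    scale = max_side / long_side
    new_h = max(multiple, int(h * scale) // multiple * multiple)
    new_w = max(multiple, int(w * scale) // multiple * multiple)
    image = F.interpolate(image, size=(new_h, new_w),
                          mode="bilinear", align_corners=False)
    trimap = F.interpolate(trimap.float(), size=(new_h, new_w), mode="nearest")
    return image, trimap
```

If you patch the test step this way, the predicted alpha should be resized back to the original resolution before computing SAD/MSE against the ground truth.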
Thanks for your reply! Here is my env: CUDA 9.0, PyTorch 1.1.0, 1080Ti (11172MB). I can train on a single GPU with batch_size=10 while testing all 1000 images. However, it's hard to get a SAD as low as yours. When I tried to train on 4 GPUs, or even on 2 GPUs, it failed with a CUDA OOM in the forward procedure.
Thanks for your advice. But I wonder how many GPUs you used? I didn't get the same problem when I used only one GPU.
Just comment out the test procedure in trainer.py and you can move ahead. Resizing your test images smaller will make the test procedure work fine. By the way, which dataset are you using? The DIM one or a dataset you made yourself?
I use the DIM dataset (I'm a student). Maybe we have different datasets.
Hi, I have tried to train the model with CUDA 9.0 and PyTorch 1.1.0 on two 1080Tis. I think the CUDA version doesn't matter, and multi-GPU training won't require much more memory than a single GPU. Did you try to train the model with a smaller batch size like 9 or 8? If you have any hints, could you please share them with us? Thanks.
@Yaoyi-Li, I am facing "'Generator' object has no attribute 'module'". How can I solve this?
How much GPU memory do we need to train the model? @Yaoyi-Li