Open lfxx opened 4 years ago
It depends on your batch size and image size. As for our training, we use 512x512 images with a batch size of 10 for each 2080Ti (11G) GPU.
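If you want to check roughly what fits on your own card, a small probe like the sketch below can grow the batch size of 512x512 crops until a forward/backward pass hits CUDA OOM. The tiny CNN is only a stand-in for the real encoder-decoder, so its numbers are not GCA-Matting's numbers; it is just a sanity check for your GPU.

```python
# Rough probe (not from this repo): grow the batch size of 512x512 crops
# until a forward/backward pass runs out of CUDA memory. The tiny CNN is a
# stand-in for the real model, so the result is only an upper-bound check.
import torch
import torch.nn as nn

def max_batch_that_fits(model, crop_size=512, limit=32, device="cuda"):
    model = model.to(device)
    best = 0
    for bs in range(1, limit + 1):
        try:
            x = torch.randn(bs, 3, crop_size, crop_size, device=device)
            model(x).mean().backward()      # backward pass dominates memory use
            best = bs
        except RuntimeError as e:           # CUDA OOM surfaces as RuntimeError
            if "out of memory" not in str(e):
                raise
            break
        finally:
            model.zero_grad()
            torch.cuda.empty_cache()
    return best

if __name__ == "__main__":
    toy = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(64, 3, 3, padding=1))
    print("largest batch that fits:", max_batch_that_fits(toy))
```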
Using two 1080Ti (11G) GPUs, I also get this out-of-memory error. Here is the error info:
Which PYTHON: /data/.conda/envs/gca/bin/python
True
Torch Version: 1.1.0
True
Torch Version: 1.1.0
CONFIG:
{'data': {'augmentation': True,
'crop_size': 512,
'random_interp': False,
'test_alpha': '/data/gca_datasets/val_new_alpha',
'test_merged': '/data/gca_datasets/val_merged',
'test_trimap': '/data/gca_datasets/val_trimap',
'train_alpha': '/data/gca_datasets/train_alpha',
'train_bg': '/data/gca_datasets/train2017',
'train_fg': '/data/gca_datasets/train_foreground',
'workers': 4},
'dist': True,
'gpu': 0,
'is_default': False,
'local_rank': 0,
'log': {'checkpoint_path': '/data/gca_datasets/gca_checkpoints/gca-dist',
'checkpoint_step': 2000,
'logging_level': 'INFO',
'logging_path': './logs/stdout/gca-dist',
'logging_step': 10,
'tensorboard_image_step': 2000,
'tensorboard_path': './logs/tensorboard/gca-dist',
'tensorboard_step': 100},
'model': {'arch': {'decoder': 'res_gca_decoder_22',
'discriminator': None,
'encoder': 'resnet_gca_encoder_29'},
'batch_size': 10,
'imagenet_pretrain': False,
'imagenet_pretrain_path': 'pretrain/model_best_resnet34_En_nomixup.pth',
'trimap_channel': 3},
'phase': 'train',
'test': {'alpha': '/data/gca_datasets/test_new_alpha',
'alpha_path': 'prediction/gca-dist',
'batch_size': 1,
'checkpoint': 'gca-dist',
'cpu': False,
'fast_eval': True,
'merged': '/data/gca_datasets/test_merged',
'scale': 'origin',
'trimap': '/data/gca_datasets/test_trimap'},
'train': {'G_lr': 0.0004,
'beta1': 0.5,
'beta2': 0.999,
'clip_grad': True,
'comp_weight': 0,
'gabor_weight': 0,
'grad_weight': 0,
'rec_weight': 1,
'reset_lr': False,
'resume_checkpoint': None,
'smooth_l1_weight': 0,
'total_step': 200000,
'val_step': 2000,
'warmup_step': 5000},
'version': 'gca-dist',
'world_size': 1}
[04-03 18:46:54] INFO: TRAIN: 20031 foreground/images are valid
[04-03 18:46:54] INFO: TEST: 3000 foreground/images are valid
[04-03 18:46:55] INFO: Using pytorch synced BN
[04-03 18:46:55] INFO: DistributedDataParallel(
[04-03 18:50:25] INFO: gca-dist
[04-03 18:50:25] INFO: Number of parameters: 25269144
[04-03 18:50:25] INFO: Load Imagenet Pretrained: pretrain/model_best_resnet34_En_nomixup.pth
[04-03 18:50:38] INFO: [0/200000], REC: 0.5313, lr: 0.000000
Traceback (most recent call last):
File "main.py", line 131, in <module>
main()
File "main.py", line 73, in main
trainer.train()
File "/home/lminc/workspace/GCA-Matting/trainer.py", line 297, in train
alpha_pred, info_dict = self.G(image, trimap)
File "/data/.conda/envs/gca/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/data/.conda/envs/gca/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 376, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/data/.conda/envs/gca/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/lminc/workspace/GCA-Matting/networks/generators.py", line 23, in forward
embedding, mid_fea = self.encoder(inp)
File "/data/.conda/envs/gca/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/lminc/workspace/GCA-Matting/networks/encoders/res_gca_enc.py", line 79, in forward
x4 = self.layer3(x3) # N x 256 x 32 x 32
File "/data/.conda/envs/gca/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/data/.conda/envs/gca/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/data/.conda/envs/gca/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/lminc/workspace/GCA-Matting/networks/encoders/resnet_enc.py", line 41, in forward
out = self.conv2(out)
File "/data/.conda/envs/gca/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/lminc/workspace/GCA-Matting/networks/ops.py", line 80, in forward
return self.module.forward(*args)
File "/data/.conda/envs/gca/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 338, in forward
self.padding, self.dilation, self.groups)
RuntimeError: CUDA out of memory. Tried to allocate 8.16 GiB (GPU 0; 10.91 GiB total capacity; 873.42 MiB already allocated; 8.17 GiB free; 956.58 MiB cached)
What should I do now? So confused :( @Yaoyi-Li
Can you train the model with a batch size of 9 or 8? Or could you provide some nvidia-smi
information when you are training with a smaller batch size? We have tested the training on 1080Ti and it works fine.
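If capturing nvidia-smi by hand at the right moment is awkward, a hedged alternative is to log PyTorch's own memory counters every few steps from inside the training loop; the torch.cuda calls below are standard, but where exactly to call the helper in trainer.py is left as an assumption.

```python
# Sketch: print PyTorch's view of GPU memory, e.g. every logging_step
# iterations inside the training loop. The exact call site in trainer.py is
# an assumption; the torch.cuda functions themselves are standard.
import torch

def log_gpu_memory(step, device=0):
    mib = 1024 ** 2
    alloc  = torch.cuda.memory_allocated(device) / mib
    peak   = torch.cuda.max_memory_allocated(device) / mib
    cached = torch.cuda.memory_cached(device) / mib  # memory_reserved() in newer PyTorch
    print("[step %d] GPU%d: allocated %.0f MiB, peak %.0f MiB, cached %.0f MiB"
          % (step, device, alloc, peak, cached))
```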
Using batch_size=8, I get the out-of-memory error below:
Which PYTHON: /data/.conda/envs/gca/bin/python
True
Torch Version: 1.1.0
True
Torch Version: 1.1.0
CONFIG:
{'data': {'augmentation': True,
'crop_size': 512,
'random_interp': False,
'test_alpha': '/data/gca_datasets/val_new_alpha',
'test_merged': '/data/gca_datasets/val_merged',
'test_trimap': '/data/gca_datasets/val_trimap',
'train_alpha': '/data/gca_datasets/train_alpha',
'train_bg': '/data/gca_datasets/train2017',
'train_fg': '/data/gca_datasets/train_foreground',
'workers': 4},
'dist': True,
'gpu': 0,
'is_default': False,
'local_rank': 0,
'log': {'checkpoint_path': '/data/gca_datasets/gca_checkpoints/gca-dist',
'checkpoint_step': 2000,
'logging_level': 'INFO',
'logging_path': './logs/stdout/gca-dist',
'logging_step': 10,
'tensorboard_image_step': 2000,
'tensorboard_path': './logs/tensorboard/gca-dist',
'tensorboard_step': 100},
'model': {'arch': {'decoder': 'res_gca_decoder_22',
'discriminator': None,
'encoder': 'resnet_gca_encoder_29'},
'batch_size': 8,
'imagenet_pretrain': True,
'imagenet_pretrain_path': 'pretrain/model_best_resnet34_En_nomixup.pth',
'trimap_channel': 3},
'phase': 'train',
'test': {'alpha': '/data/gca_datasets/test_new_alpha',
'alpha_path': 'prediction/gca-dist',
'batch_size': 1,
'checkpoint': 'gca-dist',
'cpu': False,
'fast_eval': True,
'merged': '/data/gca_datasets/test_merged',
'scale': 'origin',
'trimap': '/data/gca_datasets/test_trimap'},
'train': {'G_lr': 0.0004,
'beta1': 0.5,
'beta2': 0.999,
'clip_grad': True,
'comp_weight': 0,
'gabor_weight': 0,
'grad_weight': 0,
'rec_weight': 1,
'reset_lr': False,
'resume_checkpoint': None,
'smooth_l1_weight': 0,
'total_step': 200000,
'val_step': 2000,
'warmup_step': 5000},
'version': 'gca-dist',
'world_size': 1}
[04-03 21:09:01] INFO: TRAIN: 20031 foreground/images are valid
[04-03 21:09:01] INFO: TEST: 3000 foreground/images are valid
[04-03 21:09:03] INFO: Using pytorch synced BN
[04-03 21:09:03] INFO: DistributedDataParallel(
(module): Generator(
When the error above appears, the nvidia-smi info is:
Fri Apr 3 21:12:40 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.36 Driver Version: 440.36 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:01:00.0 On | N/A |
| 33% 52C P8 16W / 250W | 9350MiB / 11175MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:02:00.0 Off | N/A |
| 0% 46C P8 12W / 300W | 20MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1228 G /usr/lib/xorg/Xorg 18MiB |
| 0 1266 G /usr/bin/gnome-shell 49MiB |
| 0 1574 G /usr/lib/xorg/Xorg 144MiB |
| 0 1725 G /usr/bin/gnome-shell 92MiB |
| 0 2232 G /opt/teamviewer/tv_bin/TeamViewer 11MiB |
| 0 2809 G ...downloads/pycharm-2019.3.4/jbr/bin/java 3MiB |
| 0 4376 C /data/.conda/envs/gca/bin/python 9013MiB |
+-----------------------------------------------------------------------------+
What's wrong with this? Please get me out of this :( @Yaoyi-Li
I have no idea what's happening here, but it looks like you are using only one GPU.
But maybe I found something else:
| 0 4376 C /data/.conda/envs/gca/bin/python 9013MiB |
This training is not out of memory.
The time when you started the training is
[04-03 21:09:03] INFO: DistributedDataParallel(
and the time of this nvidia-smi info is
Fri Apr 3 21:12:40 2020
If the training had hit an OOM issue, the process would have been killed by the system. So I guess there may be another training process running on your GPU. You can try to kill that process and try again.
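To double-check whether a stale process is still holding memory when the training starts, nvidia-smi is enough; if you prefer to check from Python, something like the sketch below with the nvidia-ml-py (pynvml) package should work, although that dependency is my assumption and not part of this repo.

```python
# Sketch (assumes the nvidia-ml-py package, i.e. `pip install nvidia-ml-py`):
# list every compute process currently holding memory on each GPU, so a
# leftover training run can be spotted and killed before restarting.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
            used = p.usedGpuMemory // (1024 ** 2) if p.usedGpuMemory else 0
            print("GPU %d: pid %d uses %d MiB" % (i, p.pid, used))
finally:
    pynvml.nvmlShutdown()
```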
Found the problem: when I train the model, there is a test procedure to determine when to save the model, but if a validation image is too large, it hits an OOM error. Can you fix this bug? @Yaoyi-Li
Hi, it's me again. I have now finished training with your code, but the model I trained is really poor. I suspect my dataset is not good; I don't know how to prepare the dataset without a template. I asked the DIM authors to give me their dataset but they refused. Could you please email me a few demo images from DIM, including training and testing ones, to help me debug the code? Just a few images is enough :) @Yaoyi-Li
I'm sorry, but I can't. According to Adobe's license, I'm not at liberty to distribute images in this dataset to anyone else. If they refused to give you the dataset, I think it means you are trying to use them for commercial purposes. I am sorry about it.
Hi! I got the same problem as you. Can you share how you dealt with the OOM? Did you train on 2 GPUs?
Hi, could you please provide your PyTorch and CUDA versions? I found that some other people are also facing this problem, but I have no idea what happened.
In your trainer code, at line 291 (the test part), this code leads to OOM:
image, alpha, trimap = image_dict['image'], image_dict['alpha'], image_dict['trimap']
because some images in my own dataset are very large. Resizing these images smaller before feeding them into the test resolves the issue, but I would like to get an official fix from you.
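Until there is an official fix, one workaround is to cap the longer side of each validation image before the forward pass. The helper below is only a sketch (the 1920-pixel cap and the multiple-of-32 rounding are my assumptions, not values taken from this repo), using bilinear interpolation for the image and nearest-neighbour for the trimap.

```python
# Hypothetical workaround: downscale oversized validation tensors before the
# forward pass in the test step. The max_side value and the rounding to a
# multiple of 32 are assumptions, not taken from this repository.
import torch
import torch.nn.functional as F

def cap_test_size(image, trimap, max_side=1920, multiple=32):
    """image: N x 3 x H x W float tensor, trimap: N x C x H x W tensor."""
    h, w = image.shape[-2:]
    long_side = max(h, w)
    if long_side <= max_side:
        return image, trimap
    scale = max_side / long_side
    new_h = max(multiple, int(h * scale) // multiple * multiple)
    new_w = max(multiple, int(w * scale) // multiple * multiple)
    image = F.interpolate(image, size=(new_h, new_w),
                          mode="bilinear", align_corners=False)
    trimap = F.interpolate(trimap.float(), size=(new_h, new_w), mode="nearest")
    return image, trimap
```

If you patch the test step this way, the predicted alpha should be resized back to the original resolution before computing SAD/MSE against the ground truth.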
Thanks for your reply! Here is my env: CUDA 9.0, PyTorch 1.1.0, 1080Ti (11172MB). I can train on a single GPU with batch_size=10 while testing all 1000 images. However, it's hard to get a SAD as low as yours. When I tried to train on 4 GPUs, or even on 2 GPUs, it failed with a CUDA OOM in the forward procedure.
Thanks for your advice. But I wonder how many GPUs you used? I didn't get the same problem when I used only one GPU.
Just comment out the test procedure in trainer.py and you can move ahead. Resizing your test images smaller will make the test procedure work fine. By the way, which dataset are you using? The DIM one or a dataset you made yourself?
I use the DIM dataset (I'm a student). Maybe we have different datasets.
Hi, I have tried to train the model with CUDA 9.0 and PyTorch 1.1.0 on two 1080Tis. I think the CUDA version doesn't matter, and multi-GPU training won't require much more memory than a single GPU. Did you try to train the model with a smaller batch size like 9 or 8? If you have any hints, could you please share them with us? Thanks.
@Yaoyi-Li, I am facing "'Generator' object has no attribute 'module'". How can I solve this?
How much GPU memory do we need to train the model? @Yaoyi-Li