Closed: wuyujack closed this issue 5 years ago
I changed the PyTorch version to 0.4 and it finally works. It would be helpful if you could specify the versions of the packages you used in the README :)
After training for one epoch, a new issue appears:
2019-07-07 00:07:49,035 INFO Epoch: [0][0/1078] Loss 0.3504 (0.3504)
2019-07-07 00:09:40,503 INFO Epoch: [0][100/1078] Loss 0.3535 (0.3414)
2019-07-07 00:11:28,066 INFO Epoch: [0][200/1078] Loss 0.3366 (0.3431)
2019-07-07 00:13:25,253 INFO Epoch: [0][300/1078] Loss 0.3389 (0.3427)
2019-07-07 00:15:20,560 INFO Epoch: [0][400/1078] Loss 0.2967 (0.3424)
2019-07-07 00:17:14,953 INFO Epoch: [0][500/1078] Loss 0.3370 (0.3422)
2019-07-07 00:19:08,667 INFO Epoch: [0][600/1078] Loss 0.3207 (0.3417)
2019-07-07 00:21:05,087 INFO Epoch: [0][700/1078] Loss 0.2947 (0.3408)
2019-07-07 00:23:00,023 INFO Epoch: [0][800/1078] Loss 0.3078 (0.3394)
2019-07-07 00:24:54,484 INFO Epoch: [0][900/1078] Loss 0.3516 (0.3383)
2019-07-07 00:26:46,944 INFO Epoch: [0][1000/1078] Loss 0.3108 (0.3370)
Current effective learning rate: 0.0001
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "train.py", line 171, in <module>
    main()
  File "train.py", line 167, in main
    train_net(args)
  File "train.py", line 73, in train_net
    logger=logger)
  File "train.py", line 146, in valid
    alpha_out = model(img)  # [N, 320, 320]
  File "/home/mingfu/anaconda3/envs/pytorch-0.4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mingfu/anaconda3/envs/pytorch-0.4/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 114, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/mingfu/anaconda3/envs/pytorch-0.4/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 124, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/mingfu/anaconda3/envs/pytorch-0.4/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 65, in parallel_apply
    raise output
  File "/home/mingfu/anaconda3/envs/pytorch-0.4/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 41, in _worker
    output = module(*input, **kwargs)
  File "/home/mingfu/anaconda3/envs/pytorch-0.4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mingfu/Deep-Image-Matting-v2/models.py", line 121, in forward
    down2, indices_2, unpool_shape2 = self.down2(down1)
  File "/home/mingfu/anaconda3/envs/pytorch-0.4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mingfu/Deep-Image-Matting-v2/models.py", line 55, in forward
    outputs = self.conv1(inputs)
  File "/home/mingfu/anaconda3/envs/pytorch-0.4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mingfu/Deep-Image-Matting-v2/models.py", line 43, in forward
    outputs = self.cbr_unit(inputs)
  File "/home/mingfu/anaconda3/envs/pytorch-0.4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mingfu/anaconda3/envs/pytorch-0.4/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/home/mingfu/anaconda3/envs/pytorch-0.4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mingfu/anaconda3/envs/pytorch-0.4/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 49, in forward
    self.training or not self.track_running_stats, self.momentum, self.eps)
  File "/home/mingfu/anaconda3/envs/pytorch-0.4/lib/python3.6/site-packages/torch/nn/functional.py", line 1194, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THC/generic/THCStorage.cu:58
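The error is raised inside valid() rather than in the training loop, so the first validation pass seems to be what pushes the GPU over its memory limit. A common workaround (not specific to this repo; the loop below is only a rough sketch of what a validation pass in train.py might look like, with hypothetical names) is to run validation under torch.no_grad(), which keeps autograd from storing intermediate activations:

```python
import torch

# Hypothetical sketch of a memory-friendly validation pass, NOT the repo's actual valid().
# torch.no_grad() prevents autograd from keeping activations for backprop,
# which is usually what makes the validation pass run out of GPU memory.
def valid(val_loader, model):
    model.eval()
    with torch.no_grad():
        for img, alpha_label in val_loader:
            img = img.cuda(non_blocking=True)
            alpha_out = model(img)  # [N, 320, 320], as in the traceback above
            # ... compute and accumulate the validation loss against alpha_label here ...
    model.train()
```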
Would you mind sharing how many GPUs you used for the default batch size, --batch-size 32?
Two GPUs. PyTorch version: 1.0.1.post2
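For context, nn.DataParallel splits the input batch along dimension 0 across the visible GPUs, so --batch-size 32 on two cards means each replica processes roughly 16 samples per forward pass. A minimal sketch illustrating this (the 4-channel 320x320 input shape is only an illustrative RGB+trimap crop, not taken from this repo's code; assumes CUDA GPUs are available):

```python
import torch
import torch.nn as nn

class ToyNet(nn.Module):
    # Identity module that just reports the shape and device of the chunk it receives.
    def forward(self, x):
        print("per-replica chunk:", tuple(x.shape), "on", x.device)
        return x

model = nn.DataParallel(ToyNet()).cuda()    # replicate across all visible GPUs
inp = torch.randn(32, 4, 320, 320).cuda()   # batch of 32, as with --batch-size 32
out = model(inp)                            # with 2 GPUs, each replica sees ~16 samples
print("gathered output:", tuple(out.shape)) # (32, 4, 320, 320), gathered on the default GPU
```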
Hi,
Thank you for sharing your PyTorch reimplementation. Would you mind sharing the PyTorch version you used for development?
I am using PyTorch 1.0.1, CUDA 9, and two RTX 2080 Ti GPUs to run 'train.py', since I see you use the DataParallel module to support multi-GPU training. However, I encountered an error, and the traceback is here:
I have tested Data Parallelism using the example here and it works well.
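As a quick environment sanity check (independent of this repo), the following confirms which PyTorch build is active, how many GPUs PyTorch actually sees, and how much memory each card has:

```python
import torch

# Environment check before launching DataParallel training:
# prints the PyTorch version, CUDA availability, and per-GPU memory.
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print("GPU %d: %s, %.1f GB" % (i, props.name, props.total_memory / 1024 ** 3))
```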