HKBU-HPML / FADNet


Training GANet runs out of memory #9

Closed dongli12 closed 2 years ago

dongli12 commented 4 years ago

Dear author,

I used your default params to train GANet on a V100 but it failed when executing "val_EPE = trainer.validate()". I used 4 cards with a batch size of 4. Here is the relevant part of the log:

2020-11-29 18:50:45,199 [dltrainer.py:285] INFO Test: [0/4370] Time 10.4708509445 EPE 0.626178145409
Traceback (most recent call last):
  File "main.py", line 122, in <module>
    main(opt)
  File "main.py", line 59, in main
    val_EPE = trainer.validate()
  File "/scratch/workspace/dongl/FADNet-master/dltrainer.py", line 240, in validate
    output_net3 = self.net(input_var)
  File "/scratch/workspace/dongl/anaconda2/envs/torch1.2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/scratch/workspace/dongl/anaconda2/envs/torch1.2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/scratch/workspace/dongl/anaconda2/envs/torch1.2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/scratch/workspace/dongl/anaconda2/envs/torch1.2/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/scratch/workspace/dongl/anaconda2/envs/torch1.2/lib/python2.7/site-packages/torch/_utils.py", line 369, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/scratch/workspace/dongl/anaconda2/envs/torch1.2/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/scratch/workspace/dongl/anaconda2/envs/torch1.2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/scratch/workspace/dongl/FADNet-master/networks/GANet_deep.py", line 416, in forward
    return self.cost_agg(x, g)
  File "/scratch/workspace/dongl/anaconda2/envs/torch1.2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/scratch/workspace/dongl/FADNet-master/networks/GANet_deep.py", line 321, in forward
    x = self.conv_start(x)
  File "/scratch/workspace/dongl/anaconda2/envs/torch1.2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/scratch/workspace/dongl/FADNet-master/networks/GANet_deep.py", line 37, in forward
    x = self.conv(x)
  File "/scratch/workspace/dongl/anaconda2/envs/torch1.2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/scratch/workspace/dongl/anaconda2/envs/torch1.2/lib/python2.7/site-packages/torch/nn/modules/conv.py", line 478, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA out of memory. Tried to allocate 488.00 MiB (GPU 0; 31.72 GiB total capacity; 30.31 GiB already allocated; 211.88 MiB free; 121.01 MiB cached)

Could you help address this issue? Thanks.

blackjack2015 commented 4 years ago

Dear Dong,

I have fixed this bug. Please check out the "dev" branch. The key change is wrapping the network inference in "with torch.no_grad():" so that validation does not build the autograd graph.
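For reference, the pattern looks like this (a minimal sketch of a validation loop, not the exact FADNet code; names like "net" and "val_loader" are illustrative):

```python
import torch

def validate(net, val_loader):
    net.eval()  # put BatchNorm/Dropout layers into inference mode
    total_epe, count = 0.0, 0
    # no_grad() tells autograd not to record operations, so intermediate
    # activations are freed immediately instead of being kept for backward.
    # Without it, a validation pass holds the whole graph in GPU memory.
    with torch.no_grad():
        for input_var, target in val_loader:
            output = net(input_var)
            total_epe += (output - target).abs().mean().item()
            count += 1
    return total_epe / count
```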

Best regards, Qiang Wang


dongli12 commented 4 years ago

Thanks! It works now. BTW, the loss config for GANet is the same as for PSMNet, right? (loss_configs/psmnet_sceneflow.json). I will also try to reproduce the results of GANet and PSMNet with your code.

Best, Dong

blackjack2015 commented 4 years ago

Yes. In fact, the loss configs for PSMNet and GANet only define the number of training epochs, since neither multi-scale learning nor multi-round loss weighting is applied to them.
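In other words, for these two models the config degenerates into a plain epoch schedule. A hypothetical sketch of how a trainer might consume such a file (the field names here are illustrative, not the actual contents of loss_configs/psmnet_sceneflow.json):

```python
import json

# Hypothetical schedule: a single round with a uniform loss weight,
# so only the epoch count actually matters.
config_text = '{"rounds": 1, "round_epochs": [64], "loss_weights": [[1.0]]}'
config = json.loads(config_text)

for rnd in range(config["rounds"]):
    epochs = config["round_epochs"][rnd]
    weights = config["loss_weights"][rnd]
    # With one round and one weight, this reduces to
    # "train for `epochs` epochs with an unweighted loss".
    print(rnd, epochs, weights)
```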
