Closed dongli12 closed 2 years ago
Dear Dong,
I have fixed this bug. Please check out the "dev" branch. The key change is to wrap the network inference in "with torch.no_grad():".
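For reference, here is a minimal sketch of how the fix looks inside a validation loop, assuming a structure similar to trainer.validate() from the traceback below; the data loader and the EPE helper are illustrative placeholders, not the actual FADNet code.

```python
import torch

def validate(self):
    """Validation sketch: run inference without building the autograd graph."""
    self.net.eval()                      # switch BatchNorm/Dropout to eval mode
    total_epe, num_batches = 0.0, 0
    with torch.no_grad():                # key fix: no activations are kept for backprop,
        for input_var, target in self.val_loader:          # so GPU memory stays bounded
            output_net3 = self.net(input_var)               # forward pass only
            total_epe += compute_epe(output_net3, target)   # hypothetical EPE metric helper
            num_batches += 1
    return total_epe / num_batches
```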
Best regards, Qiang Wang
Dear author,
I used your default params to train GANet on V100 GPUs but it failed when executing "val_EPE = trainer.validate()". I used 4 cards and set the batch size to 4. Here is the relevant part of the log:
2020-11-29 18:50:45,199 [dltrainer.py:285] INFO Test: [0/4370] Time 10.4708509445 EPE 0.626178145409
Traceback (most recent call last):
  File "main.py", line 122, in <module>
    main(opt)
  File "main.py", line 59, in main
    val_EPE = trainer.validate()
  File "/scratch/workspace/dongl/FADNet-master/dltrainer.py", line 240, in validate
    output_net3 = self.net(input_var)
  File "/scratch/workspace/dongl/anaconda2/envs/torch1.2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/scratch/workspace/dongl/anaconda2/envs/torch1.2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/scratch/workspace/dongl/anaconda2/envs/torch1.2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/scratch/workspace/dongl/anaconda2/envs/torch1.2/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/scratch/workspace/dongl/anaconda2/envs/torch1.2/lib/python2.7/site-packages/torch/_utils.py", line 369, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/scratch/workspace/dongl/anaconda2/envs/torch1.2/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/scratch/workspace/dongl/anaconda2/envs/torch1.2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/scratch/workspace/dongl/FADNet-master/networks/GANet_deep.py", line 416, in forward
    return self.cost_agg(x, g)
  File "/scratch/workspace/dongl/anaconda2/envs/torch1.2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/scratch/workspace/dongl/FADNet-master/networks/GANet_deep.py", line 321, in forward
    x = self.conv_start(x)
  File "/scratch/workspace/dongl/anaconda2/envs/torch1.2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/scratch/workspace/dongl/FADNet-master/networks/GANet_deep.py", line 37, in forward
    x = self.conv(x)
  File "/scratch/workspace/dongl/anaconda2/envs/torch1.2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/scratch/workspace/dongl/anaconda2/envs/torch1.2/lib/python2.7/site-packages/torch/nn/modules/conv.py", line 478, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA out of memory. Tried to allocate 488.00 MiB (GPU 0; 31.72 GiB total capacity; 30.31 GiB already allocated; 211.88 MiB free; 121.01 MiB cached)
Could you help address this issue? Thanks.
Thanks! It works now. BTW, the loss config for GANet is the same as for PSMNet, right? (loss_configs/psmnet_sceneflow.json). I will also try to reproduce the results of GANet and PSMNet with your code.
Best, Dong
Yes. In fact, the loss config for PSMNet and GANet only defines the training epochs, since neither multi-scale learning nor multi-round loss weights are applied to these models.
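As an illustration only, the snippet below shows one way such a config could be read, collapsing to a single round with uniform loss weights when no multi-scale/multi-round schedule is given; the field names ("rounds", "epochs", "loss_weights") are hypothetical and may not match the actual schema of loss_configs/psmnet_sceneflow.json.

```python
import json

def load_loss_schedule(path):
    # Hypothetical reader: field names are illustrative, not the actual
    # FADNet loss config schema.
    with open(path) as f:
        cfg = json.load(f)
    rounds = cfg.get("rounds", [cfg])            # PSMNet/GANet: effectively one round
    schedule = []
    for r in rounds:
        epochs = r["epochs"]                     # the only value that really matters here
        weights = r.get("loss_weights", [1.0])   # uniform weight when no multi-scale loss
        schedule.append((epochs, weights))
    return schedule

# Example: schedule = load_loss_schedule("loss_configs/psmnet_sceneflow.json")
```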