NVlabs / pacnet

Pixel-Adaptive Convolutional Neural Networks (CVPR '19)
https://suhangpro.github.io/pac/

Debugging Error #17

Closed josephdanielchang closed 4 years ago

josephdanielchang commented 4 years ago

Hi, I'm a bit confused about how to deal with this error. Can you help?

/home/joseph/pacnet-master/task_jointUpsampling/main.py:122: UserWarning: genfromtxt: Empty input file: "exp/sintel/train.log"
  log = np.genfromtxt(log_path, delimiter=',', skip_header=1, usecols=(0,))
/home/joseph/pacnet-master/task_jointUpsampling/main.py:122: UserWarning: genfromtxt: Empty input file: "exp/sintel/test.log"
  log = np.genfromtxt(log_path, delimiter=',', skip_header=1, usecols=(0,))
Traceback (most recent call last):
  File "/home/joseph/anaconda3/envs/pac/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/joseph/anaconda3/envs/pac/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/joseph/pacnet-master/task_jointUpsampling/main.py", line 348, in <module>
    main()
  File "/home/joseph/pacnet-master/task_jointUpsampling/main.py", line 322, in main
    log_test = test(model, test_loader, device, last_epoch, init_lr, args.loss, perf_measures, args)
  File "/home/joseph/pacnet-master/task_jointUpsampling/main.py", line 86, in test
    output = apply_model(model, lres, guide, args.factor)
  File "/home/joseph/pacnet-master/task_jointUpsampling/main.py", line 22, in apply_model
    out = net(lres, guide)
  File "/home/joseph/anaconda3/envs/pac/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/joseph/pacnet-master/task_jointUpsampling/models.py", line 245, in forward
    x = self.up_convts[i](x, guide_cur)
  File "/home/joseph/anaconda3/envs/pac/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/joseph/pacnet-master/pac.py", line 795, in forward
    self.output_padding, self.dilation, self.shared_filters, self.native_impl)
  File "/home/joseph/pacnet-master/pac.py", line 507, in pacconv_transpose2d
    shared_filters)
  File "/home/joseph/pacnet-master/pac.py", line 261, in forward
    output = torch.einsum('ijklmn,jokl->iomn', (in_mul_k, weight))
  File "/home/joseph/anaconda3/envs/pac/lib/python3.6/site-packages/torch/functional.py", line 211, in einsum
    return torch._C._VariableFunctions.einsum(equation, operands)
RuntimeError: CUDA out of memory. Tried to allocate 2.69 GiB (GPU 0; 10.92 GiB total capacity; 5.74 GiB already allocated; 1.86 GiB free; 2.78 GiB cached)
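(For reference, the allocation counters right before the failing einsum can be printed with torch.cuda's memory helpers. A minimal sketch; the helper name and the suggested placement are examples, not part of the original code:)

import torch

def report_gpu_memory(tag=''):
    # values are in bytes; memory_cached() is the counter available on the 0.4/1.x releases used here
    alloc = torch.cuda.memory_allocated() / 1024 ** 3
    cached = torch.cuda.memory_cached() / 1024 ** 3
    print('[%s] allocated: %.2f GiB, cached: %.2f GiB' % (tag, alloc, cached))

# e.g. call report_gpu_memory('before einsum') just above the torch.einsum line in pac.py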

suhangpro commented 4 years ago

Are you able to run testing with the provided weights? For training, 11GB of memory should be enough for up to 8x upsampling, but might not be for 16x.

josephdanielchang commented 4 years ago

That's odd. When I run testing, it gives a very similar error.

Command: CUDA_VISIBLE_DEVICES=4 python -m task_jointUpsampling.main --load-weights weights_flow/x8_pac_weights_epoch_5000.pth --download --factor 8 --model PacJointUpsample --dataset Sintel --data-root data/sintel

Output:

TEST LOADER START
TEST LOADER END

Model weights initialized from: weights_flow/x8_pac_weights_epoch_5000.pth
TEST START
BEFORE APPLY MODEL
BEFORE NET
AFTER NET
AFTER APPLY MODEL
BEFORE APPLY MODEL
BEFORE NET
Traceback (most recent call last):
  File "/home/joseph/anaconda3/envs/pac/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/joseph/anaconda3/envs/pac/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/joseph/pacnet-master/task_jointUpsampling/main.py", line 362, in <module>
    main()
  File "/home/joseph/pacnet-master/task_jointUpsampling/main.py", line 335, in main
    log_test = test(model, test_loader, device, last_epoch, init_lr, args.loss, perf_measures, args)                   # TEST
  File "/home/joseph/pacnet-master/task_jointUpsampling/main.py", line 89, in test
    output = apply_model(model, lres, guide, args.factor)
  File "/home/joseph/pacnet-master/task_jointUpsampling/main.py", line 23, in apply_model
    out = net(lres, guide)
  File "/home/joseph/anaconda3/envs/pac/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/joseph/pacnet-master/task_jointUpsampling/models.py", line 245, in forward
    x = self.up_convts[i](x, guide_cur)
  File "/home/joseph/anaconda3/envs/pac/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/joseph/pacnet-master/pac.py", line 786, in forward
    self.output_padding, self.dilation, self.shared_filters, self.native_impl)
  File "/home/joseph/pacnet-master/pac.py", line 498, in pacconv_transpose2d
    shared_filters)
  File "/home/joseph/pacnet-master/pac.py", line 252, in forward
    output = torch.einsum('ijklmn,jokl->iomn', (in_mul_k, weight))
  File "/home/joseph/anaconda3/envs/pac/lib/python3.6/site-packages/torch/functional.py", line 211, in einsum
    return torch._C._VariableFunctions.einsum(equation, operands)
RuntimeError: CUDA out of memory. Tried to allocate 2.69 GiB (GPU 0; 10.92 GiB total capacity; 5.74 GiB already allocated; 1.86 GiB free; 2.78 GiB cached)
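(One thing worth checking on the testing side is whether the forward pass runs under torch.no_grad(); if it does not, autograd keeps intermediate activations alive and memory use grows accordingly. A minimal sketch, assuming the loop structure in test() in main.py; this is not necessarily how the released code is written:)

with torch.no_grad():  # inference only: skip storing activations for backprop
    output = apply_model(model, lres, guide, args.factor)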

suhangpro commented 4 years ago

This is indeed odd ... which pytorch version are you using? The code was originally developed for 0.4, but there is an experimental branch for 1.4 which you might try out.

josephdanielchang commented 4 years ago

Checking via python >>> import torch >>> print(torch.__version__), mine is 1.1.0. I am running the optical flow test on Sintel data with weights_flow/x8_pac_weights_epoch_5000.pth:

python -m task_jointUpsampling.main --load-weights weights_flow/x8_pac_weights_epoch_5000.pth --download --factor 8 --model PacJointUpsample --dataset Sintel --data-root data/sintel

How many GB of GPU memory would you estimate is necessary to run the test program?

suhangpro commented 4 years ago

An 11GB GPU should be enough for both training (with the exception of some 16x models) and testing. Versions >1.0 are not supported by the master branch (I expect some test cases to fail as well). The th14 branch is to be used with version 1.4, but it has not been thoroughly tested.

josephdanielchang commented 4 years ago

So 1.0 should work then, correct? Should I downgrade and test again, or do you have other suggestions?

suhangpro commented 4 years ago

You can downgrade to 1.0 or upgrade to 1.4 (and use the th14 branch).
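(For example, something along the lines of pip install torch==1.0.1 for the former, or pip install torch==1.4.0 followed by git checkout th14 inside the repo for the latter; the exact package and torchvision versions to pair with these are not pinned here and may need adjusting for your CUDA setup.)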

josephdanielchang commented 4 years ago

I downgraded to 1.0.0 and it still gives a GPU out-of-memory error when testing flow. Is the data root supposed to be --data-root data/sintel? There are a lot of folders under the data root; should I specify a particular one?

suhangpro commented 4 years ago

@josephdanielchang I just tested on an 11GB GPU and found that the 8x and 16x flow tests indeed won't fit. Sorry that I didn't provide clear information before. With an 11GB GPU, you are able to run all depth experiments but only the 4x flow experiments.

The data path is correct as is.
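(For reference, the 4x flow test would use the same command as above with the factor and weights changed accordingly; the exact weights filename under weights_flow/ is an assumption here and may differ in the released archive:)

CUDA_VISIBLE_DEVICES=4 python -m task_jointUpsampling.main --load-weights weights_flow/x4_pac_weights_epoch_5000.pth --download --factor 4 --model PacJointUpsample --dataset Sintel --data-root data/sintel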

josephdanielchang commented 4 years ago

Thanks, it does work with 4x for flow. Follow-up question: where do I find the results for the "upsampled" flow after running the flow test on the Sintel data? I only find a folder exp/sintel with test.log and train.log, but no .flo files generated anywhere. Is there supposed to be no output?

suhangpro commented 4 years ago

Right, the code is for quantitative evaluation only and does not save results (the semantic segmentation code, though, does have a "--eval pred" option for this purpose).
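(If you do want the predicted flow on disk, here is a minimal sketch of dumping the network output to Middlebury .flo files. The helper below and its call site are not part of the repository, and it assumes output is a 1x2xHxW flow tensor as returned by apply_model:)

import numpy as np

def write_flo(path, flow):
    # flow: 2 x H x W torch tensor (u, v channels)
    uv = flow.detach().cpu().numpy().transpose(1, 2, 0).astype(np.float32)  # H x W x 2
    h, w = uv.shape[:2]
    with open(path, 'wb') as f:
        np.array([202021.25], dtype=np.float32).tofile(f)  # .flo magic number
        np.array([w, h], dtype=np.int32).tofile(f)         # width, height
        uv.tofile(f)                                       # interleaved u, v, row-major

# e.g. inside the test loop, right after output = apply_model(model, lres, guide, args.factor):
# write_flo('exp/sintel/pred_%04d.flo' % batch_idx, output[0])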