advimman / lama

🦙 LaMa Image Inpainting, Resolution-robust Large Mask Inpainting with Fourier Convolutions, WACV 2022
https://advimman.github.io/lama-project/
Apache License 2.0

loss.backward() error in prediction using feature refinement #141

Closed: FBehrad closed this issue 2 years ago

FBehrad commented 2 years ago

Hello, when I try to make a prediction using feature refinement, I run into the following problem:

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

I debugged my code and noticed the problem occurs at loss.backward(). When I added loss.requires_grad = True before loss.backward(), the error went away, but the loss remained constant. I have checked the code multiple times, but I cannot find the problem. :(
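A minimal, self-contained sketch (not from the LaMa code) of the failure mode described above, assuming the prediction was produced without a computation graph:

```python
import torch

# A tensor produced inside torch.no_grad() (or otherwise detached) has no grad_fn.
with torch.no_grad():
    pred = torch.sigmoid(torch.randn(4))

target = torch.zeros(4)
loss = ((pred - target) ** 2).mean()
print(loss.grad_fn)  # None -> the loss is detached from any computation graph

# Calling loss.backward() here raises:
# RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

# Setting requires_grad on the detached loss silences the error, but backward()
# then has no graph to traverse, so no upstream tensors receive gradients and
# the loss never changes between refinement iterations.
loss.requires_grad = True
loss.backward()  # runs, but only sets loss.grad; nothing useful is optimized
```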

senya-ashukha commented 2 years ago

@ankuPRK

ankuPRK commented 2 years ago

Hey, thanks for raising the issue. Could you share the command you ran for the feature refinement? Also, could you share your PyTorch version and GPU configuration?

It seems to be working fine on my system, so loss.requires_grad shouldn't need to be set.

FBehrad commented 2 years ago

Thank you so much for your quick response. I'm running this code on my local computer.

My torch version is 1.12.0+cu116.

My GPU configuration (nvidia-smi):

NVIDIA-SMI 512.77    Driver Version: 512.77    CUDA Version: 11.6
GPU 0: NVIDIA GeForce ... (WDDM) | Fan 53% | Temp 48C | Perf P2 | Pwr 47W / 170W | Memory 9001MiB / 12288MiB | GPU-Util 1% | Compute M. Default | MIG M. N/A

FBehrad commented 2 years ago

I have debugged my code and found that the grad_fn attribute of the output (output = forward_rear(input_feat)) is None. When I then use this prediction to compute the loss, the grad_fn of the loss is also None. Therefore, when I call loss.backward(), torch does not know how to compute the gradients, because the loss is detached from the computation graph.
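A toy sketch (dummy tensors and a stand-in module, not the actual refinement code) of how checking grad_fn locates where the graph is cut:

```python
import torch

def check(name, t):
    fn = type(t.grad_fn).__name__ if t.grad_fn is not None else None
    print(f"{name}: requires_grad={t.requires_grad}, grad_fn={fn}")

# Stand-ins for the refinement tensors (input_feat / forward_rear in the thread).
input_feat = torch.randn(1, 8, requires_grad=True)
forward_rear = torch.nn.Sequential(torch.nn.Linear(8, 1), torch.nn.Sigmoid())

check("input_feat", input_feat)   # requires_grad=True, grad_fn=None (leaf tensor)

output = forward_rear(input_feat)
check("output", output)           # grad_fn=SigmoidBackward0 when the graph is intact

with torch.no_grad():
    detached = forward_rear(input_feat)
check("detached", detached)       # grad_fn=None: the symptom described above
```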

ankuPRK commented 2 years ago

If I see correctly, your system has one GPU with 12 GB of memory. Unfortunately, our algorithm requires 24 GB of total GPU memory. You can still run it by replacing gpu_ids: 0,1 -> 0 and px_budget: 1800000 -> 900000 in the config file: https://github.com/saic-mdal/lama/blob/bd69ec300be277281dcbdcf53314f27ebb6e812c/configs/prediction/default.yaml#L17
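For reference, the relevant part of the refiner section in that config might then look roughly like this (only the two keys mentioned above are shown; everything else in the file stays as is):

```yaml
refiner:
  gpu_ids: 0          # was 0,1; run the refiner on the single available GPU
  px_budget: 900000   # was 1800000; halve the pixel budget to fit in ~12 GB
```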

But in that case, if it were a GPU issue, your system would have thrown RuntimeError: CUDA error: invalid device ordinal, so I'm guessing you have already fixed this part.

Assuming the above is taken care of, I tried running the refinement with PyTorch 1.12.0 and printed output_feat.grad_fn at this line: https://github.com/saic-mdal/lama/blob/bd69ec300be277281dcbdcf53314f27ebb6e812c/saicinpainting/evaluation/refinement.py#L144

It shows Sigmoid in my case:

(Pdb) output_feat.grad_fn
<SigmoidBackward0 object at 0x7faca6af4f40>

And the grad_fn of the loss at this line: https://github.com/saic-mdal/lama/blob/bd69ec300be277281dcbdcf53314f27ebb6e812c/saicinpainting/evaluation/refinement.py#L163

is Addition:

(Pdb) loss.grad_fn
<AddBackward0 object at 0x7fec3538a790>

FBehrad commented 2 years ago

Thank you so much for your answer. I found the problem: I was calling refine_predict(batch, self.model, refiner_config) inside torch.no_grad(), so the computation graph was never created.
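A generic PyTorch sketch (not the repo's code) of why an outer torch.no_grad() block prevents the graph from being built, and how torch.enable_grad() restores it if the surrounding code has to stay under no_grad():

```python
import torch

net = torch.nn.Linear(4, 1)
x = torch.randn(2, 4)

with torch.no_grad():
    y = net(x)
print(y.grad_fn)  # None: no graph was built, so a loss derived from y cannot backprop

with torch.no_grad():
    # Re-enable graph construction locally; this is what the refiner needs
    # in order to call loss.backward() on its feature tensors.
    with torch.enable_grad():
        y = net(x)
print(y.grad_fn)  # <AddmmBackward0 ...>: the graph exists, so backward() would work
```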

ankuPRK commented 2 years ago

Awesome :) I hope it is working normally now. If anything else comes up, please feel free to reopen this issue or open a new one.