advimman / lama

🦙 LaMa Image Inpainting, Resolution-robust Large Mask Inpainting with Fourier Convolutions, WACV 2022
https://advimman.github.io/lama-project/
Apache License 2.0

loss.backward() error in prediction using feature refinement #141

Closed: FBehrad closed this issue 2 years ago

FBehrad commented 2 years ago

Hello, when I try to make a prediction using feature refinement, I run into the following problem:

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

I debugged my code and noticed the problem occurs at loss.backward(). When I added loss.requires_grad = True before loss.backward(), the error went away, but the loss remained constant. I have checked the code multiple times, but I cannot find the problem. :(
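A minimal, self-contained sketch (not from the LaMa code) of the failure mode described above, assuming the prediction was produced without a computation graph:

```python
import torch

# A tensor produced inside torch.no_grad() (or otherwise detached) has no grad_fn.
with torch.no_grad():
    pred = torch.sigmoid(torch.randn(4))

target = torch.zeros(4)
loss = ((pred - target) ** 2).mean()
print(loss.grad_fn)  # None -> the loss is detached from any computation graph

# Calling loss.backward() here raises:
# RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

# Setting requires_grad on the detached loss silences the error, but backward()
# then has no graph to traverse, so no upstream tensors receive gradients and
# the loss never changes between refinement iterations.
loss.requires_grad = True
loss.backward()  # runs, but only sets loss.grad; nothing useful is optimized
```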

senya-ashukha commented 2 years ago

@ankuPRK

ankuPRK commented 2 years ago

Hey, thanks for raising the issue. Could you share the command you ran for the feature refinement? Also, could you share your PyTorch version and GPU configuration?

It seems to be working fine on my system, so loss.requires_grad shouldn't need to be set.

FBehrad commented 2 years ago

Thank you so much for your quick response. I'm running this code on my local computer.

My torch version is 1.12.0+cu116.

My GPU configuration (nvidia-smi):

NVIDIA-SMI 512.77    Driver Version: 512.77    CUDA Version: 11.6
GPU 0: NVIDIA GeForce ... (WDDM) | Fan 53% | Temp 48C | Perf P2 | Pwr 47W / 170W | Memory 9001MiB / 12288MiB | GPU-Util 1% | Compute M. Default | MIG M. N/A

FBehrad commented 2 years ago

I have debugged my code and found that the grad_fn attribute of the output (output = forward_rear(input_feat)) is None. When I then use this prediction to compute the loss, the grad_fn of the loss is also None. Therefore, when I call loss.backward(), torch does not know how to compute the gradients, because the loss is detached from the computation graph.
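A toy sketch (dummy tensors and a stand-in module, not the actual refinement code) of how checking grad_fn locates where the graph is cut:

```python
import torch

def check(name, t):
    fn = type(t.grad_fn).__name__ if t.grad_fn is not None else None
    print(f"{name}: requires_grad={t.requires_grad}, grad_fn={fn}")

# Stand-ins for the refinement tensors (input_feat / forward_rear in the thread).
input_feat = torch.randn(1, 8, requires_grad=True)
forward_rear = torch.nn.Sequential(torch.nn.Linear(8, 1), torch.nn.Sigmoid())

check("input_feat", input_feat)   # requires_grad=True, grad_fn=None (leaf tensor)

output = forward_rear(input_feat)
check("output", output)           # grad_fn=SigmoidBackward0 when the graph is intact

with torch.no_grad():
    detached = forward_rear(input_feat)
check("detached", detached)       # grad_fn=None: the symptom described above
```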

ankuPRK commented 2 years ago

If I see correctly, your system has one GPU with 12 GB of memory. Unfortunately, our algorithm requires 24 GB of total GPU memory. You can still run it by replacing gpu_ids: 0,1 -> 0 and px_budget: 1800000 -> 900000 in the config file: https://github.com/saic-mdal/lama/blob/bd69ec300be277281dcbdcf53314f27ebb6e812c/configs/prediction/default.yaml#L17
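For reference, the relevant part of the refiner section in that config might then look roughly like this (only the two keys mentioned above are shown; everything else in the file stays as is):

```yaml
refiner:
  gpu_ids: 0          # was 0,1; run the refiner on the single available GPU
  px_budget: 900000   # was 1800000; halve the pixel budget to fit in ~12 GB
```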

But in that case, if it were a GPU issue, your system would have thrown RuntimeError: CUDA error: invalid device ordinal, so I'm guessing you have already fixed this part.

Assuming the above is taken care of, I tried running the refinement with PyTorch 1.12.0 and printed output_feat.grad_fn at this line: https://github.com/saic-mdal/lama/blob/bd69ec300be277281dcbdcf53314f27ebb6e812c/saicinpainting/evaluation/refinement.py#L144

It shows Sigmoid in my case:

(Pdb) output_feat.grad_fn
<SigmoidBackward0 object at 0x7faca6af4f40>

And the grad_fn of the loss at this line: https://github.com/saic-mdal/lama/blob/bd69ec300be277281dcbdcf53314f27ebb6e812c/saicinpainting/evaluation/refinement.py#L163

is Addition:

(Pdb) loss.grad_fn
<AddBackward0 object at 0x7fec3538a790>

FBehrad commented 2 years ago

Thank you so much for your answer. I found the problem: I was calling refine_predict(batch, self.model, refiner_config) inside torch.no_grad(), so the computation graph was never created.
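A generic PyTorch sketch (not the repo's code) of why an outer torch.no_grad() block prevents the graph from being built, and how torch.enable_grad() restores it if the surrounding code has to stay under no_grad():

```python
import torch

net = torch.nn.Linear(4, 1)
x = torch.randn(2, 4)

with torch.no_grad():
    y = net(x)
print(y.grad_fn)  # None: no graph was built, so a loss derived from y cannot backprop

with torch.no_grad():
    # Re-enable graph construction locally; this is what the refiner needs
    # in order to call loss.backward() on its feature tensors.
    with torch.enable_grad():
        y = net(x)
print(y.grad_fn)  # <AddmmBackward0 ...>: the graph exists, so backward() would work
```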

ankuPRK commented 2 years ago

Awesome :) I hope it is working normally now. If anything else comes up, please feel free to reopen this issue or open a new one.