JDAI-CV / fast-reid

SOTA Re-identification Methods and Toolbox
Apache License 2.0

No inf checks were recorded for this optimizer. #700

Closed tiamo405 closed 1 year ago

tiamo405 commented 1 year ago

In my config file Base-SBS.yml:

SOLVER:
  AMP:
    ENABLED: True
  OPT: Adam

Error:

 File "/mnt/nvme0n1/miniconda3/envs/fast-reid/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py", line 368, in step
    assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
AssertionError: No inf checks were recorded for this optimizer.

Command run:

    CUDA_VISIBLE_DEVICES=0 python tools/train_net.py \
        --config-file ./configs/Market1501/sbs_R50.yml \
        MODEL.DEVICE "cuda:0"

Can someone help me? Thanks.

anaatef9 commented 1 year ago

Hey, I am facing the same issue when running the getting-started code. Did you figure out the problem?

Update: I just removed MODEL.DEVICE "cuda:0" from the command line.

Run this instead: python3 tools/train_net.py --config-file ./configs/Market1501/bagtricks_R50.yml

pierrez99 commented 1 year ago

Hi. I was facing the same issue today on the Market1501 dataset, and in my case it seemed to be related to self.optimizer.zero_grad() (found in fastreid/engine/train_loop.py). After changing this to self.optimizer.zero_grad(set_to_none=False) I started getting a different error, this time related to my GPU.

The problem with my GPU was that it could not allocate enough memory. I managed to work around this by passing a dtype to the autocast in the same file, i.e. with autocast(dtype=torch.float16):. I also played around with IMS_PER_BATCH in Base-bagtricks.yml and had to reduce the batch size under both the SOLVER and TEST sections. Finally, I added torch.cuda.empty_cache() right after self.grad_scaler.update(), though I don't think it is strictly necessary; emptying the cache made training run reliably, but it also increased training time.

I am quite new to this, so I am not sure this is the best approach, but I hope it helps.
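For reference, here is a minimal sketch of an AMP training step with those tweaks applied, written against plain PyTorch rather than fastreid's actual run_step (the linear model, optimizer, and random data are placeholders just to make it runnable on a CUDA machine):

    import torch
    from torch.cuda.amp import GradScaler, autocast

    # Placeholder model/optimizer/data; not fastreid code.
    model = torch.nn.Linear(128, 10).cuda()
    optimizer = torch.optim.Adam(model.parameters(), lr=3.5e-4)
    criterion = torch.nn.CrossEntropyLoss()
    scaler = GradScaler()

    for _ in range(10):
        images = torch.randn(16, 128, device="cuda")
        targets = torch.randint(0, 10, (16,), device="cuda")

        # Keep gradients as zero-filled tensors instead of None.
        optimizer.zero_grad(set_to_none=False)

        # Explicit fp16 autocast to cut activation memory.
        with autocast(dtype=torch.float16):
            loss = criterion(model(images), targets)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        # Optional: release cached blocks; can help with OOM but slows training.
        torch.cuda.empty_cache()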

tiamo405 commented 1 year ago

> (quoting pierrez99's and anaatef9's replies above)

My workaround was to downgrade the PyTorch version:

    pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113

After that the program works normally.
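If you try this pin, you can confirm which build ends up active with a standard PyTorch one-liner (not something from this thread, just a quick sanity check):

    python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"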

michal-kierzynka commented 1 year ago

In my case, adding the set_to_none parameter to optimizer.zero_grad fixed the problem: self.optimizer.zero_grad(set_to_none=False).

huiyiygy commented 9 months ago

Hi everyone, I also encountered this issue on PyTorch 2.0, and after debugging the code I found a simple solution that does not require downgrading PyTorch: just set 'continuous=False' on line 380 of defaults.py.

    return build_optimizer(cfg, model)
    # change to
    return build_optimizer(cfg, model, continuous=False)

In this way, the network's params are not converted to ContiguousParams.

I debugged the code and found that when using ContiguousParams, the computed gradients are None after loss.backward(), which leads to the training error.

After setting 'continuous=False', the network trains smoothly and may even achieve higher accuracy :)
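For anyone curious why the assertion fires at all, here is a tiny standalone repro (plain PyTorch, not fastreid code, needs a CUDA device): when every parameter's .grad is None, GradScaler.step() collects an empty found_inf_per_device and trips exactly this assertion.

    import torch

    model = torch.nn.Linear(4, 4).cuda()
    optimizer = torch.optim.Adam(model.parameters())
    scaler = torch.cuda.amp.GradScaler()

    # Initialize the scale factor, but never call backward(),
    # so every parameter's .grad stays None.
    scaler.scale(torch.zeros((), device="cuda"))

    try:
        scaler.step(optimizer)
    except AssertionError as e:
        print(e)  # "No inf checks were recorded for this optimizer."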

jveitchmichaelis commented 8 months ago

@huiyiygy Thanks for this! Minor typo: it should be contiguous=False, not continuous. Otherwise this seems to work. While you can also train with AMP off, I get pretty terrible loss curves that way; with this set, the loss immediately decreases as expected.