JDAI-CV / fast-reid

SOTA Re-identification Methods and Toolbox
Apache License 2.0

assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer." #704

Closed WZR8277 closed 11 months ago

WZR8277 commented 1 year ago

Hi, author. When training on my own dataset (74 classes and 624 images in the training set) I hit this error: assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer.". I haven't found a workable fix online yet (while debugging, losses=70 looked like a normal value; adding torch.nan_to_num before the loss didn't help; and the same error occurs both with and without a pretrained model). Could you suggest how to approach this problem? My dataset config is:

_BASE_: ../Base-SBS.yml

INPUT:
  SIZE_TRAIN: [256, 256]
  SIZE_TEST: [256, 256]

MODEL:
  BACKBONE:
    WITH_IBN: True
    WITH_NL: True

SOLVER:
  OPT: SGD
  BASE_LR: 0.0001
  ETA_MIN_LR: 7.7e-5

  IMS_PER_BATCH: 64
  MAX_EPOCH: 30
  WARMUP_ITERS: 1000
  FREEZE_ITERS: 1000
  BIAS_LR_FACTOR: 1.0
  HEADS_LR_FACTOR: 1.0
  WEIGHT_DECAY_BIAS: 0.0005
  CHECKPOINT_PERIOD: 10
  MOMENTUM: 0.9
  NESTEROV: False

DATASETS:
  NAMES: ("ShipReid",)
  TESTS: ("ShipReid",)

DATALOADER:
  SAMPLER_TRAIN: BalancedIdentitySampler

TEST:
  EVAL_PERIOD: 10
  IMS_PER_BATCH: 256

OUTPUT_DIR: logs/ShipReid/sbs_R50-ibn

The full traceback is:

Traceback (most recent call last):
  File "tools/train_net.py", line 51, in <module>
    launch(
  File "./fastreid/engine/launch.py", line 71, in launch
    main_func(*args)
  File "tools/train_net.py", line 45, in main
    return trainer.train()
  File "./fastreid/engine/defaults.py", line 348, in train
    super().train(self.start_epoch, self.max_epoch, self.iters_per_epoch)
  File "./fastreid/engine/train_loop.py", line 145, in train
    self.run_step()
  File "./fastreid/engine/defaults.py", line 357, in run_step
    self._trainer.run_step()
  File "./fastreid/engine/train_loop.py", line 351, in run_step
    self.grad_scaler.step(self.optimizer)
  File "/home/wangzhaorong/.local/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 372, in step
    assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
AssertionError: No inf checks were recorded for this optimizer.
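
For context, the assertion is raised by PyTorch's torch.cuda.amp.GradScaler, not by fast-reid itself: GradScaler.step() refuses to proceed when none of the optimizer's parameters carries a gradient at step time, because there is nothing for it to inf/NaN-check. A minimal sketch of that trigger condition, independent of fast-reid (needs a CUDA device):

import torch

p = torch.nn.Parameter(torch.randn(4, device="cuda"))
opt = torch.optim.SGD([p], lr=0.1)
scaler = torch.cuda.amp.GradScaler()

loss = (p * 2).sum()
scaler.scale(loss)   # loss is scaled, but .backward() is never called,
                     # so p.grad stays None
scaler.step(opt)     # AssertionError: No inf checks were recorded for this optimizer.

In fast-reid terms this means that, at the moment self.grad_scaler.step(self.optimizer) runs in train_loop.py, every parameter held by the optimizer has grad equal to None.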

github-actions[bot] commented 12 months ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 11 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

Jackshi1997 commented 9 months ago

Did you ever solve this problem in the end?

WZR8277 commented 9 months ago

> Did you ever solve this problem in the end?

Training works once AMP is turned off.
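
For anyone else looking for where that switch lives: a minimal sketch, assuming your fast-reid version exposes the AMP toggle as SOLVER.AMP.ENABLED (check fastreid/config/defaults.py in your checkout for the exact key) and with the config path below replaced by your own:

from fastreid.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file("configs/ShipReid/sbs_R50-ibn.yml")  # hypothetical path to the config above
cfg.SOLVER.AMP.ENABLED = False  # assumption: the AMP on/off switch in this fast-reid version

The same override can usually be passed as a key-value pair on the tools/train_net.py command line if your version accepts opts.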

huiyiygy commented 8 months ago

Hi everyone, I also encountered this issue with PyTorch 2.0. After debugging the code, I found a simple fix that does not require downgrading PyTorch: just set contiguous=False at line 380 of defaults.py:

return build_optimizer(cfg, model)
# convert to
return build_optimizer(cfg, model, contiguous=False)

This avoids converting the network's parameters to ContiguousParams.

While debugging I found that with ContiguousParams, the computed gradients are None after loss.backward(), which leads to the training error.

This also works when training with AMP turned off.

After setting contiguous=False, the network trains smoothly and may even achieve higher precision :)
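
One plausible reading of why the gradients come back as None (a sketch of the failure mode under my own assumptions, not fast-reid's actual ContiguousParams code): with a contiguous-params scheme the optimizer is handed a flat parameter whose .grad is a pre-allocated buffer, while the model's real parameters are supposed to write into views of that buffer. PyTorch 2.0 changed optimizer.zero_grad() to default to set_to_none=True, which drops that pre-assigned .grad, so the tensor the optimizer (and GradScaler) actually inspects never sees a gradient again:

import torch

flat = torch.nn.Parameter(torch.randn(4))  # the flat parameter the optimizer is given
flat.grad = torch.zeros(4)                 # pre-allocated flat gradient buffer

opt = torch.optim.SGD([flat], lr=0.1)
opt.zero_grad(set_to_none=True)            # PyTorch 2.0 default: flat.grad is now None

# backward() writes its gradient onto the model-side parameter instead
model_param = torch.nn.Parameter(flat.detach().clone())  # stand-in for a model weight
(model_param * 2).sum().backward()

print(flat.grad)  # None -> GradScaler.step() finds no inf checks to record

Passing contiguous=False sidesteps that scheme entirely, which is consistent with the fix above.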