Closed WZR8277 closed 11 months ago
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Was this issue ever resolved?
Turning AMP off let training run.
Hi everyone, I also ran into this issue with PyTorch 2.0. After debugging the code, I found a simple fix that does not require downgrading PyTorch: just set contiguous=False at line 380 of defaults.py:
return build_optimizer(cfg, model)
# convert to
return build_optimizer(cfg, model, contiguous=False)
This skips converting the network's parameters to ContiguousParams. While debugging I found that with ContiguousParams, the computed gradients are None after loss.backward(), which leads to the training error. The fix also works when training with AMP off. With contiguous=False set, the network trains smoothly and may even reach slightly higher precision :)
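To see why all-None gradients end in this exact assertion, here is a plain-Python sketch (no torch required). FakeParam, unscale_, and scaler_step are made-up names that only mimic the relevant behavior of GradScaler, which records an inf check only for parameters that actually carry a gradient:

```python
class FakeParam:
    """Stands in for a torch.nn.Parameter: just holds a .grad slot."""
    def __init__(self, grad=None):
        self.grad = grad

def unscale_(params):
    # Mimics GradScaler.unscale_: an inf check is recorded only for
    # parameters whose .grad is not None.
    found_inf_per_device = {}
    for i, p in enumerate(params):
        if p.grad is not None:
            found_inf_per_device[f"param{i}"] = False
    return found_inf_per_device

def scaler_step(params):
    optimizer_state = {"found_inf_per_device": unscale_(params)}
    # The exact assertion from the traceback in this thread:
    assert len(optimizer_state["found_inf_per_device"]) > 0, \
        "No inf checks were recorded for this optimizer."
    return "stepped"

# Normal case: gradients exist, the step goes through.
print(scaler_step([FakeParam(grad=0.1), FakeParam(grad=-0.2)]))

# ContiguousParams case: backward() wrote the gradients into the contiguous
# buffer, so every parameter the optimizer sees has grad=None.
try:
    scaler_step([FakeParam(), FakeParam()])
except AssertionError as e:
    print(e)
```

The sketch suggests the bug is not in GradScaler itself: the scaler simply never sees any gradients to check, because they live on the contiguous copy rather than on the parameters registered with the optimizer.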
Hi author, when training on my own dataset (74 classes, 624 images in the training set) I hit this error: assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer.". I have not found a workable fix online (while debugging, losses=70 looks like a normal value, adding torch.nan_to_num before the loss has no effect, and the same error occurs both with and without a pretrained model). Could you share any ideas for tracking this down? My dataset config is:

_BASE_: ../Base-SBS.yml

INPUT:
  SIZE_TRAIN: [256, 256]
  SIZE_TEST: [256, 256]

MODEL:
  BACKBONE:
    WITH_IBN: True
    WITH_NL: True

SOLVER:
  OPT: SGD
  BASE_LR: 0.0001
  ETA_MIN_LR: 7.7e-5
  IMS_PER_BATCH: 64
  MAX_EPOCH: 30
  WARMUP_ITERS: 1000
  FREEZE_ITERS: 1000
  BIAS_LR_FACTOR: 1.0
  HEADS_LR_FACTOR: 1.0
  WEIGHT_DECAY_BIAS: 0.0005
  CHECKPOINT_PERIOD: 10
  MOMENTUM: 0.9
  NESTEROV: False

DATASETS:
  NAMES: ("ShipReid",)
  TESTS: ("ShipReid",)

DATALOADER:
  SAMPLER_TRAIN: BalancedIdentitySampler

TEST:
  EVAL_PERIOD: 10
  IMS_PER_BATCH: 256

OUTPUT_DIR: logs/ShipReid/sbs_R50-ibn
The full error log is as follows:

Traceback (most recent call last):
  File "tools/train_net.py", line 51, in <module>
    launch(
  File "./fastreid/engine/launch.py", line 71, in launch
    main_func(*args)
  File "tools/train_net.py", line 45, in main
    return trainer.train()
  File "./fastreid/engine/defaults.py", line 348, in train
    super().train(self.start_epoch, self.max_epoch, self.iters_per_epoch)
  File "./fastreid/engine/train_loop.py", line 145, in train
    self.run_step()
  File "./fastreid/engine/defaults.py", line 357, in run_step
    self._trainer.run_step()
  File "./fastreid/engine/train_loop.py", line 351, in run_step
    self.grad_scaler.step(self.optimizer)
  File "/home/wangzhaorong/.local/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 372, in step
    assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
AssertionError: No inf checks were recorded for this optimizer.