Training on IRS data runs normally at first, but the output becomes nan at epoch 9, and I don't know how to troubleshoot it

yangning6103 commented 1 month ago

I printed some logs during the forward pass:

    print(data["name"])
    print(data["left"])
    print(data["right"])
    with torch.cuda.amp.autocast(enabled=self.cfgs.OPTIMIZATION.AMP):
        model_pred = self.model(data)
        infer_timer = time.time()
        loss, tb_info = loss_func(model_pred, data)
    disp_pred = model_pred['disp_pred']
    print(disp_pred)
    print("loss",loss)

The input data looks fine, but the forward pass outputs nan, so the computed loss is nan as well:

    ['/IRSDataset/Store/ConvenienceStore_Day/l_566.png']
    tensor([[[[ 0.6392,  0.4166,  0.2624,  ...,  1.1700,  1.1872,  1.2214],
      [ 1.7865,  1.4098,  0.7762,  ...,  1.1700,  1.1700,  1.2214],
      [ 2.0092,  2.0263,  1.8893,  ...,  1.1529,  1.1529,  1.2214],
      ...,
      [-2.1179, -2.1179, -2.1179,  ...,  2.2489,  2.2489,  2.2489],
      [-2.1179, -2.1179, -2.1179,  ...,  2.2489,  2.2489,  2.2489],
      [-2.0665, -2.1179, -2.0665,  ...,  2.2489,  2.2489,  2.2489]],

     [[-0.0049, -0.2325, -0.3725,  ..., -2.0357, -1.9832, -1.8081],
      [ 1.2206,  0.8004,  0.1352,  ..., -2.0357, -1.9132, -1.7731],
      [ 1.4657,  1.4832,  1.3081,  ..., -2.0357, -1.9132, -1.6856],
      ...,
      [-2.0357, -2.0357, -2.0357,  ...,  2.4111,  2.4111,  2.4111],
      [-2.0357, -2.0357, -2.0357,  ...,  2.4111,  2.4111,  2.4111],
      [-1.9832, -2.0357, -2.0357,  ...,  2.4111,  2.4111,  2.4111]],

     [[-0.8981, -1.1247, -1.2641,  ..., -1.8044, -1.6302, -1.4210],
      [ 0.2696, -0.1138, -0.7587,  ..., -1.7522, -1.6302, -1.4036],
      [ 0.4962,  0.5136,  0.3568,  ..., -1.6824, -1.6302, -1.4036],
      ...,
      [-1.8044, -1.8044, -1.8044,  ...,  2.5877,  2.5877,  2.5877],
      [-1.8044, -1.8044, -1.8044,  ...,  2.5877,  2.5877,  2.5877],
      [-1.7522, -1.8044, -1.8044,  ...,  2.5877,  2.5877,  2.5877]]]],
   device='cuda:0')

    tensor([[[[ 0.2282,  0.2453,  0.2453,  ...,  1.3070,  1.3242,  1.3242],
      [ 0.2453,  0.2453,  0.2453,  ...,  1.3070,  1.3070,  1.3584],
      [ 0.2282,  0.2453,  0.2453,  ...,  1.3242,  1.3242,  1.3584],
      ...,
      [-2.1179, -2.1179, -2.1179,  ...,  2.2489,  2.2489,  2.2489],
      [-2.1179, -2.1179, -2.1179,  ...,  2.2489,  2.2489,  2.2489],
      [-2.0665, -2.1179, -2.0665,  ...,  2.2489,  2.2489,  2.2489]],

     [[-0.3375, -0.3375, -0.3200,  ..., -2.0357, -2.0357, -2.0357],
      [-0.3375, -0.3375, -0.3375,  ..., -2.0357, -2.0357, -2.0357],
      [-0.3725, -0.3375, -0.3375,  ..., -2.0357, -2.0357, -2.0357],
      ...,
      [-2.0357, -2.0357, -2.0357,  ...,  2.4111,  2.4111,  2.4111],
      [-2.0357, -2.0357, -2.0357,  ...,  2.4111,  2.4111,  2.4111],
      [-2.0357, -2.0357, -2.0357,  ...,  2.4111,  2.4111,  2.4111]],

     [[-0.9678, -0.9330, -0.9330,  ..., -1.8044, -1.7522, -1.8044],
      [-0.9504, -0.9504, -0.9504,  ..., -1.8044, -1.7522, -1.7522],
      [-1.0027, -0.9678, -0.9504,  ..., -1.7522, -1.7522, -1.6824],
      ...,
      [-1.8044, -1.8044, -1.8044,  ...,  2.5877,  2.5877,  2.5877],
      [-1.8044, -1.8044, -1.8044,  ...,  2.5877,  2.5877,  2.5877],
      [-1.8044, -1.8044, -1.8044,  ...,  2.6051,  2.5877,  2.5877]]]],
   device='cuda:0')

    tensor([nan, nan, nan, ..., nan, nan, nan], device='cuda:0', grad_fn=)
    tensor(nan, device='cuda:0', grad_fn=)
    2024-10-03 11:53:39,003 INFO Training Epoch: 9/50 Iter: 947/5661 Loss:nan(nan) LR:6.7625e-04 DataTime:0.12 InferTime:43.17ms Time cost: 06:46/33:37:36
    l/OpenStereo/./stereo/utils/common_utils.py:198: RuntimeWarning: invalid value encountered in cast
      pred_tmp = cm(pred_tmp.astype('uint8'))
    //OpenStereo/./stereo/utils/common_utils.py:199: RuntimeWarning: invalid value encountered in cast
      error_map_tmp = cm(error_map_tmp.astype('uint8'))

Is this a problem with the data? I don't yet know how to check the data. Could the left and right images be misaligned?
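
Since the inputs look finite while `disp_pred` is already nan, one way to narrow this down is to find the first submodule whose output stops being finite. The following is a minimal sketch using PyTorch forward hooks. The names `watch_non_finite`, `Sqrt`, and `toy` are hypothetical and not part of OpenStereo, and the toy model only stands in for the real network.

```python
import torch
import torch.nn as nn


def _tensors(obj):
    """Yield every tensor found in a tensor, tuple, list, or dict output."""
    if torch.is_tensor(obj):
        yield obj
    elif isinstance(obj, dict):
        for value in obj.values():
            yield from _tensors(value)
    elif isinstance(obj, (list, tuple)):
        for value in obj:
            yield from _tensors(value)


def watch_non_finite(model):
    """Hook every named submodule and record those that output NaN or Inf.

    Returns (bad, handles): `bad` fills up, in the order the submodules
    finish their forward pass, with the names of modules whose output
    contains a non-finite value; `handles` lets the caller remove the hooks.
    """
    bad, handles = [], []

    def make_hook(name):
        def hook(module, inputs, output):
            if any(not torch.isfinite(t).all() for t in _tensors(output)):
                bad.append(name)
        return hook

    for name, module in model.named_modules():
        if name:  # skip the root module, whose name is the empty string
            handles.append(module.register_forward_hook(make_hook(name)))
    return bad, handles


class Sqrt(nn.Module):
    """Toy layer (hypothetical) that yields NaN for negative inputs."""

    def forward(self, x):
        return torch.sqrt(x)


# Toy stand-in for the real model: the middle layer introduces the NaN.
toy = nn.Sequential(nn.Identity(), Sqrt(), nn.Identity())
bad, handles = watch_non_finite(toy)
toy(torch.tensor([-1.0, 4.0]))
for handle in handles:
    handle.remove()  # detach the hooks once the check is done
print(bad)  # bad[0] names the earliest module that produced NaN/Inf
```

In a real run, the hooks would be registered on the actual model for the failing batch only, since checking every output slows training. Running the same batch with autocast disabled is another cheap check, on the assumption (not confirmed in this thread) that half-precision overflow is involved.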

t973288913 commented 1 month ago

Same situation here: the model output is all 0. How should this be handled?

XiandaGuo commented 1 month ago

The cfg will be released soon.

Dongxin000 commented 2 weeks ago

I hit the same problem on my own dataset: the loss became nan at epoch 9.

yangning6103 commented 2 weeks ago

> Same situation here: the model output is all 0. How should this be handled?

As a temporary workaround, you can set LEFT_ATT to false in the config file.
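
The thread does not say where `LEFT_ATT` sits in the config, so the sketch below searches a nested config for the key instead of assuming a path. The helper `set_key_everywhere` and the example `cfg` layout are hypothetical; only the key name `LEFT_ATT` and `OPTIMIZATION.AMP` come from this thread.

```python
def set_key_everywhere(cfg, key, value):
    """Set every occurrence of `key` in a nested dict/list config.

    Returns the number of entries that were overwritten, so the caller
    can confirm the key was actually found.
    """
    changed = 0
    if isinstance(cfg, dict):
        for k in cfg:
            if k == key:
                cfg[k] = value
                changed += 1
            else:
                changed += set_key_everywhere(cfg[k], key, value)
    elif isinstance(cfg, list):
        for item in cfg:
            changed += set_key_everywhere(item, key, value)
    return changed


# Hypothetical layout, for illustration only; the real config may differ.
cfg = {
    "MODEL": {"NAME": "ExampleNet", "LEFT_ATT": True},
    "OPTIMIZATION": {"AMP": True},
}
n_changed = set_key_everywhere(cfg, "LEFT_ATT", False)
print(n_changed, cfg["MODEL"]["LEFT_ATT"])
```

Editing the key by hand in the YAML file achieves the same thing; the helper is only useful when the same override has to be applied to several configs.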

zjuPeco commented 2 weeks ago

Same situation here, with the IGEV config file and the SceneFlow+Fat datasets.

XiandaGuo / OpenStereo

Training on IRS data runs normally at first, but the output becomes nan at epoch 9, and I don't know how to troubleshoot it #141