Open yangning6103 opened 1 month ago
I printed some logs during the forward pass:

    print(data["name"])
    print(data["left"])
    print(data["right"])
    with torch.cuda.amp.autocast(enabled=self.cfgs.OPTIMIZATION.AMP):
        model_pred = self.model(data)
        infer_timer = time.time()
        loss, tb_info = loss_func(model_pred, data)
    disp_pred = model_pred['disp_pred']
    print(disp_pred)
    print("loss", loss)

The input data looks fine, but the forward pass outputs NaN, so the computed loss is NaN as well.

    ['/IRSDataset/Store/ConvenienceStore_Day/l_566.png']
    tensor([[[[ 0.6392,  0.4166,  0.2624,  ...,  1.1700,  1.1872,  1.2214],
              [ 1.7865,  1.4098,  0.7762,  ...,  1.1700,  1.1700,  1.2214],
              [ 2.0092,  2.0263,  1.8893,  ...,  1.1529,  1.1529,  1.2214],
              ...,
              [-2.1179, -2.1179, -2.1179,  ...,  2.2489,  2.2489,  2.2489],
              [-2.1179, -2.1179, -2.1179,  ...,  2.2489,  2.2489,  2.2489],
              [-2.0665, -2.1179, -2.0665,  ...,  2.2489,  2.2489,  2.2489]],

             [[-0.0049, -0.2325, -0.3725,  ..., -2.0357, -1.9832, -1.8081],
              [ 1.2206,  0.8004,  0.1352,  ..., -2.0357, -1.9132, -1.7731],
              [ 1.4657,  1.4832,  1.3081,  ..., -2.0357, -1.9132, -1.6856],
              ...,
              [-2.0357, -2.0357, -2.0357,  ...,  2.4111,  2.4111,  2.4111],
              [-2.0357, -2.0357, -2.0357,  ...,  2.4111,  2.4111,  2.4111],
              [-1.9832, -2.0357, -2.0357,  ...,  2.4111,  2.4111,  2.4111]],

             [[-0.8981, -1.1247, -1.2641,  ..., -1.8044, -1.6302, -1.4210],
              [ 0.2696, -0.1138, -0.7587,  ..., -1.7522, -1.6302, -1.4036],
              [ 0.4962,  0.5136,  0.3568,  ..., -1.6824, -1.6302, -1.4036],
              ...,
              [-1.8044, -1.8044, -1.8044,  ...,  2.5877,  2.5877,  2.5877],
              [-1.8044, -1.8044, -1.8044,  ...,  2.5877,  2.5877,  2.5877],
              [-1.7522, -1.8044, -1.8044,  ...,  2.5877,  2.5877,  2.5877]]]], device='cuda:0')
    tensor([[[[ 0.2282,  0.2453,  0.2453,  ...,  1.3070,  1.3242,  1.3242],
              [ 0.2453,  0.2453,  0.2453,  ...,  1.3070,  1.3070,  1.3584],
              [ 0.2282,  0.2453,  0.2453,  ...,  1.3242,  1.3242,  1.3584],
              ...,
              [-2.1179, -2.1179, -2.1179,  ...,  2.2489,  2.2489,  2.2489],
              [-2.1179, -2.1179, -2.1179,  ...,  2.2489,  2.2489,  2.2489],
              [-2.0665, -2.1179, -2.0665,  ...,  2.2489,  2.2489,  2.2489]],

             [[-0.3375, -0.3375, -0.3200,  ..., -2.0357, -2.0357, -2.0357],
              [-0.3375, -0.3375, -0.3375,  ..., -2.0357, -2.0357, -2.0357],
              [-0.3725, -0.3375, -0.3375,  ..., -2.0357, -2.0357, -2.0357],
              ...,
              [-2.0357, -2.0357, -2.0357,  ...,  2.4111,  2.4111,  2.4111],
              [-2.0357, -2.0357, -2.0357,  ...,  2.4111,  2.4111,  2.4111],
              [-2.0357, -2.0357, -2.0357,  ...,  2.4111,  2.4111,  2.4111]],

             [[-0.9678, -0.9330, -0.9330,  ..., -1.8044, -1.7522, -1.8044],
              [-0.9504, -0.9504, -0.9504,  ..., -1.8044, -1.7522, -1.7522],
              [-1.0027, -0.9678, -0.9504,  ..., -1.7522, -1.7522, -1.6824],
              ...,
              [-1.8044, -1.8044, -1.8044,  ...,  2.5877,  2.5877,  2.5877],
              [-1.8044, -1.8044, -1.8044,  ...,  2.5877,  2.5877,  2.5877],
              [-1.8044, -1.8044, -1.8044,  ...,  2.6051,  2.5877,  2.5877]]]], device='cuda:0')
    tensor([nan, nan, nan, ..., nan, nan, nan], device='cuda:0', grad_fn=)
    tensor(nan, device='cuda:0', grad_fn=)
    2024-10-03 11:53:39,003 INFO Training Epoch: 9/50 Iter: 947/5661 Loss:nan(nan) LR:6.7625e-04 DataTime:0.12 InferTime:43.17ms Time cost: 06:46/33:37:36
    l/OpenStereo/./stereo/utils/common_utils.py:198: RuntimeWarning: invalid value encountered in cast
      pred_tmp = cm(pred_tmp.astype('uint8'))
    //OpenStereo/./stereo/utils/common_utils.py:199: RuntimeWarning: invalid value encountered in cast
      error_map_tmp = cm(error_map_tmp.astype('uint8'))

Is this a data problem? I don't yet know how to check the data. Could the left and right images be misaligned?
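One way to narrow this down is to check every tensor in the batch for NaN/Inf (not just the printed corners of the images), and then find the first layer whose output becomes non-finite. The sketch below is only an illustration and is not part of OpenStereo: the helper names `find_nonfinite_inputs` and `attach_nan_hooks` are hypothetical, and it assumes the batch is a dict of tensors, as `data["left"]` and `data["right"]` above suggest. If the inputs and labels are all finite, the NaN originates inside the network; with AMP enabled, float16 overflow is one possible cause, though nothing in the log confirms it.

```python
import torch
import torch.nn as nn


def is_finite_tensor(t):
    """Return True when every element of the tensor is finite (no NaN, no Inf)."""
    return bool(torch.isfinite(t).all())


def find_nonfinite_inputs(batch):
    """Return the keys of a batch dict whose tensor values contain NaN or Inf.

    Non-tensor entries (for example the list of file names) are ignored.
    """
    return [key for key, value in batch.items()
            if torch.is_tensor(value) and not is_finite_tensor(value)]


def _iter_tensors(output):
    """Yield the tensors found in a module output (tensor, tuple/list, or dict)."""
    if torch.is_tensor(output):
        yield output
    elif isinstance(output, dict):
        for value in output.values():
            if torch.is_tensor(value):
                yield value
    elif isinstance(output, (tuple, list)):
        for value in output:
            if torch.is_tensor(value):
                yield value


def attach_nan_hooks(model, report):
    """Register forward hooks on every submodule of `model`.

    Each hook appends the submodule's name to `report` when its output
    contains NaN or Inf. Hooks fire as each module finishes its forward
    call, so report[0] is the first module that produced a bad output.
    Returns the hook handles so they can be removed afterwards.
    """
    handles = []
    for name, module in model.named_modules():
        def hook(mod, inputs, output, name=name):
            if any(not is_finite_tensor(t) for t in _iter_tensors(output)):
                report.append(name)
        handles.append(module.register_forward_hook(hook))
    return handles
```

In the training loop this could be used by calling `find_nonfinite_inputs(data)` before the forward pass, then `report = []; handles = attach_nan_hooks(self.model, report)` once, and inspecting `report[0]` after the first iteration that yields NaN. Call `handle.remove()` on each handle when done, since the checks slow training down.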
Same situation here: the model output is all zeros. How should I handle this?
The cfg will be released soon.
I hit the same problem on my own dataset: the loss became NaN in epoch 9.
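Until the root cause is found, one common stopgap is to skip any batch whose loss is non-finite, so that a single bad iteration does not corrupt the weights. The sketch below is a generic PyTorch pattern, not OpenStereo's trainer code. The function name `guarded_step` and the `max_norm` default are assumptions made for illustration, and it leaves out the `GradScaler` handling that an AMP training loop would normally include.

```python
import torch
import torch.nn as nn


def guarded_step(model, optimizer, loss, max_norm=1.0):
    """Apply backward + optimizer step only when the loss is finite.

    Returns True if the step was applied and False if the batch was skipped.
    Gradients are clipped to `max_norm` (an illustrative default) before the step.
    """
    optimizer.zero_grad()
    if not torch.isfinite(loss):
        # Skip this batch: a NaN/Inf loss would write NaN into the weights.
        return False
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return True
```

This only contains the damage. If the skipped batches are frequent or always involve the same samples, the data or the offending layer still needs to be investigated.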
As a temporary workaround, you can set LEFT_ATT to false in the config file.
Same situation with the IGEV config file, on the SceneFlow+Fat dataset.