Training on TuSimple fails once use_aux is enabled

Lrainhom commented 1 year ago

1. I get RuntimeError: Expected object of scalar type Long but got scalar type Float for argument #2 'target' in call to _thnn_nll_loss2d_forward. I think casting the seg_out label with long() fixes this. 2. After adding long(), I get RuntimeError: 1only batches of spatial targets supported (3D tensors) but got targets of size: : [32, 3, 320, 800]

cfzd commented 1 year ago

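@crossover10 The segmentation GT looks wrong. It should be single-channel, i.e. have shape [32, 320, 800], but the error reports [32, 3, 320, 800]. Perhaps a single channel was not selected when the label was read with cv2? Alternatively, you can keep just one channel with something like: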

seg_label = seg_label[:,0]
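
A minimal sketch of the two fixes discussed so far (taking one channel, then casting to long), assuming the label arrives as a 3-channel float image with the class id copied into every channel. The tensor sizes and variable names below are made up for illustration and are not taken from the repository:

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes, smaller than the [32, 3, 320, 800] reported in the error.
batch, num_classes, height, width = 2, 5, 8, 16

# Logits from a segmentation head: [N, C, H, W].
seg_out = torch.randn(batch, num_classes, height, width)

# Assumed label layout: a 3-channel float image with the class id in every channel.
class_ids = torch.randint(0, num_classes, (batch, height, width))
seg_label = class_ids[:, None].repeat(1, 3, 1, 1).float()  # [N, 3, H, W], float32

# Keep one channel and cast to int64, giving the [N, H, W] target cross_entropy expects.
target = seg_label[:, 0].long()

loss = F.cross_entropy(seg_out, target)
print(target.shape, target.dtype, loss.item())
```

If the three channels of the label do not hold identical values, picking channel 0 may not be the right choice, so it is worth checking how the label images were written.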

zqs1996 commented 1 year ago

Hi, did you manage to solve the RuntimeError: 1only batches of spatial targets supported (3D tensors) but got targets of size: : [32, 3, 320, 800] problem? I get the same error as you when training with aux enabled.

Lrainhom commented 1 year ago

I downloaded the paper and read it. aux looks more like a leftover from V1; V2 does not seem to need aux as auxiliary supervision.

umie0128 commented 1 year ago

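@cfzd But the training code reads its data with DALI. In common.py I changed res_dict['seg_label'] = data_label['seg_images'] to res_dict['seg_label'] = data_label['seg_images'].long()[:,0], but that raises another error, which looks like an output-dimension problem in SegHead() in model_culane.py. How should I fix it?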

umie0128 commented 1 year ago

@cfzd If I change the upsampling factor in SegHead(), GPU memory blows up. Any suggestions?

cfzd commented 1 year ago

@umie0128 We actually did not use segmentation on TuSimple, and we did not use it on CULane either.

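If you want to run segmentation experiments, you can shrink the target. The error shows 4x320x1600; downsampled by a factor of 8 it becomes 4x40x200, which exactly matches the size of the input to the loss.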
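
One way to shrink an integer label map by a factor of 8 is plain strided indexing, sketched below. This is only an illustration under assumed names and random data, not the repository's data pipeline, which uses DALI:

```python
import torch

# Sizes from the thread: 320x1600 labels, 40x200 segmentation output.
batch, height, width, stride = 4, 320, 1600, 8

# Integer class ids, [N, H, W]. Random values stand in for real labels.
seg_label = torch.randint(0, 5, (batch, height, width))

# Take every 8th pixel in both spatial dimensions. Unlike bilinear resizing,
# this never blends neighbouring class ids into meaningless values.
seg_label_small = seg_label[:, ::stride, ::stride]

print(seg_label_small.shape)  # torch.Size([4, 40, 200])
```

Strided sampling can drop lane markings that are only a few pixels wide, so the result may need a visual check before it is used for training.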

umie0128 commented 1 year ago

@cfzd I tried this today, running on CULane. In dali_data.py I changed seg_images = fn.crop_mirror_normalize(crop=(train_height , train_width) to seg_images = fn.crop_mirror_normalize(crop=(int(train_height / 8), int(train_width / 8)). The dimensions now match, but I run into another error. If in factory.py I change if cfg.use_aux: loss_dict['op'].append(CrossEntropyLoss(weight=torch(xxx).cuda()) to loss_dict['op'].append(CrossEntropyLoss(), a further error is triggered. That one comes from the metric defined in factory.py: metric_dict['op'].append(Metric_mIoU(5))

cfzd commented 1 year ago

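@umie0128 I noticed a few things:

1. You seem to have 8 lanes. Is that correct? CULane normally has only 4 lanes. (This should not actually matter, though: the extra channels of your segmentation map should simply never be activated.)
2. Is the number of classes used for the training mIoU set correctly? The default is 4 lanes (4 lanes + 1 background class = 5 segmentation classes): https://github.com/cfzd/Ultra-Fast-Lane-Detection-v2/blob/849fa7b90c189b646d12adc1d807bec54e982031/utils/factory.py#L74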

umie0128 commented 1 year ago

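@cfzd The dataset is indeed CULane, and TrainCollect in comm does take the if cfg.dataset == 'CULane' branch. The default value of Metric_mIoU is indeed 5. Changing it to 9 does not help either: it fails both with the default 5 and with 9.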

cfzd commented 1 year ago

@umie0128 I see, it is a very simple problem: the input needs an argmax before it is passed in, to remove the class dimension. Try adding predict = predict.argmax(1) at this location:

https://github.com/cfzd/Ultra-Fast-Lane-Detection-v2/blob/f666202d39aedcc624248e65dd9c0604b1c6ac8c/utils/metrics.py#L19-L20
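
To illustrate why the argmax is needed, here is a small stand-alone mIoU sketch. The function `mean_iou` is hypothetical and is not the repository's Metric_mIoU; it only shows that the metric compares class-id maps of shape [N, H, W], so logits of shape [N, C, H, W] have to be reduced first:

```python
import torch
import torch.nn.functional as F

def mean_iou(predict, target, num_classes):
    """Mean IoU over the classes that occur in predict or target.

    predict: [N, C, H, W] logits, or [N, H, W] class ids.
    target:  [N, H, W] integer class ids.
    """
    if predict.dim() == 4:
        predict = predict.argmax(1)  # drop the class dimension: [N, H, W]
    ious = []
    for cls in range(num_classes):
        pred_mask = predict == cls
        target_mask = target == cls
        union = (pred_mask | target_mask).sum().item()
        if union == 0:
            continue  # class absent from both maps, skip it
        intersection = (pred_mask & target_mask).sum().item()
        ious.append(intersection / union)
    return sum(ious) / len(ious)

num_classes = 5  # 4 lanes + 1 background, as in the thread
target = torch.randint(0, num_classes, (2, 40, 200))
# One-hot logits built from the target, so argmax recovers it exactly.
logits = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
print(mean_iou(logits, target, num_classes))
```

With 8 lanes, as mentioned above, `num_classes` would presumably be 9 to match the 9-channel segmentation output.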

umie0128 commented 1 year ago

@cfzd Thanks, it runs now. So far this involves three changes:

anhuidinglieqiaodaima commented 1 year ago

I made the changes you showed, but I still get the following error:
File "/root/miniconda3/envs/lane-det/lib/python3.7/site-packages/torch/nn/functional.py", line 3026, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: input and target batch or spatial sizes don't match: target [8, 320, 1600], input [8, 9, 40, 200]
I cannot work out why this happens.

Gievance commented 1 month ago

seg_out and seg_label have different sizes.
seg_out has a channel dimension, which needs to be removed with the argmax mentioned above.
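
For reference, PyTorch's cross_entropy takes logits of shape [N, C, H, W] together with a class-id target of shape [N, H, W]. The logits keep their class dimension for the loss, and the batch and spatial sizes must agree; the argmax suggested earlier in the thread was for the mIoU metric. The hypothetical helper below (not part of the repository) checks only those shape rules, using the shapes quoted in this thread:

```python
def check_seg_shapes(input_shape, target_shape):
    """Return a list of shape problems for cross-entropy on segmentation tensors.

    input_shape:  expected as (N, C, H, W), the raw logits.
    target_shape: expected as (N, H, W), integer class ids.
    """
    problems = []
    if len(input_shape) != 4:
        problems.append("input should be 4D (N, C, H, W)")
        return problems
    if len(target_shape) != 3:
        problems.append("target should be 3D (N, H, W); select a single channel")
        return problems
    n, _, h, w = input_shape
    if target_shape[0] != n:
        problems.append("batch sizes differ")
    if tuple(target_shape[1:]) != (h, w):
        problems.append(
            f"spatial sizes differ: input {h}x{w}, "
            f"target {target_shape[1]}x{target_shape[2]}"
        )
    return problems

# The input shape in the first call is an assumption; the thread only gives the target.
print(check_seg_shapes((32, 5, 320, 800), (32, 3, 320, 800)))
print(check_seg_shapes((8, 9, 40, 200), (8, 320, 1600)))
print(check_seg_shapes((8, 9, 40, 200), (8, 40, 200)))
```

The second call reproduces the mismatch in the traceback above: the target is still 320x1600 while the segmentation output is 40x200, so the label has to be reduced to 40x200 (or the output upsampled) before the loss is computed.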

cfzd / Ultra-Fast-Lane-Detection-v2

Training on TuSimple fails once use_aux is enabled #71