cfzd / Ultra-Fast-Lane-Detection-v2

Ultra Fast Deep Lane Detection With Hybrid Anchor Driven Ordinal Classification (TPAMI 2022)
MIT License

Training on TuSimple fails after enabling use_aux #71

Open Lrainhom opened 1 year ago

Lrainhom commented 1 year ago

1. I get RuntimeError: Expected object of scalar type Long but got scalar type Float for argument #2 'target' in call to _thnn_nll_loss2d_forward. I think adding long() to the label used with seg_out fixes this.
2. After adding long(), I get RuntimeError: 1only batches of spatial targets supported (3D tensors) but got targets of size: : [32, 3, 320, 800]

cfzd commented 1 year ago

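@crossover10 This looks like a problem with the segmentation GT. The segmentation GT should have no channel dimension, i.e. its shape should be 32, 320, 800, but the error above reports a size of 32, 3, 320, 800. Was the label read with cv2 without selecting a single channel? Alternatively, you can keep just one channel with something like the following: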

seg_label = seg_label[:,0]
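
A minimal sketch of the two fixes discussed so far (selecting one channel and casting to long), assuming PyTorch and a label whose three channels are identical copies of the same mask. The tensor sizes, the 5 classes, and the name `seg_target` are illustrative assumptions, not taken from the repository:

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes; kept small so the example runs quickly.
batch, num_classes, height, width = 2, 5, 8, 20

# Segmentation head output: one logit per class and pixel -> [N, C, H, W].
seg_out = torch.randn(batch, num_classes, height, width)

# Label as it might arrive from the loader: float, with the mask
# replicated over 3 channels -> [N, 3, H, W].
mask = torch.randint(0, num_classes, (batch, 1, height, width))
seg_label = mask.repeat(1, 3, 1, 1).float()

# Keep a single channel and cast to long: cross_entropy expects a
# class-index target of shape [N, H, W] with dtype torch.long.
seg_target = seg_label[:, 0].long()

loss = F.cross_entropy(seg_out, seg_target)
print(seg_target.shape, seg_target.dtype, loss.item())
```

Note that the logits keep their class dimension here; only the target is reduced to three dimensions.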

zqs1996 commented 1 year ago

Hello, did you manage to solve the RuntimeError: 1only batches of spatial targets supported (3D tensors) but got targets of size: : [32, 3, 320, 800] problem? I get the same error as you when training with aux enabled.

Lrainhom commented 1 year ago

I downloaded and read the paper. aux looks more like a leftover from V1; V2 does not seem to need aux as an auxiliary branch.

umie0128 commented 1 year ago

@cfzd But the training code reads data with DALI. In common.py I changed:
res_dict['seg_label'] = data_label['seg_images']
to:
res_dict['seg_label'] = data_label['seg_images'].long()[:,0]
but then it reports: [image]
This is an output-dimension problem in SegHead() of model_culane.py. How should I fix it?

umie0128 commented 1 year ago

@cfzd If I change the upsampling factor of SegHead(), GPU memory blows up. Any suggestions?

cfzd commented 1 year ago

@umie0128 We actually did not use segmentation on TuSimple, and we did not use it on CULane either.

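If you want to run segmentation experiments, you can shrink the target. The size in the error is 4x320x1600; downsampled by a factor of 8 it becomes 4x40x200, which matches the input size exactly.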
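
The downsampling suggested above can be sketched as follows, assuming PyTorch and an [N, H, W] class-index label. `downsample_label` is a hypothetical helper, the 5 classes are an assumption, and nearest-neighbor sampling is one reasonable choice for label maps rather than something the thread prescribes:

```python
import torch
import torch.nn.functional as F

def downsample_label(seg_label: torch.Tensor, factor: int = 8) -> torch.Tensor:
    """Shrink an [N, H, W] class-index map by `factor` with nearest-neighbor sampling."""
    _, height, width = seg_label.shape
    # interpolate expects a floating-point [N, C, H, W] tensor,
    # so add a channel dimension and cast to float first.
    small = F.interpolate(
        seg_label[:, None].float(),
        size=(height // factor, width // factor),
        mode="nearest",
    )
    # Nearest-neighbor only copies existing values, so the cast back is exact.
    return small[:, 0].long()

# Shapes from the thread: target 4x320x1600, segmentation output 4x C x40x200.
seg_label = torch.randint(0, 5, (4, 320, 1600))
seg_out = torch.randn(4, 5, 40, 200)

small_label = downsample_label(seg_label, factor=8)
loss = F.cross_entropy(seg_out, small_label)
print(small_label.shape, loss.item())
```

Shrinking the target instead of enlarging the segmentation output avoids the memory growth reported above when the upsampling factor of SegHead() is increased.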

umie0128 commented 1 year ago

@cfzd I tried this today. I am running CULane. In dali_data.py I changed:
seg_images = fn.crop_mirror_normalize(crop=(train_height , train_width)
to:
seg_images = fn.crop_mirror_normalize(crop=(int(train_height / 8), int(train_width / 8))
The dimensions now match, but I then run into: [image]
If, in factory.py, I change
if cfg.use_aux: loss_dict['op'].append(CrossEntropyLoss(weight=torch(xxx).cuda())
to
loss_dict['op'].append(CrossEntropyLoss()
it triggers: [image]
This one is defined in factory.py:
metric_dict['op'].append(Metric_mIoU(5))

cfzd commented 1 year ago

@umie0128 I noticed the following issues:

umie0128 commented 1 year ago

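@cfzd The dataset is indeed CULane. The TrainCollect in comm is indeed the one under if cfg.dataset == 'CULane'. The default value of Metric_mIoU is indeed 5; changing it to 9 does not work either.
Default value 5: [image]
Default value 9: [image]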

cfzd commented 1 year ago

@umie0128 I see, it is a very simple problem: the input needs an argmax applied before it is passed in, to remove the class dimension. Try adding predict = predict.argmax(1) at this location:

https://github.com/cfzd/Ultra-Fast-Lane-Detection-v2/blob/f666202d39aedcc624248e65dd9c0604b1c6ac8c/utils/metrics.py#L19-L20
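
To illustrate why the argmax is needed, here is a minimal mIoU sketch, assuming PyTorch. `mean_iou` is a hypothetical helper written for this example and is not the repository's Metric_mIoU; the 5 classes and tensor sizes are also assumptions:

```python
import torch
import torch.nn.functional as F

def mean_iou(predict: torch.Tensor, target: torch.Tensor, num_classes: int) -> float:
    """predict: [N, C, H, W] logits; target: [N, H, W] class indices."""
    # Collapse the class dimension so predict and target share one shape.
    predict = predict.argmax(1)  # [N, C, H, W] -> [N, H, W]
    ious = []
    for cls in range(num_classes):
        pred_mask = predict == cls
        target_mask = target == cls
        union = (pred_mask | target_mask).sum().item()
        if union == 0:
            continue  # class absent from both prediction and target
        intersection = (pred_mask & target_mask).sum().item()
        ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0

target = torch.randint(0, 5, (2, 40, 200))
# One-hot "logits" whose argmax reproduces the target exactly.
perfect_logits = F.one_hot(target, 5).permute(0, 3, 1, 2).float()
print(mean_iou(perfect_logits, target, 5))
```

The argmax belongs to the metric only; the cross-entropy loss still takes the raw [N, C, H, W] logits.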

umie0128 commented 1 year ago

@cfzd Thanks, it runs now. So far it took three changes: [image] [image] [image]

anhuidinglieqiaodaima commented 1 year ago

I made the changes you showed, but I still get the following error:

File "/root/miniconda3/envs/lane-det/lib/python3.7/site-packages/torch/nn/functional.py", line 3026, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: input and target batch or spatial sizes don't match: target [8, 320, 1600], input [8, 9, 40, 200]

I cannot work out what is causing it.

Gievance commented 1 month ago

  1. seg_out and seg_label have different sizes.
  2. seg_out has a channel dimension, which needs to be removed with the argmax mentioned above.