dotchen / LAV

(CVPR 2022) A minimalist, mapless, end-to-end self-driving stack for joint perception, prediction, planning and control.
https://dotchen.github.io/LAV/
Apache License 2.0

Question about loss in the training phase #9

Closed Kin-Zhang closed 2 years ago

Kin-Zhang commented 2 years ago

Question about the training phase:

  1. Some folders in the whole dataset are missing their data.mdb files; the affected folders in the provided dataset are shown here:

    [image: dataset folders with missing data.mdb files]

  2. At the End-to-end Training phase: I downloaded the whole dataset, but some of the losses look weird. Is this correct, or is it normal for these losses?

    [image: training loss curves]

    The training details follow the default config.yaml, except that all phases are trained for 100 epochs (160 for bev) with suitable batch sizes. The evaluation on the online leaderboard is really poor: I only finished the first five routes before stopping, and the result file already shows serious collisions. Is there a step I am missing that prevents reproducing the reported results? Losing only ~1% of the data should not affect the trained model that much.

Thanks again for your work!

dotchen commented 2 years ago
  1. Thanks for reporting this. There must have been some glitch when I mirrored them to Box. Is this from the raw dataset or the bundled compressed files?
  2. The loss trend looks correct. Attached is the loss trend on my end for the last phase. The perception backbone and heads are already trained in phase 1, so their losses will not decrease much; they mainly serve as a regularization term. Regarding evaluation, though, I need some more details. Could you tell me: 1. what routes you evaluated on, 2. what batch sizes you trained each phase with, and 3. as a sanity check, how the released checkpoint performs in your setup?

[image: loss trend for the last training phase]

dotchen commented 2 years ago

Also, I would stick to the default number of epochs. I did not try that many epochs, so I do not know what the performance would look like. Although there is some data augmentation, the model might start to overfit with 100 epochs.

Kin-Zhang commented 2 years ago

Since I had problems with the bundled compressed files (my extraction process always got stuck), I downloaded the raw dataset from the folders instead.


  1. I just used the online leaderboard to evaluate, so the route file is private. I only ran the first five routes and then killed the job to check the results, since that seemed enough and the whole process takes a long time.

  2. The batch size and epoch table is below. Batch sizes are given as batch_size * gpu_num, since I rewrote DP to DDP to speed up training (see the sketch after this list).

    | phase | batch size | epoch | default epoch* |
    | --- | --- | --- | --- |
    | train_bev | 512 | 160 | 160 |
    | train_bra | 64*2 | 100 | 10 |
    | train_seg | 158*4 | 100 | 1 |
    | train_full (perceive_only) | 12*8 | 100 | 15 |
    | train_full | 20*8 | 100 | 15 |

    And of course, I ran the LiDAR painting with the segmentation model trained above.

  3. Is the default epoch* you mentioned the same as what the repo says, as I list in the table? I saw the segmentation loss still decreasing after 1 epoch, so 1 epoch does not seem enough for train_seg as the default.

  4. The released default weights perform even better than my model trained on the whole dataset, but they also have serious collisions and do not reach the results LAV shows on the online leaderboard... that's why I'm curious about it.
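For reference, the DP to DDP change follows the standard PyTorch DistributedDataParallel pattern, roughly like the sketch below (a toy model and dataset, launched with torchrun; this is not the actual LAV training script):

```python
# Rough sketch of replacing nn.DataParallel with DistributedDataParallel
# (generic PyTorch pattern with a toy model/dataset, not the LAV training code).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main(num_epochs=10, batch_size=64):
    # One process per GPU, launched with torchrun; LOCAL_RANK is set by the launcher.
    dist.init_process_group(backend='nccl')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(16, 1).cuda(local_rank)   # toy model standing in for the real one
    model = DDP(model, device_ids=[local_rank])       # replaces nn.DataParallel(model)

    dataset = TensorDataset(torch.randn(4096, 16), torch.randn(4096, 1))
    sampler = DistributedSampler(dataset)             # each process sees its own shard
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler, num_workers=2)

    optim = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.MSELoss()
    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)                      # reshuffle shards every epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = loss_fn(model(x), y)
            optim.zero_grad()
            loss.backward()
            optim.step()

    dist.destroy_process_group()

if __name__ == '__main__':
    main()
```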


Here is more information about the loss figures I mentioned, for the seg and bra training:

[image: seg and bra training loss curves]

Thanks for your help.

dotchen commented 2 years ago

> I just used the online leaderboard to evaluate, so the route file is private. I only ran the first five routes and then killed the job to check the results, since that seemed enough and the whole process takes a long time.

The first 5 runs would not be indicative because they are 5 repetitions of the first route, and routes have varying difficulties. I would be able to help more if you evaluated on public routes and had a visualization.

> Is the default epoch* you mentioned the same as what the repo says, as I list in the table? I saw the segmentation loss still decreasing after 1 epoch, so 1 epoch does not seem enough for train_seg as the default.

Yes. Our online leaderboard entry also uses these epoch settings.

I would not insist on 100 epochs, since a decreasing loss does not mean better performance at test time. The loss plotted here is the training loss, and even if it were the validation loss, a lower value would not mean the model drives better, due to distribution mismatch.

Kin-Zhang commented 2 years ago

> The first 5 runs would not be indicative because they are 5 repetitions of the first route.

> Our online leaderboard entry also uses these epoch settings.

Oh, I see now... thanks for the reminder. I will try the default epoch settings to see whether it performs better.

> if you evaluated on public routes and had a visualization

What information needs to be visualized? The BEV map, the segmentation results, or the detections?

> even if it were the validation loss, a lower value would not mean the model drives better, due to distribution mismatch

Thanks for guiding me on this; I will try again. But how do you verify that 1 epoch is the best choice? How did you select those default epochs?


Really appreciate it.

dotchen commented 2 years ago

> Thanks for guiding me on this; I will try again. But how do you verify that 1 epoch is the best choice? How did you select those default epochs?

Note that the only model trained for 1 epoch is semantic segmentation, which is not an end-task model. All that is needed are the semantic scores that get painted onto the LiDAR. Sure, you will get higher IoU if you train longer, but as long as the most useful classes, i.e. roads, vehicles, and pedestrians, are segmented reasonably well, it suffices as a sensor fusion model. It also likely does not need refined segmentation masks given the low resolution of a 64-beam LiDAR.
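For intuition, the painting step is conceptually just projecting the LiDAR points into the camera and sampling the per-pixel class scores, roughly like this sketch (made-up variable names and camera conventions, not the exact code in this repo):

```python
# Rough sketch of painting per-pixel semantic scores onto LiDAR points.
# K: 3x3 camera intrinsics, T: 4x4 LiDAR->camera extrinsic, seg: HxWxC class scores.
import numpy as np

def paint_lidar(points, seg, K, T):
    """points: Nx3 LiDAR xyz; returns Nx(3+C) points with class scores appended."""
    n = points.shape[0]
    pts_h = np.concatenate([points, np.ones((n, 1))], axis=1)  # homogeneous coordinates
    cam = (T @ pts_h.T).T[:, :3]                                # LiDAR frame -> camera frame
    in_front = cam[:, 2] > 0.1                                  # keep points in front of the camera
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                                 # perspective divide -> pixel coords
    h, w, _ = seg.shape
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, w - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, h - 1)
    scores = seg[v, u].copy()                                   # sample per-pixel class scores
    scores[~in_front] = 0.0                                     # ignore points behind the camera
    # (a real implementation would also mask points that project outside the image)
    return np.concatenate([points, scores], axis=1)
```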

P.S. Apologies in advance: I will not be able to reply to this issue as promptly as before, since I am starting to get busy with other work.

Kin-Zhang commented 2 years ago

> I will not be able to reply to this issue as promptly as before

It's okay, I will just leave messages here. I really appreciate you replying to our questions.

> Sure, you will get higher IoU if you train longer, but as long as the most useful classes, i.e. roads, vehicles, and pedestrians, are segmented reasonably well, it suffices as a sensor fusion model. It also likely does not need refined segmentation masks given the low resolution of a 64-beam LiDAR.

Thanks for explaining it. But for train_full, which produces lidar.th and uniplanner.th, did you run, say, 20 epochs and evaluate them one by one, selecting 15 epochs as the default because it achieves the best results?


I evaluated on the Town01 devtest.xml routes from the official leaderboard route files. Here is one of the scenarios that causes lots of collisions; similar issues, such as collisions with pedestrians, also occur on other routes:

https://user-images.githubusercontent.com/35365764/162203224-fe768f02-b5d5-4b2a-a1d9-c02f15d66aed.mp4

[image: collision scenario screenshot]

The model I used was trained on the dataset described above, selecting earlier epochs:

| phase | batch size | epoch | default epoch* |
| --- | --- | --- | --- |
| train_bev | 512 | 160 | 160 |
| train_bra | 64*2 | 10 | 10 |
| train_seg | 158*4 | 10 | 1 |
| train_full (perceive_only) | 12*8 | 20 | 15 |
| train_full | 20*8 | 20 | 15 |

From the video, is the detection too slow to pick up the cyclist? The segmentation result looks great, so I have no idea how to improve this to match how the original LAV performs...

dotchen commented 2 years ago

Aha, so you have encountered one of the most hilarious collisions in the current scenario runner setup. Note how the cyclist does not move until the ego vehicle has already passed it. This is because the trigger box location is not properly set up for this particular scenario. It would also be inappropriate for the ego car to stop before the cyclist has entered the road. I do not have a neat solution for this, and even our leaderboard entry suffers from it. But if you really want to avoid this, you can make the car go slower or faster, so the cyclist is triggered while the car is still approaching or has already passed it.
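To make this failure mode concrete, here is a toy illustration of the trigger-box idea (not the actual scenario_runner code): the cyclist only starts moving once the ego vehicle enters the trigger region, so if that region is placed too far down the road, the scenario fires after the ego has already reached the crossing point.

```python
# Toy illustration of a misplaced trigger box (not the scenario_runner implementation).
import numpy as np

def inside_trigger_box(ego_xy, center_xy, half_extent_xy):
    """Axis-aligned check: is the ego position inside the trigger region?"""
    d = np.abs(np.asarray(ego_xy) - np.asarray(center_xy))
    return bool(np.all(d <= np.asarray(half_extent_xy)))

# Suppose the cyclist crosses at x = 20, but the trigger box is centered at x = 30:
# the scenario only starts once the ego reaches x ~ 25, i.e. past the crossing point.
for x in range(10, 40, 5):
    triggered = inside_trigger_box((x, 0.0), (30.0, 0.0), (5.0, 2.0))
    print(f"ego at x={x:>2}: cyclist triggered = {triggered}")
```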

Kin-Zhang commented 2 years ago

Thanks for letting me know about it. Really appreciate it.

> you can make the car go slower or faster, so the cyclist is triggered while the car is still approaching or has already passed it

So the default speed shown here in config.yaml is not the one used by the online LAV entry, since the speed may be too fast?

https://github.com/dotchen/LAV/blob/04d23bfd68692ed5f9f57ce77decc5f0eb821d40/config.yaml#L75-L76


  1. I also noticed that the fully trained LAV has a lot of red-light infractions, but the online LAV entry seems to do really well: in the first five routes (which just repeat the first route) it had no red-light infractions. Could this be related to bra.th?

  2. But for train_full, which produces lidar.th and uniplanner.th, did you run, say, 20 epochs and evaluate them one by one, selecting 15 epochs as the default because it achieves the best results?


I think those are all the problems I have; thanks a lot for all these replies.

dotchen commented 2 years ago

Our online leaderboard entry also uses 35 km/h as the speed cap. What I meant in my earlier reply was that, even when I evaluate our online leaderboard entry on local routes, I have seen such error modes. But if you really just want to avoid this error in the particular route you are evaluating, you can lower the speed cap. However, this will change other metrics, such as timeouts. Does this make sense?

Also, let me know how you know how our online leaderboard entry performs in the first 5 routes, because even I don't know that :P

Kin-Zhang commented 2 years ago

> Also, let me know how you know how our online leaderboard entry performs in the first 5 routes, because even I don't know that :P

The public entries can be viewed on the leaderboard by metric, so you can see the infraction values for each metric.

I see. I will keep these settings the same for later comparisons between methods.

Thanks again! ❤️

zhaoy376 commented 1 year ago

> But for train_full, which produces lidar.th and uniplanner.th, did you run, say, 20 epochs and evaluate them one by one, selecting 15 epochs as the default because it achieves the best results?

Hi dotchen and Kin-Zhang! I also want to figure out how to determine the best uniplanner model. Since you provide a model named uniplanner_v2_7.th in the weights folder, does that mean the model was trained for 7 epochs? However, the model I trained myself cannot reach metrics as high as the model you provide, even though it was also trained for 7 epochs.