OpenDriveLab / TCP

[NeurIPS 2022] Trajectory-guided Control Prediction for End-to-end Autonomous Driving: A Simple yet Strong Baseline.

question about traffic light #35

EcustBoy commented 1 year ago

Hi~ author, as I understand it, the TCP model directly uses the raw image and some measurement signals as input, without any intermediate perception results. So how does it learn traffic light information? If it relies only on expert trajectory samples for training, I would think the traffic light is too small in the front view for it to actually learn the "red-stop, green-start" behavior.

Besides, does the training dataset size have a crucial impact on the final performance in understanding traffic lights? Are there any relevant ablation experiments on this?

penghao-wu commented 1 year ago

Yes, it learns the "red-stop, green-start" behavior from the expert demonstrations, and I think the current camera setup can capture the traffic light information. But you could also add another camera with an explicit traffic light detection module to enhance this ability, similar to LAV.

Most of the training routes contain junctions with traffic lights, so traffic-light-related data is abundant. I think the dataset size is important for learning the rules about traffic lights, but we do not have such ablations.
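
For illustration, a rough sketch of what such an explicit traffic light branch might look like (all names and shapes here are hypothetical, not code from this repo or from LAV):

```python
import torch.nn as nn

# Hypothetical sketch: a dedicated traffic light head whose prediction can be
# appended to the measurement/state vector fed to the policy. Purely
# illustrative; not the TCP or LAV implementation.
class TrafficLightHead(nn.Module):
    def __init__(self, feat_dim=512, num_states=4):  # red/yellow/green/none
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, num_states),
        )

    def forward(self, image_feat):
        # image_feat: (B, feat_dim) pooled feature from the extra
        # (e.g. zoomed-in) traffic light camera
        return self.classifier(image_feat)
```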

EcustBoy commented 1 year ago

Thanks for your reply. Right now I only train on my own small dataset (about 75K samples), and I don't feed the image to the planner decoder directly; I think this is the main reason my model can't learn to understand traffic lights. :-)

I'm going to try to design a front-view feature extraction network similar to TCP's. It seems the ego car can learn the "red-stop, green-start" behavior as long as I feed the raw image into a simple network and train on a relatively big dataset, without any complicated design, right? Many thanks for your answer~

penghao-wu commented 1 year ago

So currently what is the input to your planner decoder if you do not feed the image features to it?

EcustBoy commented 1 year ago

Actually, I input (1) the detection embedding features for other cars and the map, which are output by the front backbone and detection head, and (2) some ego-car states (including the command waypoint and speed). So I think I shouldn't use only the intermediate features; it seems the raw image is also needed.

penghao-wu commented 1 year ago

Yes, you need to include inputs that carry traffic light information (such as raw images or traffic light detection results).
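
As a rough sketch (shapes and names here are assumptions, not this repo's code), the fusion could look like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative only: fuse a raw-image feature with the existing detection
# embeddings and ego states before the planner decoder, so traffic light
# information can reach the planner. Dimensions are made up.
class PlannerInputFusion(nn.Module):
    def __init__(self, img_dim=512, det_dim=256, state_dim=9, out_dim=256):
        super().__init__()
        self.proj = nn.Linear(img_dim + det_dim + state_dim, out_dim)

    def forward(self, img_feat, det_feat, ego_state):
        # img_feat:  (B, img_dim)   pooled front-view image feature
        # det_feat:  (B, det_dim)   embedding from the detection head
        # ego_state: (B, state_dim) speed, command waypoint, etc.
        fused = torch.cat([img_feat, det_feat, ego_state], dim=-1)
        return F.relu(self.proj(fused))
```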

EcustBoy commented 1 year ago

Hi~ author, I read your code again and noticed that you use a pretrained ResNet-34 to extract image features.

I want to ask: is a pretrained image backbone necessary if I only want to get traffic light info from the front view? To limit the network size, perhaps a shallow custom-designed network is already enough? Not sure whether you've made such a comparison~
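
For reference, this is roughly how I understand the pretrained feature extraction (a simplified sketch, not the repository's exact code):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

# Simplified sketch of pooling a front-view feature from a pretrained
# ResNet-34, in the spirit of TCP's image encoder (details simplified).
backbone = resnet34(pretrained=True)
feature_extractor = nn.Sequential(*list(backbone.children())[:-1])  # drop fc

image = torch.randn(1, 3, 256, 900)          # example front-view tensor
feat = feature_extractor(image).flatten(1)   # (1, 512) pooled feature
```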

penghao-wu commented 1 year ago

I think a shallow network would suffice if you have direct supervision on the traffic light states.
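
For example, a minimal sketch of such a shallow network with direct traffic light supervision could look like this (illustrative only, assuming state labels are available from the simulator):

```python
import torch.nn as nn

# Illustrative sketch: a shallow CNN trained with direct supervision on
# traffic light states, e.g. cross-entropy against simulator labels.
class ShallowTrafficLightNet(nn.Module):
    def __init__(self, num_states=4):  # e.g. red / yellow / green / none
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(128, num_states)

    def forward(self, x):
        # x: (B, 3, H, W) front-view image; returns (B, num_states) logits,
        # to be trained with nn.functional.cross_entropy against the labels
        return self.head(self.features(x).flatten(1))
```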