OpenDriveLab / TCP

[NeurIPS 2022] Trajectory-guided Control Prediction for End-to-end Autonomous Driving: A Simple yet Strong Baseline.
Apache License 2.0

Waypoint Generation and Loss Architecture #5

Closed vaydingul closed 1 year ago

vaydingul commented 1 year ago

Hi again 😄

I have a few questions regarding the waypoint generation and loss calculation.

First of all, there are several feature vector representations in the paper (e.g., $\mathbf{j_m}$, $\mathbf{j^{traj}}$, $\mathbf{j^{ctl}}$), but I think there are some ambiguities around them. What I assumed is the following:

  1. Let's say the output of the ResNet is $\mathbf{j_{image}}$, which is the average-pooled and flattened version of $\mathbf{F}$; and the output of the measurement encoder, which takes the (high-level navigational command, speed, target velocity) triplet as input, is $\mathbf{j_m}$.
  2. Concatenating these individual feature vectors gives $\mathbf{j} = \mathrm{Concat}(\mathbf{j_{image}}, \mathbf{j_m})$, which is then fed to the GRUs.
  3. In #2, you've mentioned that the initial hidden input of the GRUs is just zeros with the same dimension as the feature vector. In this case, how are we propagating the feature information to the GRUs?

Multi-Step Control Prediction Part:

    • For the multi-step control prediction part, based on Figure 3 in the paper, my pseudocode is as follows (a fuller sketch follows this list):

```python
hidden = GRU([control_action, j], hidden)  # first hidden is zeros(256)
j = FeatureEncoder(hidden)
control_action = PolicyHead(j)
```

    • Then, in the loss calculation part, $\mathbf{j_0^{ctl}}$ is just $\mathbf{j}$, and $\mathbf{j_n^{ctl}}$ ($1 \leq n \leq K$) are the intermediate feature vectors output by the Feature Encoder.
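To make assumptions 1–3 concrete, here is a minimal PyTorch sketch of what I currently have in mind (module names such as `measurement_encoder`/`feature_encoder` and all layer sizes are my own placeholders, not necessarily yours):

```python
import torch
import torch.nn as nn
import torchvision

class AssumedTCP(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        # 1. Image encoder: ResNet whose average-pooled, flattened output is j_image.
        resnet = torchvision.models.resnet34(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])   # drop the final FC layer
        # Measurement encoder: (one-hot command, speed, target velocity) -> j_m.
        self.measurement_encoder = nn.Sequential(nn.Linear(6 + 1 + 1, 128), nn.ReLU(), nn.Linear(128, 128))
        # 2. j = Concat(j_image, j_m), projected to the GRU feature size.
        self.fuse = nn.Linear(512 + 128, feat_dim)
        # 3. Multi-step control prediction (my assumption: GRU input = [control_action, j]).
        self.gru = nn.GRUCell(input_size=3 + feat_dim, hidden_size=feat_dim)
        self.feature_encoder = nn.Linear(feat_dim, feat_dim)    # "FeatureEncoder" in my pseudocode
        self.policy_head = nn.Linear(feat_dim, 3)                # e.g. throttle, steer, brake

    def forward(self, image, measurements, steps=4):
        j_image = self.backbone(image).flatten(1)                # average-pooled image feature (B, 512)
        j_m = self.measurement_encoder(measurements)             # measurement feature (B, 128)
        j = self.fuse(torch.cat([j_image, j_m], dim=1))          # j_0^ctl under my assumption
        hidden = torch.zeros_like(j)                             # zeros(256), as stated in #2
        control_action = image.new_zeros(image.shape[0], 3)
        ctl_features, actions = [j], []
        for _ in range(steps):
            hidden = self.gru(torch.cat([control_action, j], dim=1), hidden)
            j = self.feature_encoder(hidden)                     # j_n^ctl, 1 <= n <= K
            control_action = self.policy_head(j)
            ctl_features.append(j)
            actions.append(control_action)
        return actions, ctl_features
```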

Waypoint Prediction Part:

I'd be very happy if you could give me an idea about the degree of correctness of my assumptions above regarding your model.

Waypoint Prediction: The question is about the generation of ground-truth waypoints. One natural solution might be to use the GNSS data from future frames. However, when the data is collected with Roach, the waypoints become very noisy due to fluctuations in the control actions (a typical RL issue). I am attaching a video here. Note that in this video, the waypoints are strided (every fourth future GNSS point). When I collect the subsequent future data points directly, the result is nearly a cluster of points rather than a spline-like curve showing the future trajectory. An example of the non-strided case is here.

Finally, I am assuming that the waypoints are predicted with respect to the car's reference frame, not the world's. Then, the average magnitude and heading of the vectors formed by consecutive waypoints are fed to the longitudinal and lateral PID controllers, respectively.
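For context, this is roughly the transformation and PID-input computation I have in mind (a NumPy sketch under my own assumptions; `yaw` is the ego heading in radians and `ego_pos` the ego position in world coordinates):

```python
import numpy as np

def world_to_ego(waypoints_world, ego_pos, yaw):
    """Express future GNSS positions (N, 2) in the car's reference frame."""
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, s], [-s, c]])             # rotation by -yaw
    return (waypoints_world - ego_pos) @ rot.T

def pid_inputs(waypoints_ego):
    """Average magnitude and heading of the vectors between consecutive waypoints,
    which I assume drive the longitudinal and lateral PID controllers respectively."""
    vecs = np.diff(waypoints_ego, axis=0)         # vectors between consecutive waypoints
    speed = np.linalg.norm(vecs, axis=1).mean()   # average segment length ~ desired speed
    heading = np.arctan2(vecs[:, 1], vecs[:, 0]).mean()  # average heading angle
    return speed, heading
```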

I know I am bugging you a lot 🥲, but I think everything will become clearer for all readers through the answers to these questions.

penghao-wu commented 1 year ago

For the control branch, the initial $\mathbf{j_0^{ctl}}$ is computed from the concatenation of the measurement feature and the image feature obtained by aggregating $\mathbf{F}$ with an initial attention map, while for the trajectory branch, $\mathbf{j_0^{traj}}$ is computed from the concatenation of the measurement feature and the average-pooled image feature. Therefore, $\mathbf{j_0^{ctl}}$ and $\mathbf{j_0^{traj}}$ are different. For the trajectory branch, the initial hidden state is $\mathbf{j_0^{traj}}$, and the input to its GRU is [waypoint, target_point]. Sorry for the possibly unclear details in the paper. We will revise it to make it clearer.
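Roughly, in simplified code, the difference between the two initial features is as follows (the layers and shapes here are only illustrative, not the exact implementation):

```python
import torch
import torch.nn as nn

B, C, H, W = 4, 512, 8, 29          # illustrative shapes for the image feature map F
F_map = torch.randn(B, C, H, W)     # feature map F from the image backbone
j_m = torch.randn(B, 128)           # measurement feature

# Control branch: j_0^ctl aggregates F with an initial attention map.
att_head = nn.Sequential(nn.Linear(128, H * W), nn.Softmax(dim=1))
init_att = att_head(j_m).view(B, 1, H, W)
img_feat_ctl = (F_map * init_att).sum(dim=(2, 3))                        # (B, C)
j0_ctl = nn.Linear(C + 128, 256)(torch.cat([j_m, img_feat_ctl], dim=1))

# Trajectory branch: j_0^traj uses the average-pooled image feature instead.
img_feat_traj = F_map.mean(dim=(2, 3))                                   # (B, C)
j0_traj = nn.Linear(C + 128, 256)(torch.cat([j_m, img_feat_traj], dim=1))

# Trajectory GRU: the hidden state starts from j_0^traj, the input is [waypoint, target_point].
traj_gru = nn.GRUCell(input_size=2 + 2, hidden_size=256)
waypoint, target_point = torch.zeros(B, 2), torch.randn(B, 2)
h = traj_gru(torch.cat([waypoint, target_point], dim=1), j0_traj)
```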

I think the waypoints shown in your video are not normal. Could you please try removing the noise added to the expert during data collection and see whether this is still the case?

Kin-Zhang commented 1 year ago

> Could you please try to remove the noise added to the expert during data collection

But since the GPS also has noise on the online leaderboard, shouldn't that noise be added during collection to keep the settings consistent? Did the TCP dataset remove all sensor noise, such as GPS noise, camera distortion, etc.?

penghao-wu commented 1 year ago

No, not the noise in sensors. I mean the ExpertNoiser used in Roach.

vaydingul commented 1 year ago

Hi,

In the meantime, I've solved the waypoint generation issue. Currently, I am able to fetch them from the intermediate route traces.
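In case it helps others, here is roughly what I do now: resample the dense route trace at a fixed arc-length spacing (my own helper, not part of the TCP code; the spacing and number of points are placeholders):

```python
import numpy as np

def resample_route(route_xy, spacing=1.0, num_points=4):
    """Resample a dense route polyline (N, 2) into `num_points` waypoints
    spaced `spacing` meters apart along the arc length, starting at the ego."""
    seg = np.linalg.norm(np.diff(route_xy, axis=0), axis=1)   # segment lengths
    arclen = np.concatenate([[0.0], np.cumsum(seg)])           # cumulative arc length
    targets = spacing * np.arange(1, num_points + 1)            # desired arc-length positions
    xs = np.interp(targets, arclen, route_xy[:, 0])
    ys = np.interp(targets, arclen, route_xy[:, 1])
    return np.stack([xs, ys], axis=1)
```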

Is there any code example you could point me to for understanding the attention-guidance mechanism? I think it is still not entirely clear to me.

What I understand is the following:

Trajectory Branch

Control Branch

penghao-wu commented 1 year ago

Trajectory Branch

We remove the last FC layer from the official ResNet structure so that the average-pooled and flattened feature has dimension 512, which is concatenated with the measurement vector. The concatenated vector then goes through a 3-layer MLP to get $\mathbf{j_0^{traj}}$ as the initial hidden state of the GRU.

Control Branch

For the 2-D attention weight, we simply reshape the 1-D output of the MLP to $1 \times H \times W$. You can refer to the code below for the $i^{th}$ iteration of the control branch. The first hidden state is `zeros(256)`.

```python
h = GRU([control_action, j], h)                                     # update the GRU hidden state
wp_att = MLP([h, traj_hidden_state[:, i]]).view(-1, 1, H, W)        # reshape the 1-D MLP output into a 2-D attention map
new_img_feature_emb = torch.sum(cnn_feature * wp_att, dim=(2, 3))   # attention-weighted aggregation of the image feature map
merged_feature = MLP([h, new_img_feature_emb])
dj = MLP(merged_feature)
j = dj + j                                                          # future feature
control_action = PolicyHead(j)                                      # future action
```
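For completeness, the trajectory branch iteration looks roughly like this in the same simplified style (not the exact code; the waypoint is accumulated from per-step offsets):

```python
h_traj = j0_traj                        # initial hidden state is j_0^traj
waypoint = zeros(2)                     # waypoints start at the ego position
for i in range(num_waypoints):
    h_traj = GRU_traj([waypoint, target_point], h_traj)  # GRU input is [waypoint, target_point]
    traj_hidden_state[:, i] = h_traj    # stored; later attended over by wp_att in the control branch
    d_wp = MLP_traj(h_traj)             # per-step waypoint offset
    waypoint = waypoint + d_wp          # accumulate to get the i-th waypoint
```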