OpenDriveLab / TCP

[NeurIPS 2022] Trajectory-guided Control Prediction for End-to-end Autonomous Driving: A Simple yet Strong Baseline.
Apache License 2.0

Waypoint Generation and Loss Architecture #5

Closed vaydingul closed 1 year ago

vaydingul commented 1 year ago

Hi again 😄

I have a few questions regarding the waypoint generation and loss calculation.

First of all, there are several feature vector representations in the paper (e.g., $\mathbf{j_m}$, $\mathbf{j^{traj}}$, $\mathbf{j^{ctl}}$), but I think there are some ambiguities around them. What I assumed is the following:

  1. Let's say the output of the ResNet is $\mathbf{j_{image}}$, which is the average-pooled and flattened version of $\mathbf{F}$; and the output of the measurement encoder, which takes the (high-level navigational command, speed, target velocity) triplet as input, is $\mathbf{j_m}$.
  2. Concatenating these individual feature vectors gives $\mathbf{j} = \mathrm{Concat}(\mathbf{j_{image}}, \mathbf{j_m})$, which is then fed to the GRUs.
  3. In #2, you've mentioned that the initial hidden input of the GRUs is just zeros with the same dimension as the feature vector. In this case, how are we propagating the feature information to the GRUs?

Multi-Step Control Prediction Part:

    • For the multi-step control prediction part, based on Figure 3 in the paper, my pseudocode is as follows (a fuller sketch follows this list):

```python
hidden = GRU([control_action, j], hidden)  # first hidden is zeros(256)
j = FeatureEncoder(hidden)
control_action = PolicyHead(j)
```

    • Then, in the loss calculation part, $\mathbf{j_0^{ctl}}$ is just $\mathbf{j}$, and $\mathbf{j_n^{ctl}}$ ($1 \leq n \leq K$) are the intermediate feature vectors output by the Feature Encoder.
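To make assumptions 1–3 concrete, here is a minimal PyTorch sketch of what I currently have in mind (module names such as `measurement_encoder`/`feature_encoder` and all layer sizes are my own placeholders, not necessarily yours):

```python
import torch
import torch.nn as nn
import torchvision

class AssumedTCP(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        # 1. Image encoder: ResNet whose average-pooled, flattened output is j_image.
        resnet = torchvision.models.resnet34(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])   # drop the final FC layer
        # Measurement encoder: (one-hot command, speed, target velocity) -> j_m.
        self.measurement_encoder = nn.Sequential(nn.Linear(6 + 1 + 1, 128), nn.ReLU(), nn.Linear(128, 128))
        # 2. j = Concat(j_image, j_m), projected to the GRU feature size.
        self.fuse = nn.Linear(512 + 128, feat_dim)
        # 3. Multi-step control prediction (my assumption: GRU input = [control_action, j]).
        self.gru = nn.GRUCell(input_size=3 + feat_dim, hidden_size=feat_dim)
        self.feature_encoder = nn.Linear(feat_dim, feat_dim)    # "FeatureEncoder" in my pseudocode
        self.policy_head = nn.Linear(feat_dim, 3)                # e.g. throttle, steer, brake

    def forward(self, image, measurements, steps=4):
        j_image = self.backbone(image).flatten(1)                # average-pooled image feature (B, 512)
        j_m = self.measurement_encoder(measurements)             # measurement feature (B, 128)
        j = self.fuse(torch.cat([j_image, j_m], dim=1))          # j_0^ctl under my assumption
        hidden = torch.zeros_like(j)                             # zeros(256), as stated in #2
        control_action = image.new_zeros(image.shape[0], 3)
        ctl_features, actions = [j], []
        for _ in range(steps):
            hidden = self.gru(torch.cat([control_action, j], dim=1), hidden)
            j = self.feature_encoder(hidden)                     # j_n^ctl, 1 <= n <= K
            control_action = self.policy_head(j)
            ctl_features.append(j)
            actions.append(control_action)
        return actions, ctl_features
```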

Waypoint Prediction Part:

I'd be very happy if you could give me an idea about the degree of correctness of my assumptions above regarding your model.

Waypoint Prediction: The question is about the generation of ground-truth waypoints. One natural solution might be to use the GNSS data from future frames. However, when the data is collected with Roach, the waypoints become very noisy due to fluctuations in the control actions (a typical RL issue). I am attaching a video here. Note that in this video, the waypoints are strided (every fourth future GNSS point). When I collect the subsequent future data points directly, the result is nearly a cluster of points rather than a spline-like curve showing the future trajectory. An example of the non-strided case is here.

Finally, I am assuming that the waypoints are predicted with respect to the car's reference frame, not the world's. Then, the average magnitude and heading of the vectors formed by consecutive waypoints are fed to the longitudinal and lateral PID controllers, respectively.
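For context, this is roughly the transformation and PID-input computation I have in mind (a NumPy sketch under my own assumptions; `yaw` is the ego heading in radians and `ego_pos` the ego position in world coordinates):

```python
import numpy as np

def world_to_ego(waypoints_world, ego_pos, yaw):
    """Express future GNSS positions (N, 2) in the car's reference frame."""
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, s], [-s, c]])             # rotation by -yaw
    return (waypoints_world - ego_pos) @ rot.T

def pid_inputs(waypoints_ego):
    """Average magnitude and heading of the vectors between consecutive waypoints,
    which I assume drive the longitudinal and lateral PID controllers respectively."""
    vecs = np.diff(waypoints_ego, axis=0)         # vectors between consecutive waypoints
    speed = np.linalg.norm(vecs, axis=1).mean()   # average segment length ~ desired speed
    heading = np.arctan2(vecs[:, 1], vecs[:, 0]).mean()  # average heading angle
    return speed, heading
```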

I know I am bugging you a lot 🥲, but I think everything will become clearer for all readers through the answers to these questions.

penghao-wu commented 1 year ago

For the control branch, the initial $\mathbf{j_0^{ctl}}$ is computed from the concatenation of the measurement feature and the image feature obtained by aggregating $\mathbf{F}$ with an initial attention map, while for the trajectory branch, $\mathbf{j_0^{traj}}$ is computed from the concatenation of the measurement feature and the average-pooled image feature. Therefore, $\mathbf{j_0^{ctl}}$ and $\mathbf{j_0^{traj}}$ are different. For the trajectory branch, the initial hidden state is $\mathbf{j_0^{traj}}$, and the input to its GRU is [waypoint, target_point]. Sorry for the possibly unclear details in the paper. We will revise it to make it clearer.
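Roughly, in simplified code, the difference between the two initial features is as follows (the layers and shapes here are only illustrative, not the exact implementation):

```python
import torch
import torch.nn as nn

B, C, H, W = 4, 512, 8, 29          # illustrative shapes for the image feature map F
F_map = torch.randn(B, C, H, W)     # feature map F from the image backbone
j_m = torch.randn(B, 128)           # measurement feature

# Control branch: j_0^ctl aggregates F with an initial attention map.
att_head = nn.Sequential(nn.Linear(128, H * W), nn.Softmax(dim=1))
init_att = att_head(j_m).view(B, 1, H, W)
img_feat_ctl = (F_map * init_att).sum(dim=(2, 3))                        # (B, C)
j0_ctl = nn.Linear(C + 128, 256)(torch.cat([j_m, img_feat_ctl], dim=1))

# Trajectory branch: j_0^traj uses the average-pooled image feature instead.
img_feat_traj = F_map.mean(dim=(2, 3))                                   # (B, C)
j0_traj = nn.Linear(C + 128, 256)(torch.cat([j_m, img_feat_traj], dim=1))

# Trajectory GRU: the hidden state starts from j_0^traj, the input is [waypoint, target_point].
traj_gru = nn.GRUCell(input_size=2 + 2, hidden_size=256)
waypoint, target_point = torch.zeros(B, 2), torch.randn(B, 2)
h = traj_gru(torch.cat([waypoint, target_point], dim=1), j0_traj)
```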

I think the waypoints shown in your video are not normal. Could you please try removing the noise added to the expert during data collection and see whether this is still the case?

Kin-Zhang commented 1 year ago

> Could you please try to remove the noise added to the expert during data collection

But since the GPS also has noise on the online leaderboard, shouldn't that noise be added during collection to keep the settings consistent? Did the TCP dataset remove all sensor noise, such as GPS noise, camera distortion, etc.?

penghao-wu commented 1 year ago

No, not the noise in sensors. I mean the ExpertNoiser used in Roach.

vaydingul commented 1 year ago

Hi,

In the meantime, I've solved the waypoint generation issue. Currently, I am able to fetch them from the intermediate route traces.
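In case it helps others, here is roughly what I do now: resample the dense route trace at a fixed arc-length spacing (my own helper, not part of the TCP code; the spacing and number of points are placeholders):

```python
import numpy as np

def resample_route(route_xy, spacing=1.0, num_points=4):
    """Resample a dense route polyline (N, 2) into `num_points` waypoints
    spaced `spacing` meters apart along the arc length, starting at the ego."""
    seg = np.linalg.norm(np.diff(route_xy, axis=0), axis=1)   # segment lengths
    arclen = np.concatenate([[0.0], np.cumsum(seg)])           # cumulative arc length
    targets = spacing * np.arange(1, num_points + 1)            # desired arc-length positions
    xs = np.interp(targets, arclen, route_xy[:, 0])
    ys = np.interp(targets, arclen, route_xy[:, 1])
    return np.stack([xs, ys], axis=1)
```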

Is there any code example you could point me to for understanding the attention-guidance mechanism? I think it is still not entirely clear to me.

What I understand is the following:

Trajectory Branch

Control Branch

penghao-wu commented 1 year ago

Trajectory Branch

We remove the last FC layer from the official ResNet structure so that the average-pooled and flattened feature has dimension 512, which is concatenated with the measurement vector. The concatenated vector then goes through a 3-layer MLP to get $\mathbf{j_0^{traj}}$ as the initial hidden state of the GRU.

Control Branch

For the 2-D attention weight, we simply reshape the 1-D output of the MLP to $1 \times H \times W$. You can refer to the code below for the $i^{th}$ iteration of the control branch. The first hidden state is `zeros(256)`.

```python
h = GRU([control_action, j], h)                                     # update the GRU hidden state
wp_att = MLP([h, traj_hidden_state[:, i]]).view(-1, 1, H, W)        # reshape the 1-D MLP output into a 2-D attention map
new_img_feature_emb = torch.sum(cnn_feature * wp_att, dim=(2, 3))   # attention-weighted aggregation of the image feature map
merged_feature = MLP([h, new_img_feature_emb])
dj = MLP(merged_feature)
j = dj + j                                                          # future feature
control_action = PolicyHead(j)                                      # future action
```
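For completeness, the trajectory branch iteration looks roughly like this in the same simplified style (not the exact code; the waypoint is accumulated from per-step offsets):

```python
h_traj = j0_traj                        # initial hidden state is j_0^traj
waypoint = zeros(2)                     # waypoints start at the ego position
for i in range(num_waypoints):
    h_traj = GRU_traj([waypoint, target_point], h_traj)  # GRU input is [waypoint, target_point]
    traj_hidden_state[:, i] = h_traj    # stored; later attended over by wp_att in the control branch
    d_wp = MLP_traj(h_traj)             # per-step waypoint offset
    waypoint = waypoint + d_wp          # accumulate to get the i-th waypoint
```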