OpenDriveLab / TCP

[NeurIPS 2022] Trajectory-guided Control Prediction for End-to-end Autonomous Driving: A Simple yet Strong Baseline.
Apache License 2.0

some details in dataset and network #2

Closed · Kin-Zhang closed this issue 2 years ago

Kin-Zhang commented 2 years ago

routes for training

The paper says data was collected in 8 towns, but the CARLA leaderboard only provides route XML files for six towns. The paper says the routes are "generated randomly with length ranging from 50 meters to 300 meters."

Could you please share the route files used for data collection, or the random-generation scripts? Training data is sometimes more important than the model... just sometimes. I am curious about the exact details of the data routes, also because the paper has no table comparing other methods trained on your dataset.

network

  1. The details of the measurement encoder are not given in the paper. Is it just one MLP mapping `concat[v, turn_left, (x, y)]` to the desired size? Adding the input and output sizes of each module would help readers understand the network better. Also, what are the exact output sizes of $\mathbf{F}$ and $\mathbf{j}_m$, which form $\mathbf{j}^{traj}$?

  2. Is the input to the whole network a single image, or $K = 4$ images?

penghao-wu commented 2 years ago

For the training routes, we just use the list of waypoints generated by `world.get_map().generate_waypoints(2)`, randomly choose a start point and an end point from it, and project them onto the road using `world_map.get_waypoint(start_wp.transform.location, project_to_road=True, lane_type=carla.LaneType.Driving)`. Then we interpolate the trajectory defined by the start and end points to compute its total distance, and only keep routes whose length is between 50 and 300 m.
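
For concreteness, here is a minimal Python sketch of that sampling procedure (not the authors' actual script). It assumes the CARLA 0.9.10-era agents API, where `GlobalRoutePlanner` is built from a `GlobalRoutePlannerDAO` (removed in 0.9.12), and uses `trace_route` to interpolate the route along the road before measuring its length:

```python
# Hypothetical reconstruction of the route sampling described above,
# assuming the CARLA 0.9.10 agents API. Not the authors' script.
import random

import carla
from agents.navigation.global_route_planner import GlobalRoutePlanner
from agents.navigation.global_route_planner_dao import GlobalRoutePlannerDAO

client = carla.Client("localhost", 2000)
world = client.get_world()
world_map = world.get_map()

# Candidate waypoints every 2 m along all lanes, as in the reply above.
candidates = world_map.generate_waypoints(2)

planner = GlobalRoutePlanner(GlobalRoutePlannerDAO(world_map, sampling_resolution=1.0))
planner.setup()

def sample_route(min_len=50.0, max_len=300.0):
    """Rejection-sample a (start, end) pair whose road distance is in range."""
    while True:
        start_wp, end_wp = random.sample(candidates, 2)
        # Project both endpoints onto a driving lane.
        start_wp = world_map.get_waypoint(
            start_wp.transform.location,
            project_to_road=True, lane_type=carla.LaneType.Driving)
        end_wp = world_map.get_waypoint(
            end_wp.transform.location,
            project_to_road=True, lane_type=carla.LaneType.Driving)
        # Interpolate the route along the road and accumulate its length.
        route = planner.trace_route(start_wp.transform.location,
                                    end_wp.transform.location)
        length = sum(a[0].transform.location.distance(b[0].transform.location)
                     for a, b in zip(route[:-1], route[1:]))
        if min_len <= length <= max_len:
            return start_wp, end_wp, length
```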

Thank you for your advice; we will add network details in a future version. The measurement encoder is implemented as a single MLP over the concatenation of the inputs, and it encodes them into a vector of dim 128. All the $\mathbf{j}$ vectors have dim 256. The input is only one image. The future control predictions are made from abstractions of the future world states estimated by the temporal module.
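
As a rough illustration of such a one-MLP encoder, here is a hypothetical PyTorch sketch. Only the 128-dim output comes from the reply above; the input layout (speed, one-hot command, target point) and the hidden width are assumptions:

```python
# Hypothetical measurement encoder: one MLP over concatenated measurements,
# producing a 128-dim vector (per the reply above). Hidden width is assumed.
import torch
import torch.nn as nn

class MeasurementEncoder(nn.Module):
    def __init__(self, in_dim=9, out_dim=128):
        # in_dim = 1 (speed) + 6 (one-hot command) + 2 (target point), assumed.
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, out_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, speed, command_onehot, target_point):
        # speed: (B, 1), command_onehot: (B, 6), target_point: (B, 2)
        m = torch.cat([speed, command_onehot, target_point], dim=1)
        return self.mlp(m)  # (B, 128), later fused with the image feature F
```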

Kin-Zhang commented 2 years ago

Thanks for replying. Sorry to bother you again about the details... maybe the code will be open-sourced in the future, but these points confuse people reading the paper before they can read your code.


Here are other questions about the temporal module. In Figure 3, the inputs are $\mathbf{h}_{t-1}^{traj}$ and $\mathbf{h}_{t-1}^{ctl}$:

  1. Where do these two inputs come from, and what are their shapes? The paper says the hidden states come from the two branches. Does that mean from a GRU in the trajectory branch and from some linear layer of an MLP in the control branch? If so, how many linear layers does that MLP have, which one provides the hidden state, and what is the exact size of the hidden states?


  2. What are the details of the blue and orange blocks of these two networks? Are they also MLPs? And is $\mathbf{wp}$ an {x, y} pair and $\mathbf{a}_0$ a {steer, throttle, brake} tuple, respectively?

  3. According to the paper, does that mean $K$ is also the number of GRU modules in the trajectory branch?

  4. About the feature loss on $\mathbf{j}$: in [55] (Roach), the expert has the ground-truth BEV, which is why its feature can serve as a loss target. But in your paper TCP only receives the image, so how can the expert feature be compared with the trajectory feature when both would have image input?

penghao-wu commented 2 years ago
  1. Since the temporal module is also implemented with a GRU, both hidden states are just the hidden states of their respective GRUs. Both are initialized with all zeros, and their dims are both 256 (a minimal sketch follows this list).
  2. The blue and orange blocks are both MLPs. The $\mathbf{wp}$ is just the {x, y}, similar to Transfuser. The output of the control branch is the set of parameters of a Beta distribution; we follow the Beta distribution used in Roach, so you can refer to it for details.
  3. Yes, $K$ is the number of iterations of the GRU modules in both branches.
  4. Note that the feature loss in Roach is between the feature from the expert (which takes the BEV ground truth as input) and the feature from the student (which takes the raw image as input), in a knowledge-distillation manner.
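
Putting answers 1–3 together, here is a hypothetical PyTorch sketch of one branch's GRU rollout. The 256-dim, zero-initialized hidden state and the $K$ iterations follow the replies above; the head widths and the input wiring are illustrative assumptions:

```python
# Hypothetical K-step GRU branch. Hidden dim 256 and zero init follow the
# reply above; head sizes and input wiring are assumptions for illustration.
import torch
import torch.nn as nn

class GRUBranch(nn.Module):
    def __init__(self, in_dim=256, hidden_dim=256, out_dim=2, K=4):
        super().__init__()
        self.K = K
        self.gru = nn.GRUCell(in_dim, hidden_dim)
        self.head = nn.Sequential(      # the "blue/orange" MLP block
            nn.Linear(hidden_dim, 64),
            nn.ReLU(inplace=True),
            nn.Linear(64, out_dim),     # {x, y} offsets or Beta parameters
        )

    def forward(self, j):
        # j: (B, in_dim) branch feature; hidden state starts at all zeros.
        h = j.new_zeros(j.size(0), self.gru.hidden_size)
        outs = []
        for _ in range(self.K):         # K iterations of the same GRUCell
            h = self.gru(j, h)          # h plays the role of h_t^{traj/ctl}
            outs.append(self.head(h))
        return torch.stack(outs, dim=1)  # (B, K, out_dim)
```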
Kin-Zhang commented 2 years ago
  1. For this question, does that mean your paper also uses the BEV ground truth as input for the expert, rather than the raw image?
penghao-wu commented 2 years ago
> For this question, does that mean your paper also uses the BEV ground truth as input for the expert, rather than the raw image?

Yes, we use the same expert as Roach.
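
In other words, the feature loss is a distillation term between the frozen Roach expert's feature (BEV input) and TCP's student feature (image input). A minimal sketch, assuming a simple L2 distillation objective:

```python
# Hypothetical feature-distillation loss: pull the student feature j
# (computed from the raw image) toward the frozen expert feature
# (computed from the ground-truth BEV). The L2 form is an assumption here.
import torch
import torch.nn.functional as F

def feature_loss(student_j: torch.Tensor, expert_j: torch.Tensor) -> torch.Tensor:
    # The expert is frozen, so its feature carries no gradient.
    return F.mse_loss(student_j, expert_j.detach())
```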

Kin-Zhang commented 2 years ago

Thanks for the explanations.