Tsinghua-MARS-Lab / StateTransformer


Cannot reproduce the evaluation numbers for motion planning #159


youngsjin92 commented 2 months ago

Compared with the planning evaluation numbers in Table 1 of your paper, the mini and small models I trained give worse evaluation numbers, as shown in the table below.

I used the data you shared in the following link. I ran training with the first command example under "To train and evaluate during training:" and checked that my training arguments are identical to yours.

Are any additional steps required to reproduce your numbers (such as using pre-trained weights, data augmentation/processing, or others)?


|  | 8sADE | 3sFDE | 5sFDE | 8sFDE | MR |
| -- | -- | -- | -- | -- | -- |
| Paper - Mini | 2.07 | 1.20 | 2.43 | 5.14 | 0.067 |
| Re-prod - Mini | 3.12 | 1.93 | 3.86 | 6.99 | 0.162 |
| Paper - Small | 1.91 | 1.05 | 2.22 | 4.83 | 0.049 |
| Re-prod - Small | 2.08 | 1.19 | 2.47 | 5.05 | 0.070 |
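
For reference, here is a minimal sketch of how I understand these displacement metrics to be computed (this is not the repository's evaluation code; the (T, 2) trajectory shapes, the 0.1 s step, and the 2 m miss-rate threshold are assumptions on my side):

```python
import numpy as np

def displacement_metrics(pred, gt, dt=0.1, miss_threshold=2.0):
    """pred, gt: (T, 2) arrays of xy positions sampled every `dt` seconds."""
    dist = np.linalg.norm(pred - gt, axis=-1)  # per-step L2 displacement error

    def step(seconds):
        return int(round(seconds / dt)) - 1    # index of the step at `seconds`

    metrics = {"8sADE": dist[: step(8) + 1].mean()}        # mean error over the first 8 s
    for s in (3, 5, 8):
        metrics[f"{s}sFDE"] = dist[step(s)]                 # displacement error at 3 / 5 / 8 s
    metrics["MR"] = float(dist[step(8)] > miss_threshold)   # miss if final error exceeds threshold
    return metrics
```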

Thanks.

larksq commented 2 months ago

Hi, glad to help reproduce the training results.

The most likely cause of the gap is the size of the training dataset. Our original training used a dataset with about 700M data samples. The dataset from the link has about 70M training samples at 1/10 of the sampling frequency. The easiest way to scale up is to add '--augment_index 5' to your training arguments. This arg applies a random index offset between -5 and 5, which is effectively the same as training on a 10x denser frame sampling.
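
A rough sketch of what such a random index-offset augmentation looks like conceptually (hypothetical names; not the actual code behind '--augment_index'):

```python
import random

def sample_anchor_frame(base_index, augment_index=5, num_frames=None):
    """Shift a scenario's anchor frame by a random offset in [-augment_index, augment_index].

    With raw frames stored at 10x the sampled frequency, a +/-5 offset lets each
    stored sample stand in for any of the 10 neighbouring anchor frames, which
    approximates training on the denser dataset.
    """
    offset = random.randint(-augment_index, augment_index)
    shifted = base_index + offset
    if num_frames is not None:
        shifted = max(0, min(shifted, num_frames - 1))  # stay inside the scenario
    return shifted
```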

You can try this with the mini model for quick verification. Let me know if it does not work. We will also schedule a retrain with the latest version of our code on our server to double-check.

larksq commented 2 months ago

I retrained the model with the 'augment_index' argument and got a result similar to your previous experiment.

I think the remaining gap might be due to the diffusion decoder we used in the original paper. I will schedule a rerun without the diffusion decoder for easy reproduction on the NuPlan dataset. The plan is to update the results in the README before August.

Feel free to ask if you need more eval results (like eval loss or 8sFHE perhaps?) for comparison.

youngsjin92 commented 2 months ago

Thanks for sharing your results. The numbers you shared in the previous comment come from the mini planning model with '--augment_index 5', right?

In addition, I have a few questions.

  1. Why was the number of data samples reduced from 700M to 70M?
  2. In your paper, you mention that the number of data samples is 15M (15,523,077 samples), and my training process uses 15M samples. Why do these numbers differ from the ones above (700M/70M vs. 15M)?
  3. Are the evaluation numbers in Table 1 of your original paper from models with the diffusion decoder or with the MLP decoder?
larksq commented 2 months ago

Sorry for the confusion. To your previous questions:

  1. The size of the training set can vary due to different filtering. Let me clarify a few numbers:
    • The uploaded dataset should have 7,191,710 samples (7M, not 70M; sorry for the wrong number in my previous reply).
    • With augment_index, the random frame offset makes training equivalent to a 10x sampling frequency, i.e. about 70M samples.
    • For the paper, we used an older, 15M-sample version of the training dataset.
  2. The 15M -> 7M difference might be due to the still-filtering we apply during dataset generation: about half of the samples are completely still for the whole scenario and get dropped. Check the 'filter_still' flag in generation.py for more (a rough sketch of this kind of filter is below this list).
  3. The evaluation numbers in Table 1 are from the diffusion decoder.
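
For readers following along, here is a minimal sketch of what a still-filter of this kind might do (hypothetical function name and threshold; not the actual generation.py code):

```python
import numpy as np

def is_still_scenario(ego_xy, displacement_threshold=1.0):
    """Return True if the ego trajectory barely moves over the whole scenario.

    ego_xy: (T, 2) array of ego xy positions for the scenario.
    displacement_threshold: total travelled distance (meters) below which the
    sample is treated as 'still' and dropped when filter_still is enabled.
    """
    step_dist = np.linalg.norm(np.diff(ego_xy, axis=0), axis=-1)
    return step_dist.sum() < displacement_threshold

# Usage: samples = [s for s in samples if not is_still_scenario(s["ego_xy"])]
```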
youngsjin92 commented 2 months ago

Thanks for your reply.

What confuses me is that the model using the checkpoint you shared in this link matches your STR(CKS)-16m numbers in Table 1, and that model uses an MLP decoder.

Since you said the models in Table 1 use a diffusion decoder, are the results in Table 1 from a smaller training set? (I am curious why the MLP-decoder model using your checkpoint is as good as the diffusion-decoder model.)