HorizonRobotics / GUMP

Generative model for Unified Motion Planning tasks
Apache License 2.0

Plans for releasing training logs #5

Closed: zhouyunsong closed this issue 2 months ago

zhouyunsong commented 3 months ago

Hello @Yihanhu, thank you for your impressive work. I have attempted to reproduce the training results on nuPlan using the provided configs, but the training loss (around 3.8) remains far above the level of the checkpoint you provided (around 1.0). Any comments on the training process? Or do you have any plans to release the logs or provide more details (such as the number of GPUs, training settings, pre-trained models, etc.)? Thank you very much.

Yihanhu commented 3 months ago

Hi @zhouyunsong, the training process you described seems abnormal. The loss should typically drop below 1 fairly quickly. I am currently re-running training to reproduce the log; please give it a moment.

zhouyunsong commented 3 months ago

Thanks for your reply! For ease of reproduction, I ran train_nuplan_model_v1_1.sh on 8 A100 GPUs with the model set to gump_nuplan_llama_sm_v1_1. I would also like to ask whether the model needs to be trained in multiple stages (e.g., image encoder, embedder, transition_model, tokendecoder), or whether training from scratch in a single run is fine.

Yihanhu commented 3 months ago

Training from scratch is fine, and the training process looks OK on my end. Did you check the TensorBoard visualization at the end of each epoch? Does the training target look normal?

zhouyunsong commented 3 months ago

The training target seems okay: [image]

But the visualization after 15 epochs is not good (the loss is around 3.2 at this point): [image]

Here is the config file from the experiment log for reference; please correct me if anything is wrong: config.txt

Yihanhu commented 3 months ago

Could you try setting shuffle=true right here? (https://github.com/HorizonRobotics/GUMP/blob/main/nuplan_extent/planning/script/config/common/model/gump_nuplan_lamma_sm_v1_1.yaml#L28) Also, may I ask how many samples are in your training set?
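For clarity, a minimal sketch of what the suggested edit might look like in that YAML file; the surrounding keys are placeholders and only the shuffle flag itself comes from the comment above (its exact nesting in the file may differ):

```yaml
# gump_nuplan_lamma_sm_v1_1.yaml (illustrative excerpt, not the full config)
# ... other model/data settings left unchanged ...
shuffle: true   # flip this flag to true as suggested at the linked line (L28)
```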

Yihanhu commented 3 months ago

Here is the training loss curve for reference: [image]

zhouyunsong commented 3 months ago

Thank you sincerely for your help! I'll try training again with the modified config and let you know the final results.

Keirra0116 commented 1 week ago

Hi @zhouyunsong, I am facing a similar problem: the loss remains high and doesn't drop significantly. Were you able to resolve this issue? If so, could you share any insights or changes you made to get it working? Thank you!