Hi @zhouyunsong, the training process you described seems abnormal; the loss should typically drop below 1 fairly quickly. I am currently re-running training to reproduce the log, so please bear with me for a bit.
Thanks for your reply! For ease of reproduction, I ran train_nuplan_model_v1_1.sh on 8 A100 GPUs, with the model set to gump_nuplan_llama_sm_v1_1. I would also like to ask whether the model needs to be trained in multiple stages (e.g. image encoder, embedder, transition_model, tokendecoder), or whether a single training run from scratch is fine.
Training from scratch is fine; the training process looks okay on my end. Did you check the TensorBoard visualization at the end of each epoch? Does the training target look normal?
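If it helps, the logged scalars can also be inspected programmatically instead of through the TensorBoard UI. Below is a minimal sketch using TensorBoard's EventAccumulator; the log directory path and the scalar tag name are assumptions, so substitute whatever your experiment actually writes.

```python
# Minimal sketch: read loss scalars from a TensorBoard event-file directory.
# NOTE: the log directory and the tag name ("loss") are assumptions -- replace
# them with the actual paths/tags written by your training run.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

log_dir = "exp/your_experiment/tensorboard"  # hypothetical path
acc = EventAccumulator(log_dir)
acc.Reload()

print("available scalar tags:", acc.Tags()["scalars"])

# Pick whichever tag corresponds to the training loss in your run.
tag = "loss"  # assumption
for event in acc.Scalars(tag):
    print(f"step={event.step:>8d}  loss={event.value:.4f}")
```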
The training target looks okay (see the attached screenshot), but the visualization after 15 epochs is not good (the loss is around 3.2 at this point; screenshot attached). Here is the reference config file from the experiment log; please correct me if anything is wrong: config.txt
Could you try setting shuffle=true right here? (https://github.com/HorizonRobotics/GUMP/blob/main/nuplan_extent/planning/script/config/common/model/gump_nuplan_lamma_sm_v1_1.yaml#L28) May I know how many samples are in your training set?
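For context, that flag corresponds to the usual dataset-shuffling behaviour. The snippet below is a generic PyTorch illustration (not GUMP's actual dataloader code) of what shuffle=true changes: without it, consecutive batches come from correlated, ordered samples, which can stall convergence.

```python
# Generic PyTorch illustration of the shuffle setting (not GUMP's actual code).
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(8).float().unsqueeze(1))

ordered = DataLoader(dataset, batch_size=4, shuffle=False)
shuffled = DataLoader(dataset, batch_size=4, shuffle=True)   # <- what shuffle=true enables

print("shuffle=False batches:", [b[0].squeeze(1).tolist() for b in ordered])
print("shuffle=True  batches:", [b[0].squeeze(1).tolist() for b in shuffled])
```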
Here is the training loss curve (screenshot attached).
Thank you sincerely for your help! I'll retrain with the modified config and let you know the final results.
Hi @zhouyunsong, I am facing a similar problem: the loss remains high and doesn't drop significantly. Were you able to resolve this issue? If so, could you share any insights or changes you made to get it working? Thank you!
Hello @Yihanhu, thank you for your impressive work. I have attempted to reproduce the training results on nuPlan using the provided configs, but my training loss (around 3.8) stays far above that of the checkpoint you provided (around 1.0). Do you have any comments on the training process? Or do you plan to release the training logs or provide more details (such as the number of GPUs, training settings, pre-trained models, etc.)? Thank you very much.