autonomousvision / transfuser

[PAMI'23] TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving; [CVPR'21] Multi-Modal Fusion Transformer for End-to-End Autonomous Driving
MIT License

The training effect is very poor! #224

Closed zygalaxy closed 3 months ago

zygalaxy commented 3 months ago

I used four 3080 Ti GPUs and a 40-core CPU, and trained for 41 epochs; training took 37 hours. The training command was `CUDA_VISIBLE_DEVICES=0,1,2,3 OMP_NUM_THREADS=40 OPENBLAS_NUM_THREADS=1 torchrun --nnodes=1 --nproc_per_node=4 --max_restarts=0 --rdzv_id=1234576890 --rdzv_backend=c10d train.py --root_dir /home/ubuntu/ZY/ZY_2T/PROJECT_AUTO/transfuser/dataset --parallel_training 1 --batch_size 12`. I did not use the full 210 GB dataset, only about 160 GB of it (see the attached screenshot). However, the test results of the trained model are very poor. The evaluation command was `bash leaderboard/scripts/local_evaluation.sh`, and the visualization results are attached. What is the problem? I hope to get a solution, and I would be very grateful.

Kait0 commented 3 months ago

This looks like a bug or something wrong with the setup. Could you set loading to strict to see whether the model was loaded correctly? Did the training losses go down properly?
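
For reference, a minimal sketch of what strict checkpoint loading looks like in PyTorch; the model and checkpoint path below are placeholders, not the exact names used in this repo's agent code:

```python
import torch
import torch.nn as nn

# Placeholder network; in the real setup this would be the TransFuser model.
model = nn.Linear(4, 2)

# strict=True makes load_state_dict raise a RuntimeError that lists every
# missing or unexpected key instead of silently ignoring the mismatch.
state_dict = torch.load("model_41.pth", map_location="cpu")
model.load_state_dict(state_dict, strict=True)
```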

zygalaxy commented 3 months ago

I followed your instructions and encountered two issues: 1) My total training loss is shown in the figure below (training loss curve). 2) The error message with strict set to True is different for my model and for the pretrained TransFuser model (both fail to load with strict set to True). Figure 1 shows the error message for my trained model, and Figure 2 shows the error message for the TransFuser model. Since the full message cannot be displayed in a screenshot, I have also attached error.txt for my model.

The loss seems normal, so as you mentioned, the issue is likely related to model loading. However, based on the error messages, I can't determine the exact problem. I hope to get your advice. Thank you!

Kait0 commented 3 months ago

The error for the TransFuser model is normal; it was trained with an older codebase.

It probably has something to do with the name. Did you change how the model was saved here?

When training with distributed data parallel we strip the part of the parameter names that comes from DDP here.
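
For context, the usual pattern for removing the `module.` prefix that DistributedDataParallel adds to parameter names looks roughly like this (a generic sketch, not necessarily the exact code in train.py):

```python
import torch

# A checkpoint saved from a DDP-wrapped model has keys like "module.encoder.weight".
# Stripping the "module." prefix lets the plain, unwrapped model load it with strict=True.
state_dict = torch.load("model_41.pth", map_location="cpu")
state_dict = {
    (k[len("module."):] if k.startswith("module.") else k): v
    for k, v in state_dict.items()
}
```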

Are there any optimizer files in your model folder (that could get loaded accidentally)?

zygalaxy commented 3 months ago

You are right. My training output folder does contain optimizer files (Figure 1 below). My submission_agent.py is shown in Figure 2 below, and the training command was `CUDA_VISIBLE_DEVICES=0,1,2,3 OMP_NUM_THREADS=40 OPENBLAS_NUM_THREADS=1 torchrun --nnodes=1 --nproc_per_node=4 --max_restarts=0 --rdzv_id=1234576890 --rdzv_backend=c10d train.py --root_dir /home/ubuntu/ZY/ZY_2T/PROJECT_AUTO/transfuser/dataset --parallel_training 1 --batch_size 12`. What is wrong with this training result? When I use model_41.pth + optimizer_41.pth, the vehicle turns right immediately and deviates from the route. When I use only the model_41.pth file, the vehicle does not move at all (if the pretrained TransFuser model is used, the vehicle drives well).

Kait0 commented 3 months ago

Your training result seems correct; it is the inference setup that needs to change. The correct way to use the code is to copy args.txt and the model_31.pth file into a new folder (my_model in your case), with no optimizer files, and use this folder for inference. Then you should not get any error when you load the model with strict. Can you send the visualization file again once you have done this?

The inference code loads every .pth file in the given folder and ensembles the models.
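
A minimal sketch of preparing such a folder; the source path below is a placeholder, and the file names follow the ones mentioned in this thread:

```python
import shutil
from pathlib import Path

# Training log directory (placeholder) and a clean folder used only for inference.
train_dir = Path("/path/to/training_logdir")
infer_dir = Path("my_model")
infer_dir.mkdir(exist_ok=True)

# Copy only the training config and the model weights. Leaving the optimizer_*.pth
# files behind matters because the agent loads every .pth file in the folder.
shutil.copy(train_dir / "args.txt", infer_dir / "args.txt")
shutil.copy(train_dir / "model_31.pth", infer_dir / "model_31.pth")
```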

zygalaxy commented 3 months ago

Thank you very much for your help. It seems my model was fine all along; it just could not run normally because the optimizer files were in the folder. In fact, I found that both model_31.pth and model_41.pth now run normally, but it takes a long time for the vehicle to start moving, which seems overly cautious? Now I have multiple checkpoints, such as 31, 35, and 41. How should I combine them to get the best results?

Kait0 commented 3 months ago

Typically epoch 31 yields the best or close to the best results. Sometimes TransFuser gets stuck due to causal confusion; if this does not happen very often, it is normal. When it does happen, a creeping mechanism will unblock the vehicle after a while. This is discussed in Table 10 of the paper.

zygalaxy commented 3 months ago

So that means I can change the default epoch count from 41 to 31 and use the epoch-31 checkpoint as the final model, which would also reduce training time.

Kait0 commented 3 months ago

You can do that. We have done the same in our follow-up work.

zygalaxy commented 3 months ago

Ok, thank you for your patience. I appreciate it very much. I will close this issue.