1-1. You forgot to freeze the batch norm weights in the encoder. Batch norm layers work differently from the other layers: their running statistics are updated during every forward pass in training mode rather than through gradients, so setting requires_grad = False alone does not freeze them.
It is a bit tricky to freeze batch norms. Methods like DETR use a custom batch norm implementation to freeze the weights: https://github.com/facebookresearch/detr/blob/main/models/backbone.py Other people recommend calling module.eval() on the batch norm modules: https://discuss.pytorch.org/t/how-to-freeze-bn-layers-while-training-the-rest-of-network-mean-and-var-wont-freeze/89736/10
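For illustration, here is a minimal sketch of the DETR-style approach (simplified from the linked file, not the exact DETR code): the affine weights and running statistics are registered as buffers, so neither the optimizer nor forward passes in training mode can change them.

```python
import torch
import torch.nn as nn

class FrozenBatchNorm2d(nn.Module):
    """BatchNorm2d with all state stored as buffers, so nothing ever updates.

    Simplified sketch of the idea behind DETR's FrozenBatchNorm2d.
    """

    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.eps = eps
        # Buffers are saved/loaded with the state dict but are not parameters,
        # so the optimizer never touches them and no running stats are tracked.
        self.register_buffer("weight", torch.ones(num_features))
        self.register_buffer("bias", torch.zeros(num_features))
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

    def forward(self, x):
        # Fold the frozen stats into a single scale and shift, reshaped to
        # (1, C, 1, 1) so they broadcast over batch and spatial dimensions.
        scale = self.weight / (self.running_var + self.eps).sqrt()
        shift = self.bias - self.running_mean * scale
        return x * scale.view(1, -1, 1, 1) + shift.view(1, -1, 1, 1)
```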
Freezing is especially important for these pre-trained weights because their training code had some bugs and the batch norms were not trained properly (so if you do train them now, you change the network quite a bit).
You could use these weights instead, where the batch norms were trained properly (the models have similar performance): https://drive.google.com/file/d/1CeWcADEOf4DoPywxaWmKIGSpvUDTuLRK/view?usp=sharing
> right_dataset_23_11
Seems like a bad idea to me to finetune the model only on right turns; it might forget how to turn left, for example. If you want to fine-tune with less data overall, you could use the whole dataset for fewer epochs.
Also, your figures are broken; I can't see them.
> Single GPU; I didn't modify hyper-parameters such as learning rate or optimizer (kept the code as is).
You did implicitly change the batch size. In PyTorch the batch size is set per GPU, so if you reduce the number of GPUs from 8 to 1, the effective batch size becomes 8x smaller. We trained with 2080 Ti GPUs, which have ~11 GB of memory; if your GPU has more, you could simply increase the batch size. If not, a common trick is to reduce the learning rate proportionally (8x in this case): with a smaller batch size you take more gradient steps overall, and the smaller learning rate can counter the noisier gradients.
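As a minimal sketch, the linear scaling rule looks like this (the variable names, values, and the AdamW choice are illustrative, not taken from the Transfuser training code):

```python
import torch

base_lr = 1e-4      # learning rate tuned for the original multi-GPU setup
base_num_gpus = 8   # GPUs used in the reference training run
num_gpus = 1        # GPUs available now

# The effective batch size shrinks by base_num_gpus / num_gpus, so scale the
# learning rate down by the same factor (linear scaling rule).
scaled_lr = base_lr * num_gpus / base_num_gpus  # 1e-4 / 8 = 1.25e-5

model = torch.nn.Linear(4, 2)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=scaled_lr)
```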
> I used transfer learning to maintain the visual and spatial capabilities of the Transfuser model while updating the parameters involved in waypoint prediction to fit the new dataset. Is the research design of "only unfreezing some parameters of the Transfuser pre-trained model for training" fundamentally flawed?
No, I think the freezing just didn't work as intended. Finetuning on top of frozen layers should work (as long as the trainable layers are at the end of the network). A sketch of how to do this is below.
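For reference, a minimal sketch of finetuning only the trailing layers while also keeping the batch norm statistics frozen. The ResNet backbone here is a stand-in; in Transfuser the join/decoder/output modules would play the role of the trainable head:

```python
import torch
import torchvision

model = torchvision.models.resnet18(weights=None)  # stand-in backbone

# 1) Freeze everything, then unfreeze only the final layers.
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True

# 2) Keep batch norm layers in eval mode during training so their running
#    statistics do not drift (requires_grad = False alone does not do this).
def set_bn_eval(module):
    if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
        module.eval()

model.train()
model.apply(set_bn_eval)  # must be re-applied after every model.train() call

# 3) Give the optimizer only the trainable parameters.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```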
> If that is not the case, then following point 1, I set wp_only = 1 to consider only the waypoint ...
Same problem as point 1.
> When evaluating with the pre-trained model, the results are good, but when re-training and performing additional training, the results deteriorate. Could this be due to using fewer data samples or a different training environment compared to yours?
See above: you shouldn't train only on right turns.
Your paper, GitHub, and issues have been very helpful to me. Thank you. I would like to perform transfer learning using the pre-trained model you provided.
Currently, I am facing difficulties with model training because I only have a single GPU, which makes training take a lot of time, so I am using the pre-trained model to achieve my goal. I am encountering some issues in the process, and I would like to ask you a few questions and seek your advice.
Before using my own dataset, I performed transfer learning using the pre-trained model and only a subset of the dataset you provide.
1-1. I unfroze only the join, decoder, and output parts of the model while freezing the other parameters. Additionally, I selected three components that I believe influence waypoint prediction and tried the following combinations:
-> ① [join, decoder, output] ② [decoder, output] ③ [output]
<Figure 1>
The loss is as follows. (It is higher than the loss for 1-2.) <Figure 2>
The actual driving results are not satisfactory. The vehicle either does not go where expected or does not behave correctly.
1-2. I re-trained the pre-trained model without modifying the parameters (all parameters were unfrozen). <Figure 3>
I continued training from a checkpoint. Although the results for BEV, semantic, and depth were better than in 1-1, they were still not as good as simply evaluating the pre-trained model. (The vehicle either stops while driving or turns with too large a radius, leading to collisions.)
And the loss is as follows. <Figure 4>
The images I checked in the visualization folder for both 1-1 and 1-2 did not show good results, especially for BEV.
My setup is as follows:
=========================================================================
Thank you very much for reading through this lengthy message. I have some questions regarding the above content:
I used transfer learning to maintain the visual and spatial capabilities of the Transfuser model while updating the parameters involved in waypoint prediction to fit the new dataset. Is the research design of "only unfreezing some parameters of the Transfuser pre-trained model for training" fundamentally flawed?
If that is not the case, then following point 1, I set wp_only = 1 to consider only the waypoint loss and also froze the parameters. However, BEV still comes out as shown in <Figure 1>. I am curious why the prediction results are not good even though the BEV loss does not contribute to backpropagation. The pre-trained model does not have these issues, and the re-training results were relatively good. Additionally, could the poor BEV results affect waypoint prediction? -> I only changed parameters to True/False; the data and training environment were identical. (Cases 1-1, 1-2, 2)
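For context, here is a minimal sketch of how a wp_only-style flag is typically used to gate auxiliary losses (the function and loss names are illustrative assumptions, not the actual Transfuser code):

```python
# Illustrative sketch of gating auxiliary losses behind a wp_only flag;
# the names here are assumptions, not the actual Transfuser code.
def total_loss(wp_loss, bev_loss, semantic_loss, depth_loss, wp_only=False):
    if wp_only:
        # Only the waypoint loss reaches the backward pass. The BEV, semantic,
        # and depth heads still run forward, and their outputs can still drift
        # if the batch norm statistics of shared layers keep updating (point 1).
        return wp_loss
    return wp_loss + bev_loss + semantic_loss + depth_loss
```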
When evaluating with the pre-trained model, the results are good, but when re-training and performing additional training, the results deteriorate. Could this be due to using fewer data samples or a different training environment compared to yours? -> Optimizer, parallel training, etc. -> Dataset: right_dataset_23_11 (number of frames: 16,470)
Thank you once again for taking the time to read my lengthy message.
The questions I have are related to my personal research area rather than issues arising from the process of training or evaluating the Transfuser model. I have reviewed the code and considered various aspects on my own, but I found it difficult to identify the cause. I would greatly appreciate your insights on this matter.