TRI-ML / PF-Track

Implementation of PF-Track

Why two-step training #14

Closed JunqiaoLi closed 1 year ago

JunqiaoLi commented 1 year ago

Hello, thanks for your excellent work! I am new to end-to-end tracking, so I wonder why you separate the training into two steps (i.e., f1_q500_1600x640.py, then f3_q500_1600x640.py). What if I use the second training step directly, without the first step?

Looking forward to your reply.

ziqipang commented 1 year ago

@JunqiaoLi When I train at the 1600x640 resolution for f3, the CUDA memory of my GPUs can only support freezing the backbone. Therefore, to train the backbone to extract features suited to nuScenes, I adopt two-stage training for 1600x640: (1) train everything under the f1 setting; (2) freeze the backbone and only train the tracking part under the f3 setting.

I adopt the same idea in the smaller-resolution setting because decoupling the training into two steps also helped me iterate faster.
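
For readers following the same recipe, here is a minimal sketch of what the stage-2 overrides might look like. The `load_from` key and the `train_backbone` flag are the ones mentioned later in this thread; the checkpoint path and file layout are illustrative assumptions, not PF-Track's verbatim configs.

```python
# Hedged sketch of the two-stage recipe (illustrative, not the exact PF-Track configs).
# Stage 1: train everything, backbone included, with the f1 config (f1_q500_1600x640.py).
# Stage 2: start from the stage-1 checkpoint, freeze the backbone, train tracking with f3.

# --- overrides one would place in the f3-style config ---
load_from = 'work_dirs/f1_q500_1600x640/latest.pth'  # example path; point to your stage-1 checkpoint
train_backbone = False  # freeze the image backbone so only the tracking part is updated
```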

ziqipang commented 1 year ago

@JunqiaoLi Any follow-ups?

JunqiaoLi commented 1 year ago

@ziqipang Sorry to bother you again, I still have one more question. I have trained the f1-petr step, whose AP is 0.2886 and NDS is 0.3231. Then I train the f3-petr step using 'load_from=${f1-petr_model}', but at the beginning the loss is extremely large (>1000), mostly from loss_mem_cls (200~300), as in the picture. Have you ever seen this?

[screenshot: training log with total loss > 1000, dominated by loss_mem_cls]

ziqipang commented 1 year ago

@JunqiaoLi I cannot remember the details. Could you please check and compare against the training logs we provide (https://github.com/TRI-ML/PF-Track/blob/main/documents/pretrained.md#2-pretrained-pf-track-models-download)?

JunqiaoLi commented 1 year ago

> @JunqiaoLi I cannot remember the details. Could you please check and compare against the training logs we provide (https://github.com/TRI-ML/PF-Track/blob/main/documents/pretrained.md#2-pretrained-pf-track-models-download)?

Yes, I have checked this. The loss of your f3_all model at the beginning is about 253, then 70 (as in the picture), which is different from mine.

So you have never seen a loss around 1000 like mine?

[screenshot: reference training log, loss around 253 then 70]

ziqipang commented 1 year ago

@JunqiaoLi I see. I am not sure what the issue is here. How about trying to load from the f1 checkpoint we provide?

JunqiaoLi commented 1 year ago

> @JunqiaoLi I see. I am not sure what the issue is here. How about trying to load from the f1 checkpoint we provide?

@ziqipang Good, I'm trying that now. Another question: have you ever used R50 as the backbone? The ~1000 loss occurs with R50.

ziqipang commented 1 year ago

@JunqiaoLi I haven't tried this. Knowing that, I also suggest checking whether the features from R50 are being cast to fp16, which could result in a larger loss compared to fp32.
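
If it helps anyone debugging the same symptom, here is a hedged sketch (plain PyTorch, hypothetical helper, not PF-Track code) of how one might verify that backbone features reach the head in fp32:

```python
import torch

def ensure_fp32_features(feats):
    """Hypothetical debugging helper: cast any fp16 backbone features back to fp32.

    `feats` is assumed to be a list of feature-map tensors from the backbone/neck.
    """
    out = []
    for f in feats:
        if f.dtype == torch.float16:
            print(f"feature {tuple(f.shape)} is fp16 -> casting to fp32")
            f = f.float()
        out.append(f)
    return out
```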

JunqiaoLi commented 1 year ago

@ziqipang Hi, I think I have found the reason through many experiments. Since I have enough GPU memory, I kept "train_backbone=True" for the f3 training stage, which leads to non-convergence. When I set train_backbone=False, it is much better. This seems a little strange to me; have you ever seen this before?
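
For context, a train_backbone-style flag is typically implemented by toggling requires_grad on the backbone parameters. A minimal sketch follows; attribute names such as img_backbone are assumptions, not PF-Track's exact code.

```python
import torch.nn as nn

def set_backbone_trainable(model: nn.Module, train_backbone: bool) -> None:
    """Freeze (train_backbone=False) or unfreeze (True) the image backbone."""
    backbone = model.img_backbone  # assumed attribute name for the image backbone
    for param in backbone.parameters():
        param.requires_grad = train_backbone
    if not train_backbone:
        # Keep BatchNorm running statistics fixed while the backbone is frozen.
        backbone.eval()
```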

ziqipang commented 1 year ago

@JunqiaoLi In my own experiments (f3, 800x640, PETR, VoVNet backbone), I haven't encountered this. So you initialize the whole network from the f1 model, right?

JunqiaoLi commented 1 year ago

Hi ziqipang, I have migrated PF-Track to another dataset. After some experiments, I found that if the backbone is large, like VoVNet, the "huge loss" problem does not appear; when the backbone is too small, it can. For nuScenes, though, it works well enough.

ziqipang commented 1 year ago

@JunqiaoLi Thanks for the info!