gaojunbin opened this issue 2 years ago
Hi, Mr. Bai @XuyangBai
I have trained Transfusion-L on nuScenes with your transfusion_nusc_voxel_L.py config file.
I believe it is the same config as the one used for Table 11 in your paper (Transfusion-L w/ VoxelNet: NDS 70.1 & mAP 65.1).
In my training, I got NDS 64.4 & mAP 55.7.
More details can be seen in the training log: http://www.junbin.xyz/reference/20220701_211652.log
Can you help me? Thank you.
Sorry, I now notice that the fade strategy is important. But I am confused: if the model is trained with db_sampler for all 20 epochs, should the performance really drop by about 10 points of mAP (55.7 vs. 65.1)?
I also saw the related issue #24. It seems that training the model with the db sampler gives even worse performance than training without it.
Is this normal? Could you give more information?
I will continue by fine-tuning the model w/o the db sampler from epoch 15 and will report the result here.
BTW, could you share the training log you mentioned in issue #24? My email is junbingao@hust.edu.cn.
Thanks a lot!
@gaojunbin Hello, have you finished the training schedule? I would appreciate it if you could share the log with the fade strategy. Thanks a lot!
@yangsijing1995 Hello, I will post the results w/ the fade strategy here later today.
@yangsijing1995 @XuyangBai
Hi, I trained Transfusion-L with the fade strategy (first train 15 epochs with db_sampler, then fine-tune for 5 epochs w/o db_sampler). The config is the same as yours; see the sketch below for the second stage. More details can be seen in the log: http://www.junbin.xyz/reference/20220706_144643.log
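Concretely, the second stage amounts to a config change along these lines. This is only a rough sketch in mmdet3d config style; the full transform list is the one in transfusion_nusc_voxel_L.py, and the checkpoint path is just where my stage-1 run wrote it:

```python
# Fade strategy, stage 2 (sketch): same config as stage 1, but with the
# GT-Paste augmentation (ObjectSample / db_sampler) removed from the
# training pipeline, resuming from the stage-1 checkpoint.
train_pipeline = [
    dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=5, use_dim=5),
    # ... the remaining transforms from transfusion_nusc_voxel_L.py,
    # with dict(type='ObjectSample', db_sampler=db_sampler) deleted ...
]
data = dict(train=dict(pipeline=train_pipeline))
load_from = 'work_dirs/transfusion_nusc_voxel_L/epoch_15.pth'  # stage-1 weights
total_epochs = 5  # fine-tune for 5 more epochs
```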
Compared to the results in the paper, there is still a gap (NDS 70.1 vs. 68.6 & mAP 65.1 vs. 62.8). Can you help me?
Thanks a lot.
@gaojunbin I changed samples_per_gpu/lr to 4/0.0002 and got a better result (mAP 64.72, NDS 69.55), but I don't think it is the key factor in the performance gap; a sketch of the change is below. I'm sorry that I can't share the training log due to my company's policy.
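In config terms the change is just the following (a sketch assuming mmcv-style config inheritance, so only the overridden fields are shown):

```python
# Sketch of the batch-size/LR change described above. With 8 GPUs this
# gives an effective total batch size of 8 * 4 = 32.
data = dict(samples_per_gpu=4)
optimizer = dict(lr=0.0002)
```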
@yangsijing1995 Thanks for the information. BTW, have you tested a batch size of 8 with the official config file? And can you share the result of Transfusion-L? Thanks a lot.
@gaojunbin Yeah, my configuration is 8 GPUs * 4 samples_per_gpu. Sorry that I can't share the result due to my company's policy, but I will paste part of my training log here:

2022-07-06 17:26:21,871 - mmdet - INFO - Epoch [16][50/4004] lr: 7.381e-04, eta: 13:00:32, time: 2.345, data_time: 0.092, memory: 10059, loss_heatmap: 0.5564, layer_-1_loss_cls: 0.1033, layer_-1_loss_bbox: 0.5618, matched_ious: 0.5610, loss: 1.2215, grad_norm: 0.8174
2022-07-06 17:28:14,237 - mmdet - INFO - Epoch [16][100/4004] lr: 7.349e-04, eta: 12:42:20, time: 2.247, data_time: 0.011, memory: 10467, loss_heatmap: 0.5376, layer_-1_loss_cls: 0.0935, layer_-1_loss_bbox: 0.5504, matched_ious: 0.5641, loss: 1.1815, grad_norm: 0.6828
2022-07-06 17:30:06,197 - mmdet - INFO - Epoch [16][150/4004] lr: 7.318e-04, eta: 12:34:08, time: 2.239, data_time: 0.011, memory: 10467, loss_heatmap: 0.5263, layer_-1_loss_cls: 0.0916, layer_-1_loss_bbox: 0.5357, matched_ious: 0.5643, loss: 1.1536, grad_norm: 0.6715
2022-07-06 17:31:59,354 - mmdet - INFO - Epoch [16][200/4004] lr: 7.286e-04, eta: 12:31:03, time: 2.263, data_time: 0.012, memory: 10467, loss_heatmap: 0.5362, layer_-1_loss_cls: 0.0934, layer_-1_loss_bbox: 0.5418, matched_ious: 0.5637, loss: 1.1714, grad_norm: 0.6823
I also got worse performance (mAP 63.06). Did you find the cause of the performance drop and manage to reproduce the numbers in the paper? Thanks.
@gaojunbin Hello, I have the same problem. Did you find the cause? Thanks!
@gaojunbin I also encountered a similar problem. Have you solved the problem? Thanks.
@fcjian @Leedonus @Tanzichang Sorry for the late reply. I haven't solved the problem. I trained again repeatedly and got similar results. I also found that replacing the transformer decoder head with a CenterPoint head gives a similar result (NDS 68.51 with the fade strategy). I did not continue to explore where the gap comes from, nor did I get the author's reply. If anyone solves this problem, you are welcome to discuss it in this issue. Thanks!
Hi @gaojunbin, sorry for the late reply; it seems I missed this issue. I just checked your description, and your config file and training log both look good to me, but with a consistently higher loss compared with mine. Currently, I also have no idea about this performance gap, but I suspect the problem is coming from the gt database; you might check the gt database generating process (see the sketch below for one way to inspect it). I will send the training log from another user's reproduction to your email for reference.
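If it helps, something like the following can sanity-check the generated database. This is a minimal sketch assuming the standard mmdet3d dbinfos layout (nuscenes_dbinfos_train.pkl maps each class name to a list of cropped-instance info dicts with a num_points_in_gt field); the path is an assumption about your data directory:

```python
# Print per-class statistics of the GT database used by the db_sampler.
import pickle

with open('data/nuscenes/nuscenes_dbinfos_train.pkl', 'rb') as f:
    db_infos = pickle.load(f)

for cls_name, infos in sorted(db_infos.items()):
    pts = [info['num_points_in_gt'] for info in infos]
    print(f'{cls_name}: {len(infos)} instances, '
          f'mean points per instance: {sum(pts) / len(pts):.1f}')
```

If the per-class instance counts look very different from other reproductions, regenerating the database with tools/create_data.py is worth trying.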
@XuyangBai Thanks for your reply. I will check it.
Hi junbin, I met the same problem and I also don't know why. For the pillar backbone, training with GT-Aug for 20 epochs drops performance compared with CenterPoint-pillar, and for the voxel backbone the performance also drops compared with CenterPoint. My eval result of epoch_20.pth with the voxel backbone is almost the same as yours: mAP 0.5527, NDS 0.6426. I just wonder, did you find out why?
Hi! I'm also trying to reproduce the result of Transfusion-L. I haven't finished training yet, but I find the preliminary results are somewhat low in my training with samples_per_gpu=6.

After epoch 1:
loss/object/loss_heatmap: 1.0337, loss/object/layer_-1_loss_cls: 0.2027, loss/object/layer_-1_loss_bbox: 1.3244, stats/object/matched_ious: 0.3770, loss: 2.5607
object/nds: 0.3508, object/map: 0.3221

After epoch 2:
loss/object/loss_heatmap: 0.8460, loss/object/layer_-1_loss_cls: 0.1480, loss/object/layer_-1_loss_bbox: 1.0419, stats/object/matched_ious: 0.4366, loss: 2.0360
object/nds: 0.5025, object/map: 0.4390
Compared to this log (http://www.junbin.xyz/reference/20220701_211652.log), my losses are smaller and matched_ious are bigger, but my mAP and NDS are worse. Is this normal?
Also, could you please send me the log you mentioned above, i.e., the reproduction training log of Transfusion-L from someone else? Thank you so much!! My email is: kiki_jiang@sjtu.edu.cn