gaojunbin opened this issue 2 years ago
Hi, Mr. Bai @XuyangBai
I have trained Transfusion-L on nuScenes with your transfusion_nusc_voxel_L.py config file.
I believe it is the same config as the one used for Table 11 in your paper (Transfusion-L w/ VoxelNet: NDS 70.1 & mAP 65.1).
In my training, I got NDS 64.4 & mAP 55.7.
More details can be seen in the training log: http://www.junbin.xyz/reference/20220701_211652.log
Can you help me? Thank you.
Sorry, I now notice that the fade strategy is important. But I am confused: if the model is trained with db_sampler for all 20 epochs, should the performance really drop by about 10 points of mAP (55.7 vs. 65.1)?
I also saw the related issue #24. It seems that training the model with the db sampler gives even worse performance than training without it.
Is this normal? Could you give more information?
I will continue by fine-tuning the model w/o the db sampler from epoch 15 and will report the result here.
BTW, could you share the training log you mentioned in issue #24? My email is junbingao@hust.edu.cn.
Thanks a lot!
@gaojunbin Hello, have you finished the training schedule? I would appreciate it if you could share the log with the fade strategy. Thanks a lot!
@yangsijing1995 Hello, I will post the results w/ the fade strategy here later today.
@yangsijing1995 @XuyangBai
Hi, I trained Transfusion-L with the fade strategy (first train 15 epochs with db_sampler, then fine-tune for 5 epochs w/o db_sampler). The config is the same as yours; see the sketch below for the second stage. More details can be seen in the log: http://www.junbin.xyz/reference/20220706_144643.log
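Concretely, the second stage amounts to a config change along these lines. This is only a rough sketch in mmdet3d config style; the full transform list is the one in transfusion_nusc_voxel_L.py, and the checkpoint path is just where my stage-1 run wrote it:

```python
# Fade strategy, stage 2 (sketch): same config as stage 1, but with the
# GT-Paste augmentation (ObjectSample / db_sampler) removed from the
# training pipeline, resuming from the stage-1 checkpoint.
train_pipeline = [
    dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=5, use_dim=5),
    # ... the remaining transforms from transfusion_nusc_voxel_L.py,
    # with dict(type='ObjectSample', db_sampler=db_sampler) deleted ...
]
data = dict(train=dict(pipeline=train_pipeline))
load_from = 'work_dirs/transfusion_nusc_voxel_L/epoch_15.pth'  # stage-1 weights
total_epochs = 5  # fine-tune for 5 more epochs
```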
Compared to the results in the paper, there is still a gap (NDS 70.1 vs. 68.6 & mAP 65.1 vs. 62.8). Can you help me?
Thanks a lot.
@gaojunbin I changed samples_per_gpu/lr to 4/0.0002 and got a better result (mAP 64.72, NDS 69.55), but I don't think it is the key factor in the performance gap; a sketch of the change is below. I'm sorry that I can't share the training log due to my company's policy.
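In config terms the change is just the following (a sketch assuming mmcv-style config inheritance, so only the overridden fields are shown):

```python
# Sketch of the batch-size/LR change described above. With 8 GPUs this
# gives an effective total batch size of 8 * 4 = 32.
data = dict(samples_per_gpu=4)
optimizer = dict(lr=0.0002)
```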
@yangsijing1995 Thanks for the information. BTW, have you tested a batch size of 8 with the official config file? And can you share the result of Transfusion-L? Thanks a lot.
@gaojunbin Yeah, my configuration is 8 GPUs * 4 samples_per_gpu. Sorry that I can't share the result due to my company's policy, but I will paste part of my training log here:

2022-07-06 17:26:21,871 - mmdet - INFO - Epoch [16][50/4004] lr: 7.381e-04, eta: 13:00:32, time: 2.345, data_time: 0.092, memory: 10059, loss_heatmap: 0.5564, layer_-1_loss_cls: 0.1033, layer_-1_loss_bbox: 0.5618, matched_ious: 0.5610, loss: 1.2215, grad_norm: 0.8174
2022-07-06 17:28:14,237 - mmdet - INFO - Epoch [16][100/4004] lr: 7.349e-04, eta: 12:42:20, time: 2.247, data_time: 0.011, memory: 10467, loss_heatmap: 0.5376, layer_-1_loss_cls: 0.0935, layer_-1_loss_bbox: 0.5504, matched_ious: 0.5641, loss: 1.1815, grad_norm: 0.6828
2022-07-06 17:30:06,197 - mmdet - INFO - Epoch [16][150/4004] lr: 7.318e-04, eta: 12:34:08, time: 2.239, data_time: 0.011, memory: 10467, loss_heatmap: 0.5263, layer_-1_loss_cls: 0.0916, layer_-1_loss_bbox: 0.5357, matched_ious: 0.5643, loss: 1.1536, grad_norm: 0.6715
2022-07-06 17:31:59,354 - mmdet - INFO - Epoch [16][200/4004] lr: 7.286e-04, eta: 12:31:03, time: 2.263, data_time: 0.012, memory: 10467, loss_heatmap: 0.5362, layer_-1_loss_cls: 0.0934, layer_-1_loss_bbox: 0.5418, matched_ious: 0.5637, loss: 1.1714, grad_norm: 0.6823
I also got worse performance (mAP 63.06). Did you find the cause of the performance drop and manage to reproduce the numbers in the paper? Thanks.
@gaojunbin Hello, I have the same problem. Did you find the cause? Thanks!
@gaojunbin I also encountered a similar problem. Have you solved the problem? Thanks.
@fcjian @Leedonus @Tanzichang Sorry for the late reply. I haven't solved the problem. I trained again repeatedly and got similar results. I also found that replacing the transformer decoder head with a CenterPoint head gives a similar result (NDS 68.51 with the fade strategy). I did not continue to explore where the gap comes from, nor did I get the author's reply. If anyone solves this problem, you are welcome to discuss it in this issue. Thanks!
Hi @gaojunbin, sorry for the late reply; it seems I missed this issue. I just checked your description, and your config file and training log both look good to me, but with a consistently higher loss compared with mine. Currently, I also have no idea about this performance gap, but I suspect the problem is coming from the gt database; you might check the gt database generating process (see the sketch below for one way to inspect it). I will send the training log from another user's reproduction to your email for reference.
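If it helps, something like the following can sanity-check the generated database. This is a minimal sketch assuming the standard mmdet3d dbinfos layout (nuscenes_dbinfos_train.pkl maps each class name to a list of cropped-instance info dicts with a num_points_in_gt field); the path is an assumption about your data directory:

```python
# Print per-class statistics of the GT database used by the db_sampler.
import pickle

with open('data/nuscenes/nuscenes_dbinfos_train.pkl', 'rb') as f:
    db_infos = pickle.load(f)

for cls_name, infos in sorted(db_infos.items()):
    pts = [info['num_points_in_gt'] for info in infos]
    print(f'{cls_name}: {len(infos)} instances, '
          f'mean points per instance: {sum(pts) / len(pts):.1f}')
```

If the per-class instance counts look very different from other reproductions, regenerating the database with tools/create_data.py is worth trying.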
@XuyangBai Thanks for your reply. I will check it.
Hi junbin, I met the same problem and I also don't know why. For the pillar backbone, training with GT-Aug for 20 epochs drops performance compared with CenterPoint-pillar, and for the voxel backbone the performance also drops compared with CenterPoint. My eval result of epoch_20.pth with the voxel backbone is almost the same as yours: mAP 0.5527, NDS 0.6426. I just wonder, did you find out why?
Hi! I'm also trying to reproduce the result of Transfusion-L. I haven't finished training yet, but I find the preliminary results are somewhat low in my training with samples_per_gpu=6.

After epoch 1:
loss/object/loss_heatmap: 1.0337, loss/object/layer_-1_loss_cls: 0.2027, loss/object/layer_-1_loss_bbox: 1.3244, stats/object/matched_ious: 0.3770, loss: 2.5607
object/nds: 0.3508, object/map: 0.3221

After epoch 2:
loss/object/loss_heatmap: 0.8460, loss/object/layer_-1_loss_cls: 0.1480, loss/object/layer_-1_loss_bbox: 1.0419, stats/object/matched_ious: 0.4366, loss: 2.0360
object/nds: 0.5025, object/map: 0.4390
Compared to this log (http://www.junbin.xyz/reference/20220701_211652.log), my losses are smaller and matched_ious are bigger, but my mAP and NDS are worse. Is this normal?
Also, could you please send me the log you mentioned above, i.e., the reproduction training log of Transfusion-L from someone else? Thank you so much!! My email is: kiki_jiang@sjtu.edu.cn