Open — d1024choi opened this issue 1 year ago
Hello, bro! Would you be willing to share your reproduced project? I have been waiting a long time for the official code release.
I have been struggling to reproduce the paper, but unfortunately I find it hard to get results close to the published ones. I am not sure I will manage, but I would love to share what I have.
Your assistance would be truly appreciated! I was wondering if there is a link where I can access the code you've reproduced, or if you could send it to my email (wangzhechao21@mails.ucas.ac.cn)?
I think your work is based on BEVFormer (ECCV 22), and you significantly improved the segmentation performance by introducing two main ideas:
- making BEV queries directly refer to surrounding images from previous time steps during deformable attention (see my sketch of how I imagine this right after this list);
- a spatio-temporal hierarchical transformer decoder.
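To check whether I understood the first idea correctly, here is a minimal sketch of how I currently imagine the temporal deformable attention. This is only my guess at the mechanism, not your code: the single-camera / single-feature-level setup and all names (`TemporalImageCrossAttention`, `img_feats`, `ref_points`) are my own assumptions.

```python
# Minimal sketch (my assumption) of BEV queries sampling image features from
# the current AND previous time steps, in deformable-attention style.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalImageCrossAttention(nn.Module):
    def __init__(self, embed_dim=256, num_points=4, num_timesteps=3):
        super().__init__()
        self.num_points = num_points
        self.num_timesteps = num_timesteps
        # Each BEV query predicts sampling offsets and attention weights for
        # every (timestep, point) pair, as in deformable attention.
        self.offsets = nn.Linear(embed_dim, num_timesteps * num_points * 2)
        self.weights = nn.Linear(embed_dim, num_timesteps * num_points)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, bev_queries, img_feats, ref_points):
        # bev_queries: (B, Nq, C)      -- one query per BEV grid cell
        # img_feats:   (B, T, C, H, W) -- image features for current + past frames
        # ref_points:  (B, T, Nq, 2)   -- BEV cells projected into each frame's
        #              image plane, normalized to [0, 1] (projection assumed given)
        B, Nq, C = bev_queries.shape
        T, P = self.num_timesteps, self.num_points

        offsets = self.offsets(bev_queries).view(B, Nq, T, P, 2)
        weights = self.weights(bev_queries).view(B, Nq, T * P).softmax(-1)

        sampled = []
        for t in range(T):
            # Sampling location = projected reference point + learned offset,
            # converted to grid_sample's [-1, 1] coordinate convention.
            loc = ref_points[:, t].unsqueeze(2) + offsets[:, :, t]     # (B, Nq, P, 2)
            grid = loc * 2.0 - 1.0
            feat = F.grid_sample(
                img_feats[:, t], grid, mode="bilinear", align_corners=False
            )                                                          # (B, C, Nq, P)
            sampled.append(feat.permute(0, 2, 3, 1))                   # (B, Nq, P, C)
        sampled = torch.cat(sampled, dim=2)                            # (B, Nq, T*P, C)

        # Weighted sum over all sampled points across all time steps.
        out = (sampled * weights.unsqueeze(-1)).sum(dim=2)             # (B, Nq, C)
        return self.out_proj(out)


# Quick shape check
if __name__ == "__main__":
    attn = TemporalImageCrossAttention()
    q = torch.randn(2, 50 * 50, 256)
    feats = torch.randn(2, 3, 256, 28, 35)
    refs = torch.rand(2, 3, 50 * 50, 2)
    print(attn(q, feats, refs).shape)  # torch.Size([2, 2500, 256])
```

If the actual mechanism differs (for example, in how the BEV reference points are projected into previous frames or how the offsets are scaled), a short pointer would already help me a lot.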
While reading your paper, a few questions came to mind.
In the perception experiments in Table 2, BEVFormer-static slightly underperforms TBP-Former-static (44.4 vs. 44.8). In my reproduction experiments, when BEVFormer-static is fed 224 x 280 images (instead of 900 x 1600), its performance drops to 39.0, which is nearly 6 points below TBP-Former-static. Because TBP-Former-static does not use your first idea, I speculate that this gain comes almost entirely from your second idea. That would mean the performance difference between TBP-Former-static and the full TBP-Former should also be around 6 points, yet Table 2 reports only a 1.4-point difference (44.8 vs. 46.2).
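Just to make the arithmetic behind my confusion explicit, here is the comparison I am doing (the 39.0 figure comes from my own 224 x 280 reproduction, so it is my assumption; the other numbers are taken from Table 2):

```python
# Numbers from Table 2 plus my own low-resolution BEVFormer-static run.
bevformer_static_paper = 44.4   # Table 2 (900 x 1600 input, I assume)
bevformer_static_mine = 39.0    # my reproduction with 224 x 280 input
tbpformer_static = 44.8         # Table 2
tbpformer_full = 46.2           # Table 2

# Gap I attribute to the hierarchical decoder (second idea), since
# TBP-Former static does not use the first idea:
gain_from_decoder = tbpformer_static - bevformer_static_mine   # ~5.8 points

# Gap actually reported between TBP-Former static and the full model:
gain_static_to_full = tbpformer_full - tbpformer_static        # 1.4 points

print(gain_from_decoder, gain_static_to_full)
```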
Could you give me some hints about what I am missing in the discussion above? Actually, I have been struggling to implement your work and to reproduce the results reported in your paper, and I really want to know what I am missing in my implementation.
Your feedback is greatly appreciated.
Hi, I'm working on a future instance segmentation task based on BEVFormer. I guess the BEVFormer result reported in this paper is based on the original resolution (1600 x 900)?