TuSimple / centerformer

Implementation for CenterFormer: Center-based Transformer for 3D Object Detection (ECCV 2022)
MIT License

some questions about nuscenes multi-task support #17

Open Liaoqing-up opened 1 year ago

Liaoqing-up commented 1 year ago

Thanks for releasing the nuScenes dataset support. I have some questions about the multi-task implementation. I see in the code that you define `obj_num=500` for each task and then add the `task_id` to the positional embedding to identify each task in the RPN transformer. Unfortunately, this increases the computation, and my machine throws a CUDA out-of-memory error.

My intuitive idea for the multi-task implementation is that each task has its own head for generating its heatmap. All heatmaps are then concatenated into one tensor, the top-500 center queries are selected from it and sent to the RPN transformer, and the positional feature is just the regular x and y coordinates. At the final output stage, each task applies its own detection head to the transformer output features, which would avoid the extra computation in the transformer layers.

This is my first thought. I wonder if you have experimented with this approach, and whether it has any drawbacks. Could you share any results or conclusions? It is very important to me. Thank you ~
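To make the idea concrete, here is a rough sketch of the merged-heatmap selection I have in mind (NumPy only for illustration; the function name and shapes are my own, not from the repo):

```python
import numpy as np

def topk_centers_merged(heatmaps, k=500):
    """Pick the top-k centers from the concatenated per-task heatmaps.

    heatmaps: (num_tasks, H, W), one heatmap per task-specific head.
    Returns scores, task_ids, ys, xs, each of shape (k,).
    """
    num_tasks, H, W = heatmaps.shape
    flat = heatmaps.reshape(-1)                 # (num_tasks * H * W,)
    idx = np.argpartition(flat, -k)[-k:]        # unordered top-k
    idx = idx[np.argsort(flat[idx])[::-1]]      # sort descending by score
    task_ids = idx // (H * W)                   # which task each candidate came from
    ys = (idx % (H * W)) // W                   # regular (x, y) grid coordinates,
    xs = idx % W                                # used directly as the pos feature
    return flat[idx], task_ids, ys, xs
```

The per-task detection heads would then be applied only to the transformer outputs whose `task_ids` match, so the transformer itself processes a single set of k queries instead of `num_tasks * obj_num`.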

Liaoqing-up commented 1 year ago

By the way, have you experimented with temporal (time-sequence) fusion through the RPN transformer on the nuScenes dataset? How well does it work?

edwardzhou130 commented 1 year ago

> Thanks for releasing the nuscenes dataset code support. I have some questions about the implement of the multi-tasks. […]

Hi, sorry for the late reply. I agree with you that the current method is a bit cumbersome; some tasks may not need that many center candidates. But there are some issues with selecting the top K centers from a merged heatmap:

  1. It is hard to merge the scores or select a suitable threshold for the center candidates, since some tasks may have systematically lower heatmap scores than others.
  2. Different tasks may respond strongly in the same region. I found the results are better when each task is handled separately.
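To illustrate the first point: if one task's heatmap scores are systematically lower, a merged top-k can starve it of candidates, while per-task selection always keeps k candidates for every task. A small sketch (NumPy, with made-up score scales for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two tasks with different score scales: task 0 scores in [0.5, 0.9],
# task 1 only in [0.1, 0.3] (e.g. a rare class the head is less confident on).
hm = np.stack([rng.uniform(0.5, 0.9, (8, 8)),
               rng.uniform(0.1, 0.3, (8, 8))])

k = 10
# Merged selection: top-k over both heatmaps at once.
merged_idx = np.argsort(hm.reshape(-1))[::-1][:k]
merged_task_ids = merged_idx // 64
# Every merged candidate comes from task 0; task 1 gets nothing.
print(np.bincount(merged_task_ids, minlength=2))   # -> [10  0]

# Per-task selection: top-k within each task separately, so each task
# is guaranteed k candidates regardless of its score scale.
per_task_idx = np.argsort(hm.reshape(2, -1), axis=1)[:, ::-1][:, :k]
```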

I also found the increase in computation cost to be relatively small, since the transformer part of CenterFormer is already lightweight, so I chose to implement it this way. If you still run into memory issues, consider reducing the batch size or `obj_num`.
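For reference, the task-aware positional embedding discussed above can be sketched roughly like this (a simplified NumPy illustration; the actual implementation in the repo differs in details, and `task_aware_pos_embed` is a hypothetical name):

```python
import numpy as np

def sinusoidal_embed(coords, dim):
    """Standard sinusoidal embedding of 1-D coordinates into `dim` channels."""
    freqs = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))
    ang = coords[:, None] * freqs[None, :]
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=1)

def task_aware_pos_embed(xs, ys, task_ids, task_table):
    """Combine (x, y) position with a learned per-task embedding.

    task_table: (num_tasks, dim) lookup table. Adding task_table[task_id]
    lets center queries from different tasks share one transformer while
    remaining distinguishable to the attention layers.
    """
    dim = task_table.shape[1]
    pe = sinusoidal_embed(xs.astype(float), dim) + sinusoidal_embed(ys.astype(float), dim)
    return pe + task_table[task_ids]
```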