Thank you for the great work!

I have a question about the comparative experiments. As mentioned in Sec. 7 of the paper:

"To compare to previous approaches that operate on single RGB or RGB-D frames, we first obtain predictions on each individual frame, and then merge all predictions together in the 3D space of the scene."

Does this mean that for each frame you run the model forward and obtain a 3D bounding-box prediction? I cannot find correspondence annotations between the 2D bounding boxes and the 3D boxes, so I wonder how these models are trained without them: for an object in an RGB-D frame, how is the ground-truth 3D bounding box obtained? Or do you simply extract the 2D detection results and project them into 3D space via the camera parameters?
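To make the last option concrete, here is a minimal sketch of what I mean by lifting a 2D detection into 3D via the depth map and camera parameters. The function name, the input conventions, and the axis-aligned output box are my own assumptions, not anything taken from the paper:

```python
import numpy as np

def lift_2d_box_to_3d(box_2d, depth, K, cam_to_world):
    """Back-project the pixels inside a 2D box into world space and
    fit an axis-aligned 3D box around them.

    box_2d       : (x_min, y_min, x_max, y_max) in pixel coordinates
    depth        : (H, W) depth map in meters, 0 where invalid
    K            : (3, 3) camera intrinsics
    cam_to_world : (4, 4) camera-to-world extrinsics
    """
    x0, y0, x1, y1 = [int(round(v)) for v in box_2d]
    patch = depth[y0:y1, x0:x1]

    # Pixel grid covering the 2D box; keep only pixels with valid depth.
    ys, xs = np.mgrid[y0:y1, x0:x1]
    valid = patch > 0
    z = patch[valid]
    u = xs[valid].astype(np.float64)
    v = ys[valid].astype(np.float64)

    # Back-project to camera coordinates with the pinhole model.
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    pts_cam = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z], axis=1)

    # Transform the points into the scene's world frame.
    pts_h = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)
    pts_world = (cam_to_world @ pts_h.T).T[:, :3]

    # Axis-aligned 3D box (min corner, max corner) around the points.
    return pts_world.min(axis=0), pts_world.max(axis=0)
```

In this sketch the result is an axis-aligned box in the scene's world frame, so per-frame boxes from different views could then be merged in the 3D space of the scene. I am not sure whether this matches what you actually did, which is why I am asking about the scripts below.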
Would you mind providing the processing scripts? Thanks in advance.