NVlabs / BundleSDF

[CVPR 2023] BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects
https://bundlesdf.github.io/

Generalizability to non-RGB images #74

Closed adiser closed 1 year ago

adiser commented 1 year ago

Hi authors, great work!

I'd like to dig deeper into the applicability of this repo for my use case.

  1. Would the method be generalizable to non-RGB images?
  2. Does this work assume a static camera pose?
  3. How is the initial pose/coordinate frame of the object decided?
  4. Any suggestion on extending this work if we have multiple cameras with known relative poses?
wenbowen123 commented 1 year ago

Thanks for your interest in our work.

1) If you mean grayscale images, yes, it should work out of the box.

2) There is no assumption about that; the camera and object can move or be static in any combination.

3) The origin is set to the center of the first frame's point cloud, and the rotation is aligned with the camera in the first frame (see the sketch below).

4) Extending to multi-cam would be quite interesting. If you want to enhance pose tracking, you can enable feature matching between cameras and add those constraints to the pose graph. If the pose quality is good and you want to enhance the reconstruction, you can add rays from the other cameras, with their corresponding poses, during neural field training.
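
To make 3) and 4) concrete, here is a minimal sketch (my own illustration, not the repo's actual code) of the initial-pose convention and of moving a ray from a second camera into the main camera's frame. It assumes depth is in meters and T_main_from_other is a known 4x4 extrinsic:

    import numpy as np

    def initial_object_pose(depth, mask, K):
        # Convention from 3): origin at the centroid of the first frame's
        # masked point cloud, rotation identity (parallel to the camera).
        # depth is assumed to be in meters; K is the 3x3 intrinsic matrix.
        vs, us = np.where((mask > 0) & (depth > 0))
        zs = depth[vs, us]
        xs = (us - K[0, 2]) * zs / K[0, 0]
        ys = (vs - K[1, 2]) * zs / K[1, 1]
        pts = np.stack([xs, ys, zs], axis=1)   # Nx3 points in camera frame
        pose = np.eye(4)                       # object-to-camera transform
        pose[:3, 3] = pts.mean(axis=0)         # translation = cloud center
        return pose

    def ray_in_main_frame(origin, direction, T_main_from_other):
        # For 4): with known relative poses, rays cast from another camera
        # can be expressed in the main frame before neural field training.
        R, t = T_main_from_other[:3, :3], T_main_from_other[:3, 3]
        return R @ origin + t, R @ direction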

adiser commented 1 year ago

Got it.

Can I get your insights on the required quality of the first frame's segmentation mask? As I understand it from the paper, it is used to "seed" the foreground object, which is then predicted in the next keyframes by the transformer-based correspondence network (among other things). Correct me if I'm wrong, but I don't see any mechanism that acts as a quality gate (or equivalent) to handle a poor segmentation mask at the beginning. Or is that handled implicitly by the overall process of Neural Object Field training and online PGO?

adiser commented 1 year ago

Furthermore, upon digging into the README and the code:

In the README:

root
  ├──rgb/    (PNG files)
  ├──depth/  (PNG files, stored in mm, uint16 format. Filename same as rgb)
  ├──masks/       (PNG files. Filename same as rgb. 0 is background. Else is foreground)
  └──cam_K.txt   (3x3 intrinsic matrix, use space and enter to delimit)
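
A minimal loader for this layout (a sketch based on the conventions stated above, not code from the repo; root and name are hypothetical arguments) might look like:

    import cv2
    import numpy as np

    def load_frame(root, name):
        # Sketch of a loader for the documented layout,
        # e.g. root='mydata', name='000000'.
        rgb = cv2.imread(f"{root}/rgb/{name}.png")
        # Depth PNGs are uint16 in millimeters; convert to meters.
        depth = cv2.imread(f"{root}/depth/{name}.png", cv2.IMREAD_UNCHANGED)
        depth = depth.astype(np.float32) / 1000.0
        # Masks: 0 is background, anything else is foreground.
        mask = cv2.imread(f"{root}/masks/{name}.png", cv2.IMREAD_GRAYSCALE) > 0
        # cam_K.txt is a 3x3 intrinsic matrix delimited by spaces/newlines.
        K = np.loadtxt(f"{root}/cam_K.txt").reshape(3, 3)
        return rgb, depth, mask, K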

And in run_one_video:

    if i==0:
      # First frame: the user-provided mask seeds the foreground object.
      mask = reader.get_mask(0)
      mask = cv2.resize(mask, (W,H), interpolation=cv2.INTER_NEAREST)
      if use_segmenter:
        mask = segmenter.run(color_file.replace('rgb','masks'))
    else:
      # Later frames: either the segmenter produces the mask online,
      # or precomputed per-frame masks are loaded from disk.
      if use_segmenter:
        mask = segmenter.run(color_file.replace('rgb','masks'))
      else:
        mask = reader.get_mask(i)
        mask = cv2.resize(mask, (W,H), interpolation=cv2.INTER_NEAREST)

In this case, if we don't enable use_segmenter, the script requires a segmentation mask from the user for every frame. However, the abstract says that "the object is assumed to be segmented in the first frame only." So is it still necessary to run the segmenter?

wenbowen123 commented 1 year ago

Back then, Segment Anything did not exist yet, and we used semi-automatic tools such as GrabCut to get the initial mask. Now with SAM you can get a decent mask, whose quality should usually be more than enough for BundleSDF. There is no mechanism for correcting the segmentations, but BundleSDF is fairly robust to noisy segmentations.

use_segmenter is mostly for convenience. If you have precomputed masks, leaving it off lets you reuse them by loading from disk instead of recomputing them on every run.
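
For example, generating a first-frame mask with SAM could look roughly like this (a sketch using the segment-anything package; the checkpoint path and the click point are placeholders, not something from this repo):

    import cv2
    import numpy as np
    from segment_anything import SamPredictor, sam_model_registry

    # Load a SAM checkpoint (path is a placeholder) and wrap it in a predictor.
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    predictor = SamPredictor(sam)

    # SAM expects an RGB image.
    image = cv2.cvtColor(cv2.imread("rgb/000000.png"), cv2.COLOR_BGR2RGB)
    predictor.set_image(image)

    # One positive click on the object is often enough for a decent mask.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[320, 240]]),  # placeholder click (x, y)
        point_labels=np.array([1]),           # 1 = foreground point
        multimask_output=True,
    )
    best = masks[np.argmax(scores)]           # keep the highest-scoring mask

    # Save in the format BundleSDF expects: 0 background, nonzero foreground.
    cv2.imwrite("masks/000000.png", best.astype(np.uint8) * 255)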

adiser commented 1 year ago

Thank you!

monajalal commented 11 months ago

Dear @wenbowen123

Do you have any quantitative or qualitative results showing whether SAM outperforms XMem for your work? I've been using XMem so far to prepare masks for BundleSDF, and I wonder whether you would suggest SAM for better 3D bbox and mesh recovery?