NVlabs / latentfusion

LatentFusion: End-to-End Differentiable Reconstruction and Rendering for Unseen Object Pose Estimation
https://arxiv.org/pdf/1912.00416.pdf

How to apply to new unseen objects? #8

Closed · mikkeljakobsen closed this issue 3 years ago

mikkeljakobsen commented 4 years ago

Hi,

Thanks for this amazing work!

I'm currently looking into template-based pose estimation using an RGB-D camera and would like to try this out on my own data. As far as I understand, I only need to provide a small reference sequence of images of the object of interest, including the poses, and then I should be able to use the pre-trained network to do pose estimation for that object.

I've looked into the example notebook, but I can't figure out how to proceed. Can you help me in the right direction?

Best regards, Mikkel

keunhong commented 3 years ago

Hi Mikkel,

Which part are you stuck on? To register your own input images you can use Open3D.

Thanks, Keunhong

mikkeljakobsen commented 3 years ago

Hi Keunhong,

I'm stuck on the first part: producing the reference data for the object that I want to detect. I've captured a color+depth sequence of the object, and now I want to produce something similar to the "reference" folders in the MOPED dataset. I will try to see if I can use Open3D for this. Do I need to produce ground-truth masks as well, or are they only used for evaluation?

Best regards, Mikkel

keunhong commented 3 years ago

Hi Mikkel,

Once you have installed Open3D, you can process a dataset like this:

python "$OPEN3D_DIR/examples/Python/ReconstructionSystem/run_system.py" \
  "$RESOURCE_PATH/config.json" --make --register --refine --integrate

The config.json we used is here: https://github.com/NVlabs/latentfusion/blob/master/resources/open3d_config.json

Also here's a script to capture a video from a RealSense camera: https://gist.github.com/keunhong/769cbc2dbf88f3e02d96378181673294
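
Roughly, a minimal pyrealsense2 capture loop for this looks like the sketch below (stream settings, frame count, and output paths are placeholders; the gist above is the script we actually used):

import os
import numpy as np
import cv2
import pyrealsense2 as rs

os.makedirs('color', exist_ok=True)
os.makedirs('depth', exist_ok=True)

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
pipeline.start(config)
align = rs.align(rs.stream.color)  # align depth to the color frame

try:
    for i in range(300):  # roughly 10 seconds at 30 fps
        frames = align.process(pipeline.wait_for_frames())
        depth_frame = frames.get_depth_frame()
        color_frame = frames.get_color_frame()
        if not depth_frame or not color_frame:
            continue
        depth = np.asanyarray(depth_frame.get_data())  # uint16, depth units (1 mm by default)
        color = np.asanyarray(color_frame.get_data())  # uint8 BGR
        cv2.imwrite(f'depth/{i:06d}.png', depth)
        cv2.imwrite(f'color/{i:06d}.jpg', color)
finally:
    pipeline.stop()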

You will need masks as input for inference (they are an input to the reconstruction network). They're pretty easy to get: we used https://github.com/chrisdxie/uois, and simple plane fitting also usually works well if there's no clutter.
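
If you go the plane-fitting route, a rough sketch using Open3D's segment_plane might look like this (the intrinsics, paths, and distance threshold are placeholders and need tuning to your setup):

import cv2
import numpy as np
import open3d as o3d

# Back-project every depth pixel so that point indices map back to pixels.
depth = cv2.imread('depth/000000.png', cv2.IMREAD_UNCHANGED).astype(np.float32) / 1000.0
fx, fy, cx, cy = 615.0, 615.0, 320.0, 240.0  # placeholder intrinsics
h, w = depth.shape
u, v = np.meshgrid(np.arange(w), np.arange(h))
points = np.stack([(u - cx) * depth / fx, (v - cy) * depth / fy, depth], -1).reshape(-1, 3)

# RANSAC fit of the dominant plane (assumed to be the support surface) on valid points only.
valid = depth.reshape(-1) > 0
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(points[valid])
_, inliers = pcd.segment_plane(distance_threshold=0.01, ransac_n=3, num_iterations=1000)

# Object mask = pixels with valid depth that are not part of the plane.
valid_idx = np.flatnonzero(valid)
mask = np.zeros(h * w, dtype=bool)
mask[valid_idx] = True
mask[valid_idx[inliers]] = False
cv2.imwrite('mask/000000.png', mask.reshape(h, w).astype(np.uint8) * 255)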

mikkeljakobsen commented 3 years ago

Hi Keunhong,

Thanks for the RealSense capturing script - that was really helpful. I managed to capture a few RGBD+mask sequences of my object and then use Open3D to produce the .ply files. But how do I stitch them together to produce something like the "integrated_registered_processed.obj" which is used in the example notebook?

Best regards, Mikkel

keunhong commented 3 years ago

You can register them using Open3D.

Here is the relevant page: http://open3d.org/html/tutorial/Advanced/global_registration.html?highlight=registration

You first do global registration and then refinement.

If this doesn't work then you can do manual alignment by clicking on points. Here's a script we used (again, no guarantees on the quality): https://gist.github.com/keunhong/bdacd7a034bfec88852284c17a0157db
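
In rough terms, the two-step pipeline for a pair of scans looks something like the sketch below (voxel size and thresholds are placeholders, and older Open3D releases use o3d.registration instead of o3d.pipelines.registration):

import open3d as o3d

voxel_size = 0.005  # placeholder; roughly match the reconstruction resolution

def preprocess(path):
    # Downsample and compute normals + FPFH features for global registration.
    pcd = o3d.io.read_point_cloud(path)
    down = pcd.voxel_down_sample(voxel_size)
    down.estimate_normals(o3d.geometry.KDTreeSearchParamHybrid(radius=voxel_size * 2, max_nn=30))
    fpfh = o3d.pipelines.registration.compute_fpfh_feature(
        down, o3d.geometry.KDTreeSearchParamHybrid(radius=voxel_size * 5, max_nn=100))
    return pcd, down, fpfh

source, source_down, source_fpfh = preprocess('scan_front.ply')  # placeholder paths
target, target_down, target_fpfh = preprocess('scan_back.ply')

# Global registration with RANSAC on FPFH correspondences.
dist = voxel_size * 1.5
ransac = o3d.pipelines.registration.registration_ransac_based_on_feature_matching(
    source_down, target_down, source_fpfh, target_fpfh, True, dist,
    o3d.pipelines.registration.TransformationEstimationPointToPoint(False), 3,
    [o3d.pipelines.registration.CorrespondenceCheckerBasedOnEdgeLength(0.9),
     o3d.pipelines.registration.CorrespondenceCheckerBasedOnDistance(dist)],
    o3d.pipelines.registration.RANSACConvergenceCriteria(100000, 0.999))

# Refine with point-to-plane ICP on the full-resolution clouds.
source.estimate_normals(o3d.geometry.KDTreeSearchParamHybrid(radius=voxel_size * 2, max_nn=30))
target.estimate_normals(o3d.geometry.KDTreeSearchParamHybrid(radius=voxel_size * 2, max_nn=30))
icp = o3d.pipelines.registration.registration_icp(
    source, target, voxel_size * 0.4, ransac.transformation,
    o3d.pipelines.registration.TransformationEstimationPointToPlane())

# Apply the refined transform and save the merged cloud.
o3d.io.write_point_cloud('registered.ply', source.transform(icp.transformation) + target)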

mikkeljakobsen commented 3 years ago

I still haven't managed to complete the last step. Can you help?

Also, are you able to share some of the scripts in the tools section? In particular, I'm interested in the "generate_realsense_masks.py" called by "tools/dataset/process_realsense_scan.sh". The masks I'm getting with UOIS are not great, so it would be interesting to see how you are doing it.

Best regards, Mikkel

mikkeljakobsen commented 3 years ago

I finally managed to get it working. I captured a small reference dataset of a yellow box and got the masks using simple HSV thresholding (I couldn't get UOIS to work properly). Then I captured some target images to test the network.
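
The thresholding was something along these lines (a sketch; the bounds and paths below are illustrative rather than the exact values I used):

import cv2
import numpy as np

color = cv2.imread('color/000028.jpg')  # BGR image from the capture
hsv = cv2.cvtColor(color, cv2.COLOR_BGR2HSV)

# Threshold on hue/saturation/value; these bounds are illustrative for a yellow object.
lower = np.array([20, 100, 100])
upper = np.array([35, 255, 255])
mask = cv2.inRange(hsv, lower, upper)

# Clean up speckle, fill small holes, and keep the largest connected component.
kernel = np.ones((5, 5), np.uint8)
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
num, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
if num > 1:
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
    mask = (labels == largest).astype(np.uint8) * 255

cv2.imwrite('mask/000028.png', mask)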

This is the visualization output from the example notebook after fine pose estimation:

[image: latent_fusion_detection]

This is the code I used to load the target observation:

import json
from pathlib import Path

import numpy as np
import torch
from PIL import Image

# Project imports (module paths assumed to match the example notebook).
from latentfusion import three
from latentfusion.observation import Observation

idx = 28
path = Path('/home/mikkel/data/yellow_box_test_seq/04')
intrinsics_file = path / 'instrinsics.json'
mask_dir = path / 'mask'
if not mask_dir.exists():
    raise ValueError(f"Mask directory {mask_dir!s} does not exist.")

# Only frames that have a mask are used.
mask_paths = sorted(mask_dir.glob('*.png'))
valid_ids = [int(p.stem) for p in mask_paths]
depth_paths = [path / 'depth' / p.name for p in mask_paths]
color_paths = [path / 'color' / p.with_suffix('.jpg').name for p in mask_paths]

# The stored intrinsic matrix appears to be column-major (as in Open3D's JSON format), hence the transpose.
with open(intrinsics_file, 'r') as f:
    intrinsics_json = json.load(f)
    intrinsic = three.intrinsic_to_3x4(
        torch.tensor(intrinsics_json['intrinsic_matrix']).reshape(3, 3).t()).float()

# Color: HWC uint8 -> CHW float in [0, 1].
color = Image.open(color_paths[idx])
color = np.array(color)
color = (torch.tensor(color).float() / 255.0).permute(2, 0, 1)

# Mask: take the first channel if the PNG has more than one.
mask = Image.open(mask_paths[idx])
mask = np.array(mask, dtype=bool)  # np.bool is deprecated; plain bool behaves the same
if mask.ndim > 2:
    mask = mask[:, :, 0]
mask = torch.tensor(mask).bool()

# Depth: 16-bit PNG in millimeters -> meters.
depth = Image.open(depth_paths[idx])
depth = np.array(depth, dtype=np.float32) / 1000
depth = torch.tensor(depth)

target_observation = {
    'color': color,
    'mask': mask,
    'depth': depth,
    'extrinsic': torch.eye(4),
    'intrinsic': intrinsic,
}
target_obs = Observation.from_dict(target_observation)

In general the performance is not good. But it might be that I'm doing something wrong.

keunhong commented 3 years ago

Hi Mikkel,

Sorry for not responding sooner--I've been busy with the CVPR deadline.

Concave, untextured objects like that are tricky, especially with only RGB input, since there aren't many cues for the network to infer depth from.

If you do have depth for the capture, you can try training the network with depth as input. We tried this a while ago and it is able to resolve the ambiguity for these kinds of objects. It should be relatively easy to modify the training script to do this.

The pose estimation results will also only be as good as the reconstruction, so I'd double check the camera calibration etc and make sure everything lines up.
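
One quick sanity check (a sketch with placeholder paths and intrinsics, assuming an Open3D-style world-to-camera extrinsic for the frame) is to back-project one depth frame with its pose and overlay it on the reconstructed model; the two should line up:

import numpy as np
import open3d as o3d

# Reconstructed object model and one captured depth frame.
mesh = o3d.io.read_triangle_mesh('integrated_registered_processed.obj')
depth = o3d.io.read_image('depth/000028.png')

# Placeholder intrinsics; use the values from your intrinsics file.
intrinsic = o3d.camera.PinholeCameraIntrinsic(640, 480, 615.0, 615.0, 320.0, 240.0)
extrinsic = np.eye(4)  # placeholder: the world-to-camera pose of this frame from the trajectory

pcd = o3d.geometry.PointCloud.create_from_depth_image(
    depth, intrinsic, extrinsic, depth_scale=1000.0, depth_trunc=2.0)

# If calibration and poses are consistent, the point cloud should hug the mesh surface.
o3d.visualization.draw_geometries([mesh, pcd])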

mikkeljakobsen commented 3 years ago

Thanks for your answer. I don't think I have time (and GPU power) to train the network myself. If you still have it available, can you share the weights you got from training the network with depth?

Otherwise I think I'll close this issue, since my original question got answered. Thanks a lot for your time, and good luck with your CVPR submission! :)