DeformableFriends / NeuralTracking

Official implementation for the NeurIPS 2020 paper Neural Non-Rigid Tracking.
https://www.niessnerlab.org/projects/bozic2020nnrt.html
MIT License

How to perform depth integration with the predicted deformation #2

Open BaldrLector opened 3 years ago

BaldrLector commented 3 years ago

Hi @pablorpalafox, @AljazBozic, thanks for kindly sharing this; I have learned so much from the source code, and I love it!

I have some questions:

  1. It seems that depth integration is not implemented in this repository. Could you shed some light on how to perform depth integration, as shown in the GIF? Is there any code for reference?

  2. I plan to perform non-rigid reconstruction from RGB-D images with this method. I can currently get the RGB and depth images; how should I prepare the other needed data (the graph nodes, edges, graph_edges_weights, graph_clusters, pixel_anchors and pixel_weights)?

Thank you in advance; your reply will be very much appreciated!

pablopalafox commented 3 years ago

Hi @BaldrLector, thanks for your interest in the project!

Regarding 1., unfortunately we didn't have the time to properly clean up the part of the code for depth integration. But it's basically an implementation of DynamicFusion.

As to how to get the other needed data, you can check out the Data section in the readme. You should first download DeepDeform by following the instructions in that repo. Then use the link we provide in the readme to get the additional data. You should then merge both datasets, such that within a given sequence folder (see Data Organization of DeepDeform) you have all the data from DeepDeform plus the graph nodes, edges, etc.

Hopefully this clarifies it a bit more. Thanks for pointing it out.

BaldrLector commented 3 years ago

> Hi @BaldrLector, thanks for your interest in the project!
>
> Regarding 1., unfortunately we didn't have the time to properly clean up the part of the code for depth integration. But it's basically an implementation of DynamicFusion.
>
> As to how to get the other needed data, you can check out the Data section in the readme. You should first download DeepDeform by following the instructions in that repo. Then use the link we provide in the readme to get the additional data. You should then merge both datasets, such that within a given sequence folder (see Data Organization of DeepDeform) you have all the data from DeepDeform plus the graph nodes, edges, etc.
>
> Hopefully this clarifies it a bit more. Thanks for pointing it out.

  1. Got it, thanks; your reply is very kind and I appreciate it. I will try to implement DynamicFusion with the predicted deformation. By the way, do you plan to release this part of the code in the future?

  2. Sorry for the confusing description. I mean: if I want to use new data (captured by myself with an Azure Kinect, so I only have RGB and depth images), what should I do to get the graph nodes, edges, graph_edges_weights, graph_clusters, pixel_anchors and pixel_weights?

pablopalafox commented 3 years ago

Oh, got you. We are planning to clean up the code for generating this part of the dataset. It wasn't a priority, since we also released the corresponding graph data for DeepDeform, but we'll keep it in mind now that you mentioned it. Essentially you need to sample nodes in the source depth map. Then compute your graph edges using geodesic connectivity and find your anchor nodes for each pixel, as well as their weights. We'll try to release the scripts for this relatively soon. Thanks for letting us know.
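
For the node-sampling step, something along these lines is the general idea (a rough sketch, not our actual script; the greedy strategy and the `node_coverage` value are just placeholders):

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    # Back-project a depth map (in meters) to a 3D point cloud in camera space.
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1)
    return points[depth > 0]

def sample_nodes(points, node_coverage=0.05):
    # Greedily keep a point as a graph node only if it is farther than
    # `node_coverage` from every node kept so far (subsample `points` first
    # in practice, this loop is slow on full-resolution clouds).
    nodes = []
    for p in points:
        if not nodes or np.linalg.norm(np.asarray(nodes) - p, axis=1).min() > node_coverage:
            nodes.append(p)
    return np.asarray(nodes)
```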

BaldrLector commented 3 years ago

> Oh, got you. We are planning to clean up the code for generating this part of the dataset. It wasn't a priority, since we also released the corresponding graph data for DeepDeform, but we'll keep it in mind now that you mentioned it. Essentially you need to sample nodes in the source depth map. Then compute your graph edges using geodesic connectivity and find your anchor nodes for each pixel, as well as their weights. We'll try to release the scripts for this relatively soon. Thanks for letting us know.

Got it, thanks for your prompt reply! I will close this issue now, and if there are any updates I will post them under this issue.

BaldrLector commented 3 years ago

@pablorpalafox Hi, it's me again. After reading the referenced paper, I am writing code to reconstruct meshes with the predicted deformation using a projective TSDF. If I understand correctly, the core steps are:
  1. Build the canonical TSDF volume (weight, TSDF value, and color) and record each voxel center coordinate.
  2. Initialize the TSDF with the first RGB-D frame (denoted as the source in this repo) using the camera intrinsics (fx, fy, cx, cy).
  3. For all subsequent RGB-D frames, warp each voxel center into the live frame with the predicted rotations and translations, compute the projective TSDF value, and update the canonical volume.

I have finished steps 1) and 2), but I am facing a problem at step 3):

How do I rotate and translate each voxel center with the predicted deformation?

The repo provides the "warp_deform_3d" function, which uses pixel_anchors, pixel_weights, node_positions, node_rotations and node_translations. We already have node_positions, node_rotations, and node_translations, but how do I compute pixel_anchors and pixel_weights for each voxel?
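
My current guess is to reuse the per-pixel anchor logic for the voxel centers, roughly like this (just a sketch; `k`, `node_coverage`, the Gaussian weighting, and all names are my own assumptions, not taken from the repo):

```python
import numpy as np

def compute_voxel_anchors_and_weights(voxel_centers, node_positions,
                                      k=4, node_coverage=0.05):
    # For each voxel center, pick its k nearest graph nodes as anchors and
    # derive skinning weights from the distances (Gaussian falloff, normalized).
    dists = np.linalg.norm(
        voxel_centers[:, None, :] - node_positions[None, :, :], axis=-1)   # (V, N)
    anchors = np.argsort(dists, axis=1)[:, :k]                             # (V, k)
    anchor_dists = np.take_along_axis(dists, anchors, axis=1)
    weights = np.exp(-anchor_dists**2 / (2.0 * node_coverage**2))
    weights /= np.clip(weights.sum(axis=1, keepdims=True), 1e-8, None)
    return anchors, weights

def warp_voxel_centers(voxel_centers, anchors, weights,
                       node_positions, node_rotations, node_translations):
    # Embedded-deformation warp: x' = sum_i w_i * (R_i (x - g_i) + g_i + t_i).
    g = node_positions[anchors]             # (V, k, 3)
    R = node_rotations[anchors]             # (V, k, 3, 3)
    t = node_translations[anchors]          # (V, k, 3)
    local = voxel_centers[:, None, :] - g   # (V, k, 3)
    rotated = np.einsum('vkij,vkj->vki', R, local)
    return (weights[..., None] * (rotated + g + t)).sum(axis=1)
```

If this is roughly what warp_deform_3d expects, I would then feed the warped centers into the projective TSDF update.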

Algomorph commented 3 years ago

@pablorpalafox, I'm also interested in the DynamicFusion portion of the code, whenever you can get to it. I'll add a +1 here so others can simply upvote in the future.

pablopalafox commented 3 years ago

Hi @BaldrLector and @Algomorph!

We're now working towards an upcoming deadline, but we'll definitely let you guys know if we eventually upload the reimplementation of DynamicFusion. @BaldrLector, your pipeline makes sense; the problem is that the warp_deform_3d function you mention is not exactly the one used for the final reconstruction, i.e., the DynamicFusion part is in a separate repo. Again, sorry for the inconvenience of not releasing the whole thing, but we wanted to have a clean repo with the tracking part only, since in any case it was the main contribution of the paper. Anyway, we'll let you know :)

BaldrLector commented 3 years ago

@pablorpalafox thanks for your reply. I will keep working on the reconstruction part, and I am still looking forward to your updates. By the way, have a nice new year :)

shubhMaheshwari commented 3 years ago

Hi @pablorpalafox, @AljazBozic, this is groundbreaking work with so many applications; thanks for the source code. I am also trying to implement 3D non-rigid reconstruction. Your approach of using multiple keyframes for reconstruction is very promising for future work.

I am confused about the integration of the various keyframes (every 50 frames in the sequence). Previous methods take an initial frame (generally the frame at time step t-1) and then keep updating the canonical pose using only the RGB-D image at the next timestep. Specifically:

  1. For a particular timestep, is the warp field estimated between the estimated canonical pose (extracting an RGB-D image via rasterization) and every keyframe, and are all these warp fields then somehow averaged?
  2. Do you update the deformation graph at each timestep, as in DynamicFusion?
  3. Or do you update the canonical pose only within each 50-frame segment, e.g. 50-51, 50-52, ..., 50-100, and then merge all keyframe poses to get the final canonical pose?

@BaldrLector, if you can provide some insight, that would be very helpful.

Thank you everyone.

Algomorph commented 3 years ago

@shubhMaheshwari and @BaldrLector , it seems like all three of us are working on the same thing: trying to integrate the neural tracking code from the current repository into a complete DynamicFusion pipeline.

Would either of you like to pool our efforts together?

What I have, at the time of this post:

1. Open3D-based TSDF (dynamic VRAM allocation, voxel hashing) -- this is really just the stock Open3D implementation, but it's very good. I'm able to build it from source and have made contributions to Open3D, so I think I would be able to modify it for the fusion step (rough sketch below).
2. Isosurface extraction using marching cubes from the above (also not my implementation).
3. A way to deform the extracted mesh to frame t-1 using dual-quaternion blending (through a small C++ library with a Python port; I believe it's single-threaded).
4. A way to render the mesh to a depth image (generating input for NeuralTracking) using PyTorch3D.
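
For reference, the stock Open3D pieces for (1) and (2) look roughly like this as a minimal, rigid-only sketch. I'm using the legacy CPU API here for brevity (the voxel-hashing, GPU-backed volume lives in Open3D's tensor API, whose interface differs), and the intrinsics and file paths below are placeholders. The non-rigid warp is exactly the part that still has to be injected into the integration step:

```python
import numpy as np
import open3d as o3d

# Legacy (CPU) Open3D TSDF volume.
volume = o3d.pipelines.integration.ScalableTSDFVolume(
    voxel_length=0.005,   # 5 mm voxels
    sdf_trunc=0.025,      # 2.5 cm truncation
    color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)

# Placeholder intrinsics and frame paths.
intrinsics = o3d.camera.PinholeCameraIntrinsic(640, 480, 570.0, 570.0, 320.0, 240.0)
color = o3d.io.read_image("color_000000.jpg")
depth = o3d.io.read_image("depth_000000.png")
rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
    color, depth, depth_scale=1000.0, depth_trunc=3.0,
    convert_rgb_to_intensity=False)

volume.integrate(rgbd, intrinsics, np.eye(4))   # extrinsics = identity
mesh = volume.extract_triangle_mesh()           # (2) marching cubes
mesh.compute_vertex_normals()
```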

What is still missing:

1. Integrating all of this together.
2. A way to fuse frame-t data into the frame-0 TSDF.
3. Topological updates to the motion graph, like in DynamicFusion.

If you'd like to collaborate, please reply here. I'll email you at the respective addresses in your GitHub profiles, and we can continue the discussion.

shubhMaheshwari commented 3 years ago

@Algomorph definitely that's a great idea.

BaldrLector commented 3 years ago

Hi @shubhMaheshwari, @Algomorph, I am working on the reconstruction part. I am following the pipeline of Fusion4D, which produces very high-fidelity reconstructed meshes.

So far I have finished the following (all code is written in Python; the CUDA parts use numba.cuda):

  1. Initialize the TSDF from a depth image (CUDA).
  2. Integrate depth into the TSDF volume with the predicted deformation (CUDA; rough sketch after this list).
  3. Compute anchors and weights for voxels (CUDA).
  4. Volume deformation (CUDA).
  5. Key volume and reference volume blending (CUDA; runnable, but there is a bug that leads to a fatter mesh).
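
For reference, the core of my integration kernel for step 2 looks roughly like this (simplified; the names are my own, and the warped voxel centers come from the anchor/weight computation in step 3):

```python
from numba import cuda

@cuda.jit
def integrate_warped_voxels(tsdf, weights, warped_centers, depth,
                            fx, fy, cx, cy, trunc):
    # One thread per voxel; `warped_centers` are the voxel centers already
    # deformed into the live camera frame.
    i = cuda.grid(1)
    if i >= warped_centers.shape[0]:
        return
    x = warped_centers[i, 0]
    y = warped_centers[i, 1]
    z = warped_centers[i, 2]
    if z <= 0.0:
        return
    u = int(fx * x / z + cx + 0.5)
    v = int(fy * y / z + cy + 0.5)
    if u < 0 or u >= depth.shape[1] or v < 0 or v >= depth.shape[0]:
        return
    d = depth[v, u]
    if d <= 0.0:
        return
    psdf = d - z                      # projective signed distance
    if psdf < -trunc:                 # voxel far behind the surface: skip
        return
    value = min(1.0, psdf / trunc)    # truncate
    w_old = weights[i]
    tsdf[i] = (tsdf[i] * w_old + value) / (w_old + 1.0)
    weights[i] = w_old + 1.0
```

I launch it with one thread per voxel, e.g. `integrate_warped_voxels[num_blocks, 256](...)`.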

I have almost finished the reconstruction part, but there are some issues to be fixed:

  1. I do not use voxel hashing, so the volume size is very limited (e.g. 256 x 256 x 256).
  2. There is a bug in the volume blending part, which makes the integrated mesh fatter.
  3. I have not written topological updates; for now I just re-sample graph nodes from the reconstructed mesh. I will add this part in the future.

I like the idea of collaborating; please contact me. :)

shubhMaheshwari commented 3 years ago

Hey @pablopalafox and @AljazBozic, I need some help with the keyframe integration.

  1. During non-rigid registration, for some timestep T, neural tracking will take the last keyframe, floor(T/50), as the source frame, and the target frame would be T. Is this correct?
  2. Unlike @BaldrLector, I am updating the graph nodes similarly to DynamicFusion. My second question is: how should I update the graph nodes so that they correspond to a particular keyframe? Should I just translate the graph nodes using the node translations calculated by neural tracking between the previous keyframe and the current keyframe?
  3. Also, how should I update the pixel anchors and pixel weights for the new keyframe, since we might not necessarily have a segmentation mask for that keyframe?

Algomorph commented 3 years ago

@pablopalafox and @AljazBozic, do you guys regenerate the graph at every frame or is the graph persistent throughout the sequence processing (with, potentially, topological updates such as in DynamicFusion that @shubhMaheshwari mentions above)?

Also, what voxel size and TSDF truncation distance do you use?

AljazBozic commented 3 years ago

Hi @Algomorph,

The deformation graph is persistent throughout the sequence; the only changes are the addition of new nodes, or marking existing nodes inactive (basically removing them). But the canonical positions of the nodes are not updated, because when we tried re-generating the graph every frame, that often led to tracking drift. The graph edges are updated at every frame though, using the connectivity of the mesh.
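
To make "using the connectivity of the mesh" a bit more concrete, the rough idea is the following (a simplified sketch, not our actual implementation): anchor every node at a mesh vertex and connect each node to its nearest nodes in geodesic (shortest-path) distance along the mesh.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import dijkstra

def compute_graph_edges(vertices, faces, node_vertex_ids, max_neighbors=8):
    # Undirected mesh-edge graph weighted by Euclidean edge length.
    edges = np.concatenate([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]])
    edges = np.unique(np.sort(edges, axis=1), axis=0)
    w = np.linalg.norm(vertices[edges[:, 0]] - vertices[edges[:, 1]], axis=1)
    n = len(vertices)
    adj = coo_matrix((np.concatenate([w, w]),
                      (np.concatenate([edges[:, 0], edges[:, 1]]),
                       np.concatenate([edges[:, 1], edges[:, 0]]))),
                     shape=(n, n)).tocsr()
    # Geodesic (shortest-path) distances from each node's anchor vertex.
    dist_to_vertices = dijkstra(adj, indices=node_vertex_ids)     # (num_nodes, n)
    node_dists = dist_to_vertices[:, node_vertex_ids]             # (num_nodes, num_nodes)
    np.fill_diagonal(node_dists, np.inf)                          # no self-edges
    # Each node keeps its geodesically nearest nodes as edges.
    return np.argsort(node_dists, axis=1)[:, :max_neighbors]
```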

For depth integration we used voxel size of 0.5 cm and truncation of 2.5 cm.

Let me know if you have more questions :)

Algomorph commented 3 years ago

Thanks, @AljazBozic! I believe @shubhMaheshwari has some questions above that are also relevant to the rest of us. I think in (1) above, he's asking whether, for every 50-frame "keyframe" segment of the sequence, you estimate the tracking between:

a. [frame 0] and [frame t], or
b. [frame t-1] and [frame t]?

It is my understanding that DynamicFusion-style algorithms (including Fusion4D) follow schema (b) above, whereas there are some others (SobolevFusion/KillingFusion) that follow schema (a). Which schema do you use?

shubhMaheshwari commented 3 years ago

Also, can you elaborate on the "marking existing nodes inactive (basically removing them)" part?
The only issue I am facing is with the graph update.

For example, I am able to register simple movements, such as a shirt (val/seq009):

https://user-images.githubusercontent.com/22934809/125134186-a3510400-e124-11eb-97b4-da8e9703a9f6.mp4

But with a complex example such as a human (train/seq071), where the topology changes, the graph update is generally wrong. Look at the hands in the example below.

https://user-images.githubusercontent.com/22934809/125134556-2bcfa480-e125-11eb-8655-b23a5cc21ca3.mp4

This is happening with or without a keyframe update.

My steps for adding graph nodes are similar to DynamicFusion and create_graph_using_depth.py:

  1. After running marching cubes on the TSDF, find the vertices of the obtained canonical model that lie outside the node coverage.
  2. Sample new nodes from these vertices such that their minimum pairwise distance equals the node coverage (rough sketch after this list).
  3. Discard nodes with fewer than 2 neighbors.
  4. Compute graph edges and their weights for the new nodes. Also, compute graph clusters.
  5. Recompute pixel anchors and weights for the source frame.
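
For steps 1-3, my current code is essentially the following (a rough sketch; the `node_coverage` value and the distance-based neighbor criterion in step 3 are parameters I chose myself):

```python
import numpy as np

def update_graph_nodes(vertices, nodes, node_coverage=0.05, min_neighbors=2):
    # Step 1: canonical-mesh vertices not covered by any existing node.
    dists = np.linalg.norm(vertices[:, None, :] - nodes[None, :, :], axis=-1)
    uncovered = vertices[dists.min(axis=1) > node_coverage]

    # Step 2: greedily sample new nodes among the uncovered vertices so that
    # every node stays at least `node_coverage` away from all others.
    all_nodes = [n for n in nodes]
    for v in uncovered:
        if np.linalg.norm(np.asarray(all_nodes) - v, axis=1).min() > node_coverage:
            all_nodes.append(v)
    all_nodes = np.asarray(all_nodes)

    # Step 3: drop nodes with too few nearby nodes (distance-based stand-in
    # for the "minimum neighbors" criterion).
    nn = np.linalg.norm(all_nodes[:, None, :] - all_nodes[None, :, :], axis=-1)
    neighbor_count = (nn < 2.0 * node_coverage).sum(axis=1) - 1   # exclude self
    return all_nodes[neighbor_count >= min_neighbors]
```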

@AljazBozic @pablopalafox thank you for your time. This is the only problem I am facing; once I have fixed it, I think I will have completed your non-rigid registration pipeline. Kindly help me with this step. I could also create a pull request, or zip and email the code, if that helps.

Algomorph commented 3 years ago

Bump.

Algomorph commented 2 years ago

> Thanks, @AljazBozic! I believe @shubhMaheshwari has some questions above that are also relevant to the rest of us. I think in (1) above, he's asking whether, for every 50-frame "keyframe" segment of the sequence, you estimate the tracking between:
>
> a. [frame 0] and [frame t], or
> b. [frame t-1] and [frame t]?
>
> It is my understanding that DynamicFusion-style algorithms (including Fusion4D) follow schema (b) above, whereas there are some others (SobolevFusion/KillingFusion) that follow schema (a). Which schema do you use?

I'm going to reply to my own question here, almost a year later. The answer can be obtained from the NNRT article, section 4.5 and appendix D.3. The correct answer is neither (a) nor (b). The flow network + differentiable optimization mechanism is trained on matching (very) sparse sequence frame pairs that are ~50 frames apart, so, naturally, the algorithm really only works well on these intervals. Hence, so-called "keyframes" are sampled from the sequence every 50 frames, and the tracking is performed between the last-encountered keyframe and the current frame. This requires a separate motion graph to be used for each keyframe, which is initialized with the cumulative motion graph with the transformations applied to the nodes (translations added to node positions, rotation matrices reset to identity).

In order to integrate the current frame into the canonical volume (or "reference" volume at frame 0), the motion from the latest "keyframe" motion graph is combined with the cumulative motion graph (translations added, rotation matrices multiplied).
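
To spell out why that composition works (assuming, as described above, that the keyframe graph nodes sit at the cumulatively warped positions): if the cumulative warp of a point $x$ anchored at node $g$ is

$$\mathcal{W}_{\mathrm{cum}}(x) = R_1 (x - g) + g + t_1,$$

and the keyframe warp is anchored at the warped node $g' = g + t_1$,

$$\mathcal{W}_{\mathrm{key}}(x) = R_2 (x - g') + g' + t_2,$$

then

$$\mathcal{W}_{\mathrm{key}}\big(\mathcal{W}_{\mathrm{cum}}(x)\big) = R_2 R_1 (x - g) + g + (t_1 + t_2),$$

i.e., per node (blending weights aside) the rotations compose by multiplication and the translations by addition.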

The system relies heavily on the accuracy of the correspondences, so there are two additional filtering mechanisms that are not present in the code. I've implemented the bi-directional consistency check to the best of my knowledge in my fork (https://github.com/Algomorph/NeuralTracking). My collaborators and I have a pretty decent idea about how to implement the multi-keyframe consistency as well.
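
The basic idea of the bi-directional check is simple (a schematic sketch with an assumed array layout; the actual code in my fork differs in the details):

```python
import numpy as np

def bidirectional_consistency_mask(corr_st, corr_ts, threshold_px=2.0):
    # corr_st[v, u] = (u', v') match in the target for source pixel (u, v);
    # corr_ts is the reverse map (hypothetical layout).
    h, w = corr_st.shape[:2]
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    u2 = np.clip(np.round(corr_st[..., 0]).astype(int), 0, w - 1)
    v2 = np.clip(np.round(corr_st[..., 1]).astype(int), 0, h - 1)
    round_trip = corr_ts[v2, u2]                     # (h, w, 2), back in the source
    err = np.hypot(round_trip[..., 0] - us, round_trip[..., 1] - vs)
    return err < threshold_px                        # keep consistent correspondences
```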

Algomorph commented 2 years ago

> Hey @pablopalafox and @AljazBozic, I need some help with the keyframe integration.
>
> 1. During non-rigid registration, for some timestep T, neural tracking will take the last keyframe, floor(T/50), as the source frame, and the target frame would be T. Is this correct?
> 2. Unlike @BaldrLector, I am updating the graph nodes similarly to DynamicFusion. My second question is: how should I update the graph nodes so that they correspond to a particular keyframe? Should I just translate the graph nodes using the node translations calculated by neural tracking between the previous keyframe and the current keyframe?
> 3. Also, how should I update the pixel anchors and pixel weights for the new keyframe, since we might not necessarily have a segmentation mask for that keyframe?

I'm going to answer all the questions by @shubhMaheshwari in order with my best guess ATM, for posterity's sake. I'm hoping @AljazBozic will at least correct me if I get something wrong here.

  1. Yes, this is correct, i.e. see my answer about the keyframing mode above. However, this still doesn't explain the statement in the NNRT paper in section 4.5: "In addition to the dense depth ICP correspondences employed in the original method [DynamicFusion], which help towards local deformation refinement, we employ a keyframe-based tracking objective." The original depth ICP correspondences were based on a t-1 --> t (previous-to-current-frame) mode back in DynamicFusion, where the mesh extracted from the canonical frame (frame 0) was forward-warped to the current frame (t) and its depth was essentially rendered (raycast) onto the current frame in order to compute the error and gradient. Since, in the authors' code, the optimizer doesn't involve any rendering code whatsoever and instead relies directly on the RGB-D image at the keyframe (instead of t-1) as the "source" image and point cloud, it stands to reason that the authors first apply the neural tracking to approximate the node transformations and then render and refine the transformations using their own DynamicFusion implementation.

  2. I don't think the authors perform topological node graph updates at all. It doesn't seem to be necessary for the DeepDeform benchmark, which checks reconstruction of sub-sequences that are only 100 frames long. Likewise, I don't think they perform rigid ICP to compute and update camera extrinsics prior to the non-rigid optimization either, since the reconstructed scenes they show videos of don't have any camera motion, while the authors rely heavily on foreground masks (which obscure most static parts of the scene that work better for camera-motion tracking). That is not to say that these things cannot be implemented and added in.

  3. I think the authors have segmentation masks at all the keyframes, which they probably obtained using the same method as in DeepDeform. They're just not included with the dataset (only masks for actual frame pairs are included). The salient-object-detection mask generator I include in my code (https://github.com/Algomorph/NeuralTracking) seems to be a decent solution for that problem, but simply thresholding on the object depth also seems sufficient for many of the scenes in the test split.