Tangshitao / MVDiffusion

MVDiffusion: Enabling Holistic Multi-view Image Generation with Correspondence-Aware Diffusion, NeurIPS 2023 (spotlight)

Issues about feeding the output to TSDF #32

Closed 0010SS closed 1 week ago

0010SS commented 8 months ago

Thanks, Mr. Tang, for your awesome work! I have been generating a set of images using the depth_fix_interval mode on ScanNet. However, when I feed the output, including the poses, K, depth, and preds from the output log files, into TSDF fusion, it generates a weird mesh that does not seem to align. How can I solve this issue? Below is an example image:

[Screenshot: the generated, misaligned mesh]

Tangshitao commented 8 months ago

The ScanNet poses are camera-to-world. Can you check whether the poses are correct?
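
A quick way to check (a minimal sketch, assuming the pose files are 4x4 homogeneous matrices; the path is just an example):

import numpy as np

# A ScanNet pose file stores T_cam2world; its last column is the camera
# center in world coordinates.
T_cam2world = np.loadtxt("0_poses.txt")  # hypothetical path

# The classic extrinsic [R|t] (world-to-camera) is the inverse
T_world2cam = np.linalg.inv(T_cam2world)
R, t = T_world2cam[:3, :3], T_world2cam[:3, 3]

# Two equivalent ways to get the camera center must agree
assert np.allclose(T_cam2world[:3, 3], -R.T @ t, atol=1e-5)

# The rotation block must be a proper rotation (orthonormal, det = +1)
assert np.allclose(R @ R.T, np.eye(3), atol=1e-5)
assert np.isclose(np.linalg.det(R), 1.0, atol=1e-5)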

0010SS commented 8 months ago

[Screenshot: the mesh after inverting the poses, still misaligned]

As you said, I've transformed the camera-to-world matrix into the extrinsic (world-to-camera) matrix by taking its inverse with cam_pose = np.linalg.inv(cam_to_world). However, it still produces something like the picture above.

Is there any chance I could get the code you use to turn the output images into a mesh with tsdf-fusion-python? That would be a great help! Thank you so much!

Tangshitao commented 8 months ago

Can you successfully do TSDF fusion on the original ScanNet data? You can post the code here for us to analyze.

0010SS commented 8 months ago

Thanks for your reply! I cannot do TSDF fusion successfully on the original ScanNet data either; the room comes out with weird shapes like the ones above. I basically fed the data into tsdf-fusion-python, customizing the input paths and adding an inverse on the poses:

import time

import cv2
import numpy as np

import fusion  # fusion.py from tsdf-fusion-python

n_imgs = 50  # number of generated frames (placeholder; one every 20 ScanNet frames)

# Estimate voxel volume bounds from the depth maps and poses
cam_intr = np.loadtxt("mvd_data/scene0009_00_0/K.txt", delimiter=' ')
vol_bnds = np.zeros((3, 2))
for i in range(n_imgs):
  # Read depth image and camera pose
  depth_im = cv2.imread("mvd_data/scene0009_00_0/%d_depth.png" % (i * 20), -1).astype(float)
  depth_im /= 1000.  # depth is saved in 16-bit PNG in millimeters
  cam_to_world = np.loadtxt("mvd_data/scene0009_00_0/%d_poses.txt" % (i * 20))  # 4x4 rigid transformation matrix
  cam_pose = np.linalg.inv(cam_to_world)  # world-to-camera; note the tsdf-fusion-python demo passes camera-to-world poses directly
  # Compute camera view frustum and extend convex hull
  view_frust_pts = fusion.get_view_frustum(depth_im, cam_intr, cam_pose)
  vol_bnds[:, 0] = np.minimum(vol_bnds[:, 0], np.amin(view_frust_pts, axis=1))
  vol_bnds[:, 1] = np.maximum(vol_bnds[:, 1], np.amax(view_frust_pts, axis=1))

# Initialize the TSDF voxel volume over the estimated bounds (as in the demo)
tsdf_vol = fusion.TSDFVolume(vol_bnds, voxel_size=0.02)

# Loop through RGB-D images and fuse them together
t0_elapse = time.time()
for i in range(n_imgs):
  print("Fusing frame %d/%d" % (i + 1, n_imgs))

  # Read RGB-D image and camera pose
  color_image = cv2.cvtColor(cv2.imread("mvd_data/scene0009_00_0/%d_gt.png" % (i * 20)), cv2.COLOR_BGR2RGB)
  depth_im = cv2.imread("mvd_data/scene0009_00_0/%d_depth.png" % (i * 20), -1).astype(float)
  depth_im /= 1000.
  cam_to_world = np.loadtxt("mvd_data/scene0009_00_0/%d_poses.txt" % (i * 20))  # 4x4 rigid transformation matrix
  cam_pose = np.linalg.inv(cam_to_world)
  # Integrate observation into voxel volume (assume color aligned with depth)
  tsdf_vol.integrate(color_image, depth_im, cam_intr, cam_pose, obs_weight=1.)

fps = n_imgs / (time.time() - t0_elapse)
print("Average FPS: {:.2f}".format(fps))

# Get mesh from voxel volume and save to disk (can be viewed with MeshLab)
print("Saving mesh to mesh.ply...")
verts, faces, norms, colors = tsdf_vol.get_mesh()
fusion.meshwrite("mesh.ply", verts, faces, norms, colors)

The data is basically the output from your MVDiffusion. Thanks!
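
One way to debug this before fusing (a sketch of my own; depth_to_world is not part of tsdf-fusion-python) is to back-project each depth map into a world-space point cloud and check that the frames overlap. If the per-frame clouds land in disjoint places, the pose direction (camera-to-world vs. world-to-camera) is already wrong before TSDF even runs:

import numpy as np

def depth_to_world(depth_im, K, T_cam2world):
  # Back-project a depth map to 3D points in the world frame.
  # Pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy
  h, w = depth_im.shape
  u, v = np.meshgrid(np.arange(w), np.arange(h))
  z = depth_im
  x = (u - K[0, 2]) * z / K[0, 0]
  y = (v - K[1, 2]) * z / K[1, 1]
  pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1).reshape(-1, 4)
  pts_world = (T_cam2world @ pts_cam.T).T
  return pts_world[z.reshape(-1) > 0, :3]  # drop invalid (zero-depth) pixels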

dengchcs commented 3 weeks ago

Hi @0010SS, did you manage to reconstruct the mesh?

0010SS commented 2 weeks ago

Hi @dengchcs, yes, I have reconstructed the mesh successfully. The code works fine, but you need to tweak the intrinsic matrix bit by bit to make the meshes match.
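
For example (a sketch of my own; scale_intrinsics and the resolutions below are hypothetical), if the generated images are at a different resolution than the one K was calibrated for, fx, fy, cx, and cy have to be rescaled accordingly:

import numpy as np

def scale_intrinsics(K, old_wh, new_wh):
  # Rescale a pinhole intrinsic matrix when the image is resized.
  sx, sy = new_wh[0] / old_wh[0], new_wh[1] / old_wh[1]
  K = K.copy()
  K[0, 0] *= sx  # fx
  K[0, 2] *= sx  # cx
  K[1, 1] *= sy  # fy
  K[1, 2] *= sy  # cy
  return K

# e.g. if K was calibrated at ScanNet's 1296x968 color resolution but the
# generated images are 512x512 (hypothetical):
# K_new = scale_intrinsics(K, (1296, 968), (512, 512))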

dengchcs commented 1 week ago

Thanks! I can reconstruct the geometry now (though the color reconstruction is still buggy...). My problem was that my camera's coordinate system seemed to differ from that of ScanNet.
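
For anyone hitting the same thing, a minimal sketch of the kind of convention fix I mean (FLIP_YZ and opengl_to_opencv are my own names; this assumes an OpenGL-style camera with y up / z backward versus ScanNet/OpenCV's y down / z forward):

import numpy as np

# Right-multiplying re-expresses the camera-frame axes without moving the camera center
FLIP_YZ = np.diag([1.0, -1.0, -1.0, 1.0])

def opengl_to_opencv(T_cam2world_gl):
  # Convert an OpenGL-convention camera-to-world pose to the OpenCV/ScanNet convention
  return T_cam2world_gl @ FLIP_YZ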