
Bad result in running concept graph on ARKITScene dataset #63

Closed: JiaxiXie closed this issue 3 months ago

JiaxiXie commented 3 months ago

I am trying to run ConceptGraphs on the ARKitScenes dataset and have run into a few problems. I finished the tutorial and successfully generated the mapping in rerun.io using the Replica dataset. Here are the changes I made to run ConceptGraphs on ARKitScenes:

[Screenshot: config changes made for the ARKitScenes dataset]

Here is the problem: as the rerun.io clip below shows, objects overlap in the reconstruction. I am not sure what causes this. I would really appreciate any opinions or suggestions. Thanks a lot.

[Screenshot: rerun.io visualization showing overlapping objects]
krrish94 commented 3 months ago

From the visualization, it looks like the camera intrinsics / extrinsics are incorrectly configured.

I haven't used ARKitScenes, so I'm unsure what the camera conventions for that dataset look like. You also seem to be using the png_depth_scale value from the Replica sequences. For most other datasets that parameter is 1000 (except TUM RGB-D, for which I believe it is 5000).
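For reference, here is a minimal sketch of how png_depth_scale is typically applied when loading a 16-bit depth PNG (the file path below is just an example):

import cv2
import numpy as np

png_depth_scale = 1000.0  # stored depth is in millimeters for most datasets

# IMREAD_UNCHANGED keeps the raw 16-bit values instead of converting to 8-bit
depth_raw = cv2.imread("lowres_depth/example_frame.png", cv2.IMREAD_UNCHANGED)
depth_m = depth_raw.astype(np.float32) / png_depth_scale  # now in meters

# Sanity check: indoor scenes should land roughly in the 0-10 m range
print(depth_m.min(), depth_m.max())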

JiaxiXie commented 3 months ago

Here is the dataset format of ARKitScenes:

/[scene_id]
    /[scene_id]_frames
        /lowres_depth (depth images)
        /lowres_wide (color images)
        /lowres_wide_intrinsics (camera intrinsics)
        /lowres_wide.traj (camera extrinsics)

lowres_wide (color images)

lowres_depth (depth images)

lowres_wide_intrinsics (camera intrinsics) contains single-line, space-delimited text files with the following fields: width, height, focal_length_x, focal_length_y, principal_point_x, principal_point_y. Here are the specific intrinsic values for this dataset: 256 192 212.027 212.027 127.933 95.9333
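A minimal sketch of how those six fields could be parsed into a 3x3 intrinsics matrix (the helper name is mine, not from the repo):

import numpy as np

def load_arkit_intrinsics(path):
    # File contents look like: "256 192 212.027 212.027 127.933 95.9333"
    w, h, fx, fy, cx, cy = np.loadtxt(path)
    K = np.array([[fx, 0.0, cx],
                  [0.0, fy, cy],
                  [0.0, 0.0, 1.0]])
    return int(w), int(h), K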

lowres_wide.traj is a space-delimited file where each line represents a camera pose at a particular timestamp.
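Assuming each line is a timestamp, a 3-component axis-angle rotation, and a 3-component translation in meters (which matches the example line quoted further down), here is a sketch of converting one line into a 4x4 pose:

import numpy as np
from scipy.spatial.transform import Rotation

def traj_line_to_pose(line):
    # Assumed layout: timestamp rx ry rz tx ty tz
    vals = [float(v) for v in line.split()]
    timestamp, rotvec, trans = vals[0], vals[1:4], vals[4:7]
    T = np.eye(4)
    T[:3, :3] = Rotation.from_rotvec(rotvec).as_matrix()
    T[:3, 3] = trans
    # Note: whether this is camera-to-world or world-to-camera, and the axis
    # convention, still needs to be checked against the ARKitScenes docs
    return timestamp, T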

For camera intrinsics, I don't think ARKitScenes 3DOD provides a specific value for depth_scale. However, I load the other parameters as stated in the file.

For extrinsics, I use the following code to print all the camera poses when running concept-graphs/conceptgraph/slam/rerun_realtime_mapping.py, given stride=10 in rerun_realtime_mapping.yaml:

unt_pose = dataset.poses[frame_idx]
print(f"==============dataset poses===============\n{unt_pose}, \n{frame_idx*10+1}")
unt_pose = unt_pose.cpu().numpy()

Here is one example, processing line 261 of scene 40753679's trajectory. The printout is given below:

==============dataset poses===============
tensor([[-0.7846,  0.6184,  0.0439,  0.6561],
        [-0.2921, -0.3062, -0.9060,  0.9303],
        [-0.5469, -0.7237,  0.4209,  1.4225],
        [ 0.0000,  0.0000,  0.0000,  1.0000]]), 
261

The original line 261 from lowres_wide.traj is given below:

6816.11971479 0.42390419235973026 1.3735961783384674 -2.116927535648726 0.65612 0.930267 1.42246

I did not see any error there; the last three values (the translation) match the last column of the printed pose. Does that mean I am getting the right extrinsics?

In case it helps with verification, I also have a .json with object annotations and a .ply mesh. Are there any other ways to verify? Thanks!

krrish94 commented 3 months ago

Quick comment on the visualization posted above. It seems like you are trying to use the ARKitScenes dataset, but the rerun visualization appears to be from the Replica office0 sequence. Is that intentional? If not, there could be an error somewhere in your config.

The preferred way to verify if the intrinsics and extrinsics are indeed correct is to visualize the 3D reconstruction produced by gradslam (running the script here on the first 100-200 frames of your dataset, with a frameskip of ~25, should suffice; you wouldn't need to run on the full sequence). It is hard for me to tell by just looking at the dataset poses whether they're accurate, since I haven't used that dataset before.
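I haven't run this on ARKitScenes, but the check would look roughly like the sketch below. The dummy tensors stand in for your real frames; RGBDImages and PointFusion are from gradslam's documented API, with shapes in its (batch, sequence, H, W, C) convention:

import open3d as o3d
import torch
from gradslam import RGBDImages
from gradslam.slam import PointFusion

# Dummy tensors; swap in real ARKitScenes color/depth/intrinsics/poses
# for ~8 frames sampled with a frameskip of ~25
L, H, W = 8, 192, 256
colors = torch.rand(1, L, H, W, 3)
depths = torch.rand(1, L, H, W, 1)
intrinsics = torch.eye(4).view(1, 1, 4, 4)
poses = torch.eye(4).expand(L, 4, 4).unsqueeze(0).clone()

rgbd = RGBDImages(colors, depths, intrinsics, poses)
slam = PointFusion(odom="gt", device="cpu")  # "gt" fuses using the given poses
pointclouds, _ = slam(rgbd)

# If the intrinsics/extrinsics are right, the fused cloud should look like
# a coherent room rather than overlapping copies of it
o3d.visualization.draw_geometries([pointclouds.open3d(0)])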

JiaxiXie commented 3 months ago

Thanks for the suggestions!

krrish94 commented 3 months ago

The issue with gradslam is perhaps due to using the main branch? You will need to use the conceptfusion branch (https://github.com/gradslam/gradslam/tree/conceptfusion) to access features essential for this repo.
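If it helps, that branch can be installed directly with pip:

pip install git+https://github.com/gradslam/gradslam.git@conceptfusion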

JiaxiXie commented 3 months ago

Thanks a lot! It works, but the result is still pretty weird, as shown below. A few potential problems come to mind:

krrish94 commented 3 months ago

IMO, printing out the camera pose isn't an issue (the rounding off happens when you print most arrays/tensors). You could try loading in double precision and then casting to float, but I don't think that'd be a major factor.

As for ARKitScenes, I unfortunately have no experience working with it. I'd double-check the intrinsics and camera poses (and axis conventions) carefully.

I looked up their documentation very quickly, and based on this description, the depth is in millimeters, so the depth scale parameter must be set to 1000.0.
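Assuming an ARKitScenes dataconfig mirrors the layout of the Replica one (a camera_params block in the dataset yaml), it would look something like this, using the intrinsics you quoted; this is a sketch, not a config shipped with the repo:

camera_params:
  image_height: 192
  image_width: 256
  fx: 212.027
  fy: 212.027
  cx: 127.933
  cy: 95.9333
  png_depth_scale: 1000.0  # ARKitScenes depth PNGs are in millimeters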

JiaxiXie commented 3 months ago

Thanks for all the help this week. I resolved the problem by changing the way the camera poses are loaded. It really helped a lot. Also, thank you for providing such amazing code!

[Screenshots: final reconstruction after the fix]