facebookresearch / projectaria_tools

projectaria_tools is a C++/Python open-source toolkit to interact with Project Aria data
https://facebookresearch.github.io/projectaria_tools/docs/intro
Apache License 2.0

Slight depth/image potential mismatch in ADT dataset #81

Closed ArmandB closed 4 months ago

ArmandB commented 4 months ago

Hello, I'm looking to use the ADT dataset for depth estimation, taking into consideration dynamic factors such as people.

However, when I extract the data from the vrs files (synthetic_video.vrs, video.vrs, depth_images_with_skeleton.vrs), the depth data seems to be slightly offset from the images. I've tried this for the "Apartment_release_multiskeleton_party_seq123/1WM10360071292_optitrack_release_multiskeleton_party_seq123" sequence on the 'camera-slam-left' stream and the 345-2 stream.

For video.vrs, the timestamps of the image below (which I've zoomed in on) are 4978173257862 for the greyscale and 4978173256999 for the depth (863 ns apart). As is visible in the right corner, the top of the stool is more cut off in the depth image than in the greyscale. image

For synthetic_video.vrs, the timestamps are 4978147143000 (main) and 4978139929000 (depth), about 7 ms apart, and the depth image seems to show a larger gap between the top of the stool and the edge of the frame, although maybe this is due to motion blur and image noise. image

Am I missing something? Any help with using the Aria ADT dataset for depth estimation would be much appreciated. Thank you in advance! :D

Edit: I'm using this time domain: time_domain = TimeDomain.DEVICE_TIME
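
A minimal sketch of one way to extract and compare these timestamps with the projectaria_tools Python data provider (the file names, the 'camera-slam-left' label, and the 345-2 stream ID are the ones mentioned above; the exact calls should be checked against the current API docs):

```python
# Sketch (not from the original post): read device-time timestamps for the two
# streams discussed above and report each SLAM frame's offset to its nearest
# depth frame. Paths and stream IDs are the ones mentioned in this issue.
import bisect

from projectaria_tools.core import data_provider
from projectaria_tools.core.sensor_data import TimeDomain
from projectaria_tools.core.stream_id import StreamId

video = data_provider.create_vrs_data_provider("video.vrs")
depth = data_provider.create_vrs_data_provider("depth_images_with_skeleton.vrs")

slam_left_id = video.get_stream_id_from_label("camera-slam-left")
depth_id = StreamId("345-2")

slam_ts = video.get_timestamps_ns(slam_left_id, TimeDomain.DEVICE_TIME)
depth_ts = depth.get_timestamps_ns(depth_id, TimeDomain.DEVICE_TIME)

for t in slam_ts[:5]:
    # Find the depth timestamp closest to this SLAM timestamp.
    i = bisect.bisect_left(depth_ts, t)
    nearest = min(depth_ts[max(i - 1, 0):i + 1], key=lambda d: abs(d - t))
    print(f"slam {t} ns -> depth {nearest} ns (dt = {nearest - t} ns)")
```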

nickcharron commented 4 months ago

Hi @ArmandB, thanks for your detailed question, with example images!

There are a few factors at play here that all affect the results you see, so I want to carefully address each of them:

  1. Pose errors:

Our ground truthing system isn't perfect, but it is definitely one of the highest-accuracy methods to date in the literature for similar datasets. So keep in mind that some of the misalignment you see can come from this. These error sources are a combination of time sync errors, OptiTrack tracking errors, calibration errors, and object modeling errors. In our research paper we did a thorough analysis of the errors and report the average rotational and translational errors you can expect for an object relative to the camera at any given frame. While analyzing the errors you are seeing, check whether they fall within the expected error metrics that we provide. If so, then your code is fine and it's an error you will have to live with. If you think the errors are larger, then perhaps you are doing something wrong, or there's an error in our data/tooling (so please let us know!).

  2. Rendering artifacts:

We actually used different rendering engines and pipelines, written by different people, to produce the synthetic VRS vs. the depth & segmentation images. So it's possible that the time alignment, and how the GT poses were used (e.g., interpolation methods), differs between the two pipelines. This could explain what you are seeing. The other difference that has a significant effect is that we had to use a larger vignetting for the synthetic images due to the camera models available in the renderer we used.

  3. Timing misalignment:

The dt of ~863 ns between the slam image and the depth image is negligible, so I would not worry about that; it's probably just a rounding error somewhere in the tooling or GT pipeline. In fact, Aria is only timestamped to the nearest microsecond due to the precision of the clock. We only store data in ns to keep the tooling consistent and to avoid restricting other data to the accuracy of Aria's clock. That being said, any error smaller than a microsecond can likely be ignored. The larger offset of ~7 ms between depth images and synthetic images is more concerning. It's possible that there are some misalignment issues with the rendering engine, as I described above. Is this offset consistent? I can look more into it, but if it's a problem with the rendering then that would be harder to fix, and we may not have the bandwidth to address it.
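
As a side note on pairing frames despite these small offsets: rather than requiring exact timestamp equality, one can query the depth stream by the SLAM frame's device time with a closest-match lookup. A minimal sketch, assuming the data provider's get_image_data_by_time_ns with TimeQueryOptions.CLOSEST and the paths/stream IDs from this issue (the (image, record) return pair should be checked against your version of the bindings):

```python
# Sketch: look up the depth frame nearest to a given SLAM device timestamp, so
# sub-microsecond offsets like the ~863 ns above are absorbed by the query.
# The path, stream ID, and timestamp are the ones mentioned in this issue.
from projectaria_tools.core import data_provider
from projectaria_tools.core.sensor_data import TimeDomain, TimeQueryOptions
from projectaria_tools.core.stream_id import StreamId

depth = data_provider.create_vrs_data_provider("depth_images_with_skeleton.vrs")
depth_id = StreamId("345-2")

t_ns = 4978173257862  # device timestamp of the greyscale SLAM frame above
# Returns an (ImageData, ImageDataRecord) pair in recent Python bindings.
image_data, record = depth.get_image_data_by_time_ns(
    depth_id, t_ns, TimeDomain.DEVICE_TIME, TimeQueryOptions.CLOSEST
)
print("nearest depth frame:", record.capture_timestamp_ns, "ns")
```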

In short, from the few photos that you sent, it looks to me like you are using things properly and the discrepancies are just artifacts of how we generated the data. I'd be happy to discuss more, or look more deeply into specifics if you'd like.

ArmandB commented 4 months ago

@nickcharron Thank you for your speedy and thorough response!

  1. I will take a closer look at the errors given in the paper to see whether what I'm seeing falls within the expected error metrics.
  2. Good to know about the vignetting in the synthetic dataset (which is what I think I'll go with, given the known challenges you mentioned with hand tracking, until MPS comes in).
  3. I see. So ~863 ns is a rounding error, but anything in the us or ms range could be an issue. I've only looked at 2 sequences so far, just to try things out, and the misalignment only appeared in this one.

Thanks for the feedback that it looks like I'm using things correctly and that it's likely just artifacts.

Closing this issue for now, but I will potentially re-open later or make a new issue if I find that the offset is consistent or widespread. Thank you again for all your help!

ArmandB commented 4 months ago

Ok, so I've investigated the time offset for 5 inputs.

depth.vrs/video.vrs (main) alignment - this is pretty good. In all my tests, the offset is <1 us, which you have said is negligible.

depth.vrs/synthetic_video.vrs (synth_main) alignment - there seem to be 2 potential issues here:

  1. depth and synthetic_video typically start misaligned by a few ms (8-14ms in my tests). This happened in all sequences I looked at except "Apartment_release_golden_skeleton_seq100_10s_sample".
  2. Their time between frames is different. depth.vrs has 33328000 ns between frames, while synthetic_video.vrs has a repeating pattern of 33333000 ns, 33333000 ns, 33334000 ns. Thus, even if the timestamps start out the same, the alignment changes by 5000-6000 ns (5-6 us) every frame, and over a couple hundred frames this grows to ms of misalignment (see the sketch below).
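
Here is a minimal sketch of the interval/drift check described in item 2, with the timestamps simulated from the intervals reported above (in practice they would come from the vrs files, e.g. via get_timestamps_ns); the 10 ms starting offset is just an illustrative value within the 8-14 ms range from item 1:

```python
# Simulate depth/synthetic device timestamps (ns) using the intervals reported
# above, then measure how the offset to the nearest depth frame drifts.
import numpy as np

n = 300
depth_ts = 33328000 * np.arange(n, dtype=np.int64)

pattern = np.array([33333000, 33333000, 33334000], dtype=np.int64)
synth_intervals = np.tile(pattern, n)[: n - 1]
synth_ts = 10_000_000 + np.concatenate(([0], np.cumsum(synth_intervals)))

print("depth intervals (ns):", np.unique(np.diff(depth_ts)))
print("synth intervals (ns):", np.unique(np.diff(synth_ts)))

# Offset of each synthetic frame from its nearest depth frame: it starts at the
# initial ~10 ms offset and grows by roughly 5.3 us per frame.
idx = np.clip(np.searchsorted(depth_ts, synth_ts), 1, len(depth_ts) - 1)
nearest = np.where(
    np.abs(depth_ts[idx] - synth_ts) <= np.abs(depth_ts[idx - 1] - synth_ts),
    depth_ts[idx],
    depth_ts[idx - 1],
)
offset_us = (synth_ts - nearest) / 1000.0
print("offset at frames 0, 100, 200 (us):", offset_us[[0, 100, 200]])
```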

In the 2 instances I've seen where the time synchronization is <5 us (frame 0 of Apartment_release_golden_skeleton_seq100_10s_sample and frame ~1624 of Lite_release_recognition_BambooPlate_seq031), the synthetic and depth images seem to match up well on manual inspection. So it seems like synchronizing the frames would allow synthetic_video.vrs and depth.vrs to be used for depth estimation research.

I've attached the Excel data I've been using to check time alignment for synthetic_main, main, and depth, in case it's helpful. I acquired this by getting the intersection of timestamps between the vrs files and then printing out all timestamps in the intersection. An index of "0" means it's the first frame in the intersection, not the first frame in the .vrs file.
Aria ADT - main_depth Sync.xlsx Aria ADT Sync synth_depth.xlsx

Would it be possible from Meta's end to do this time synchronization/alignment, or, as you said, would this require too much bandwidth?


nickcharron commented 4 months ago

Hi @ArmandB, thanks for sharing the detailed analysis. I will try to carve out some time early this week to look into this further.

nickcharron commented 4 months ago

Hi @ArmandB, thanks for your patience.

I did some analysis on the issue you brought up, and you are correct: there is a time alignment issue between the real Aria images and their closest equivalent images in the synthetic VRS. You can take a look at my more detailed analysis here, but here is a summary. The time between synthetic images is slightly different from that of the real data (by approximately 6 microseconds per frame). This means that the alignment between images will drift over time, as you described. Here's a graph showing the time difference for the golden dataset: img1

It turns out that the renderer uses a fixed frame rate, which is slightly different from that of the real data (for a reason we don't fully understand), so the images won't align perfectly. Note, however, that the poses are sampled from a continuous trajectory, so the poses used to generate the synthetic images are correct; they're just not sampled at the same times as the real data. The solution would be to change the renderer to take the real timestamps from the raw vrs and re-render all sequences. This is a significant amount of work that we will add to our planning as a feature request, but I'm not sure when it can be done.
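
For a rough sense of scale, a back-of-envelope sketch of the drift using the intervals reported earlier in this thread (the exact 30 Hz renderer rate is an assumption inferred from the 33333/33333/33334 ns pattern):

```python
# Back-of-envelope drift estimate from the intervals reported in this thread.
real_dt_ns = 33_328_000        # interval between real Aria / depth frames
synth_dt_ns = 100_000_000 / 3  # ~33,333,333 ns, i.e. an exactly 30 Hz renderer

drift_per_frame_ns = synth_dt_ns - real_dt_ns
print(f"drift per frame: {drift_per_frame_ns:.0f} ns")  # ~5333 ns (~5.3 us)
print(f"drift after 1 min @ 30 fps: {drift_per_frame_ns * 1800 / 1e6:.1f} ms")  # ~9.6 ms
```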

Sorry for the inconvenience, and thanks again for providing the feedback and detailed analysis! Please let me know if there is anything else we can help with.

ArmandB commented 4 months ago

@nickcharron thank you so much for your thorough response!

No worries. I'm a bit new to depth estimation, and as I've been going around looking at other datasets, I've realized just how accurate Aria ADT is, even with these small time differences. With this extra perspective, I'm realizing that a mismatch of a few pixels is really good compared to some alternatives that people have published papers on.

For other datasets like ScanNet, there are cases where the alignment is 40+ pixels off: https://github.com/ScanNet/ScanNet/issues/101. There's a similar issue when trying to align the point cloud for the EuRoC dataset using a script from Google Research: https://github.com/google-research/google-research/issues/472

Thanks again for your help!

nickcharron commented 4 months ago

@ArmandB that's great feedback, thanks for sharing! Happy to hear from a user that the accuracy of our dataset is much better than alternatives :)