alexklwong / void-dataset

Visual Odometry with Inertial and Depth (VOID) dataset

Script(s) to extract from raw data to the format in VOID #6

Closed rakshith95 closed 2 years ago

rakshith95 commented 2 years ago

Hello, can you share the scripts used to convert from the raw bag files (or however the sequences were captured) into the format you currently use for the dataset, i.e. groundtruth, sparse, image, validity_map, and K?

I would like to use data I've captured in addition to the ones you've already provided, so it would be super helpful if you could share those.

Thanks!

alexklwong commented 2 years ago

Sorry, for some reason I don't get alerts on this repo. Sure, you can refer to

https://github.com/alexklwong/void-dataset/issues/5

to read from the raw dataset. Let me know if you have more questions.

rakshith95 commented 2 years ago

Hello @alexklwong, thanks for getting back. I'm having some trouble building XIVO to generate the sparse points for my captured videos, but I have a few questions about how the data was generated:

  1. The rate of the images in the rosbag seems to be about 7-8 fps (I just checked it for visionlab0). The D435i outputs images at 30 fps for the 640x480 resolution if I'm not wrong, so I assume you use only a subset and not every frame. How is this determined? Do you simply sample, say, 1 out of every n images, or do you choose 'keyframes'? If it's the latter, what is the selection based on?
  2. Since you obtain the sparse depth from the ground truth, would computing 1500 keypoints for each image and then sampling the ground truth at those point locations work in a similar way, or, for applications such as calibrated-backprojection-network, is it necessary that the sparse points are tracked over the 3 frames?
alexklwong commented 2 years ago

Right,

  1. I think the rosbag should have all the frames. The dataset, however, is the output of CORVIS (an alpha version of XIVO) and contains only a subset of the frames. Frames were filtered based on sufficient parallax from the previous kept frame: suppose we are at time t; we skip every frame after t that has less than 1 cm of translation until we reach a frame at time t + \tau whose translation from the frame at time t is at least 1 cm (a minimal sketch of this filter is shown after this list).

  2. Yes, that's correct: computing 1500 key points for each frame is sufficient to simulate a similar set-up. The 1500 points contain both inliers and outliers (the inlier set is really only around 60-150 points; this is the void150 dataset). The key points do not need to be tracked across frames for calibrated-backprojection-network to work, since it only takes a synchronized image, sparse depth map, and calibration as input. During training we also don't assume that the points are tracked (in reality, yes, inliers do appear across frames).
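For reference, here is a minimal sketch of the parallax filter from point 1, assuming per-frame poses are available as 4x4 camera-to-world matrices (the function is illustrative, not the actual CORVIS/XIVO code):

```python
import numpy as np

def filter_frames_by_translation(poses, min_translation=0.01):
    """
    Keep a frame only once the camera has translated at least
    `min_translation` meters (1 cm here) from the last kept frame.

    poses : list of 4x4 camera-to-world matrices, one per frame
    Returns the indices of the kept frames.
    """
    keep = [0]  # always keep the first frame
    last_position = poses[0][:3, 3]

    for idx in range(1, len(poses)):
        position = poses[idx][:3, 3]
        if np.linalg.norm(position - last_position) >= min_translation:
            keep.append(idx)
            last_position = position

    return keep
```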

XIVO uses a corner-based detector, so if you want to simulate it you can use https://github.com/alexklwong/learning-topology-synthetic-data/blob/master/setup/setup_dataset_scenenet.py#L72 as an example, or even https://github.com/alexklwong/learning-topology-synthetic-data/blob/master/setup/setup_dataset_scenenet.py#L99
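To simulate the corner-based sampling, here is a rough OpenCV sketch that detects key points and samples the ground-truth depth at those locations to form a sparse depth map and validity map (this only loosely follows the linked setup script; Shi-Tomasi is used as a stand-in for the Harris-based detection, and the names/parameters are illustrative):

```python
import cv2
import numpy as np

def simulate_sparse_depth(image, ground_truth_depth, n_points=1500):
    """
    Detect corners on the image and sample ground-truth depth at those
    locations, producing a sparse depth map and validity map similar in
    spirit to the VOID sparse points (not the actual XIVO pipeline).
    """
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Corner-based key point detection
    corners = cv2.goodFeaturesToTrack(
        gray, maxCorners=n_points, qualityLevel=0.01, minDistance=5)
    corners = corners.squeeze(1).astype(int)  # (N, 2) array of (x, y)

    sparse_depth = np.zeros_like(ground_truth_depth)
    validity_map = np.zeros_like(ground_truth_depth)

    # Copy ground-truth depth only where a key point landed on valid depth
    for x, y in corners:
        if ground_truth_depth[y, x] > 0:
            sparse_depth[y, x] = ground_truth_depth[y, x]
            validity_map[y, x] = 1

    return sparse_depth, validity_map
```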

rakshith95 commented 2 years ago

Thanks a lot for the information! Closing the issue now

rakshith95 commented 2 years ago

Hello @alexklwong, in this, should line 35, N_INIT_CORNER = 15000, be 1500 instead of 15000?

alexklwong commented 2 years ago

Oh no, that's correct, it should be 15000. This is the number of initial corner points to be detected. We noticed that Harris tends to detect large clusters of points around a location, so you may get ~100 points near a single corner. This might be because we didn't tune Harris, since it had to be run over a large number of scenes. So our fix was to detect more points and feed them to k-means, which then optimizes for the 1500 means. https://github.com/alexklwong/learning-topology-synthetic-data/blob/master/setup/setup_dataset_scenenet.py#L85
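A rough sketch of that over-detect-then-cluster step (loosely following the linked script; the Harris parameters and the use of scikit-learn's KMeans here are illustrative, not the exact implementation):

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def detect_corners_with_kmeans(gray, n_initial=15000, n_points=1500):
    """
    Over-detect Harris corners (they tend to cluster around strong
    corners), then reduce them to `n_points` well-spread locations by
    taking the k-means cluster centers.
    """
    # Harris corner response for every pixel
    response = cv2.cornerHarris(np.float32(gray), blockSize=2, ksize=3, k=0.04)

    # Take the n_initial strongest responses as candidate corners
    flat_indices = np.argsort(response.ravel())[-n_initial:]
    ys, xs = np.unravel_index(flat_indices, response.shape)
    candidates = np.stack([xs, ys], axis=-1).astype(np.float32)

    # Cluster the candidates; the cluster centers are the final key points
    kmeans = KMeans(n_clusters=n_points, n_init=1).fit(candidates)
    return kmeans.cluster_centers_  # (n_points, 2) array of (x, y)
```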

rakshith95 commented 2 years ago

Oh I see, thank you