aofrancani / TSformer-VO

Implementation of the paper "Transformer-based model for monocular visual odometry: a video understanding approach".
https://arxiv.org/abs/2305.06121

camera intrinsic parameters #8

Closed · spokV closed this 11 months ago

spokV commented 11 months ago

Hi, thanks for the great work. It seems you are not using any of the camera's intrinsic parameters. Does the KITTI dataset already compensate for that?

aofrancani commented 11 months ago

Hey, thanks for trying out this work. You're right, I'm not using the camera's intrinsic parameters. You can find them in the KITTI metadata (the calib.txt files), as I do here: https://github.com/aofrancani/TSformer-VO/blob/main/datasets/kitti.py#L122. However, my only input is the RGB images, so with an end-to-end deep learning approach we expect the network to learn all the necessary parameters internally during feature extraction. Since I'm not using the Essential matrix to estimate the pose, I don't explicitly need those intrinsic parameters.
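For reference, here is a minimal sketch of how those calib.txt files can be parsed (the actual code in datasets/kitti.py may differ in details). For the rectified KITTI odometry images, each `P*` line holds a 3x4 projection matrix P = K [I | 0], so the intrinsics K are its left 3x3 block:

```python
import numpy as np

def load_intrinsics(calib_path, cam_id="P0"):
    """Return the 3x3 intrinsic matrix K for one camera from a KITTI
    odometry calib.txt file (lines look like "P0: <12 floats>")."""
    with open(calib_path) as f:
        for line in f:
            parts = line.split()
            if parts and parts[0].rstrip(":") == cam_id:
                P = np.array(parts[1:], dtype=np.float64).reshape(3, 4)
                return P[:, :3]  # rectified KITTI images: P = K [I | 0]
    raise KeyError(f"{cam_id} not found in {calib_path}")
```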

spokV commented 11 months ago

Hi @aofrancani, thanks! Is it right to say that the model will not generalize well to other cameras with different intrinsic parameters? Did you try it?

aofrancani commented 11 months ago

Yes, you are right about the generalization. I believe this is the major limitation of supervised deep learning methods in the context of visual odometry. Ideally, these methods require large-scale, diverse labeled data covering different cameras, dynamic environments, and varying lighting and weather conditions such as rain, snow, direct sunlight, and night. Unfortunately, VO datasets are not currently large enough to fully exploit the Transformer architecture's potential for handling extensive data.

And no, I haven't explored generalization across different configurations and datasets (at least not yet). Perhaps you could try mixing datasets with different calibrations and using the intrinsic parameters as additional input to the model, though this may only partially mitigate the generalization challenge. Another strategy is to explore transfer learning techniques, but achieving good generalization remains a significant challenge in the field. As I have observed in recent surveys, researchers tend to adopt hybrid approaches, using deep learning models only in certain components of visual odometry (e.g., feature extraction, matching, depth estimation) while still incorporating geometric constraints to estimate the pose. They also make use of additional sensors, such as the IMU in visual-inertial odometry (VIO), to improve performance.
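To illustrate the "intrinsics as additional input" idea (a purely hypothetical sketch, not part of TSformer-VO), one could project the normalized intrinsics into the token embedding space and add them to the visual tokens, so a model trained on mixed datasets can condition on the camera. Normalizing by image size keeps the conditioning resolution-independent:

```python
import torch
import torch.nn as nn

class IntrinsicsConditioning(nn.Module):
    """Hypothetical module: embed normalized intrinsics (fx, fy, cx, cy)
    and add them to the visual token embeddings."""

    def __init__(self, embed_dim):
        super().__init__()
        self.proj = nn.Linear(4, embed_dim)

    def forward(self, tokens, K, img_w, img_h):
        # tokens: (B, N, embed_dim); K: (B, 3, 3) intrinsic matrices
        intr = torch.stack([K[:, 0, 0] / img_w,   # fx, normalized
                            K[:, 1, 1] / img_h,   # fy
                            K[:, 0, 2] / img_w,   # cx
                            K[:, 1, 2] / img_h],  # cy
                           dim=-1)
        return tokens + self.proj(intr).unsqueeze(1)  # broadcast over tokens
```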

spokV commented 11 months ago

Thanks again! From what I've seen in other VO solutions, the camera's intrinsic parameters are used to remove the lens distortion from each frame before it is fed into the model. Could it be useful (in terms of generalization) to do the same with your E2E model?
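Something along these lines, as a hypothetical OpenCV preprocessing step (KITTI odometry frames are already rectified, so this would mainly matter for raw footage from other cameras):

```python
import cv2
import numpy as np

def undistort_frame(frame, K, dist_coeffs):
    """Remove lens distortion from a frame before it enters the model,
    given the intrinsic matrix K and distortion coefficients."""
    h, w = frame.shape[:2]
    # Refine K so the undistorted image keeps the full field of view
    new_K, _ = cv2.getOptimalNewCameraMatrix(K, dist_coeffs, (w, h), alpha=0)
    return cv2.undistort(frame, K, dist_coeffs, newCameraMatrix=new_K)
```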

aofrancani commented 11 months ago

Oh, I see what you mean. To be honest, I don't know; I think it could be useful... I was also wondering if we could use distortion as an image augmentation to increase the amount of data and help with generalization. It should work, but I think we will only know by trying it and evaluating.
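Just to sketch the idea (hypothetical and untested): a random one-parameter radial warp of the rectified frames could act as such an augmentation, simulating different lenses during training:

```python
import cv2
import numpy as np

def random_radial_distortion(img, K, k1_range=(-0.3, 0.3)):
    """Warp a rectified frame with a random radial scaling of the pixel
    coordinates (one-parameter model), simulating other lenses."""
    h, w = img.shape[:2]
    k1 = np.random.uniform(*k1_range)
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    # For every output pixel, sample the source at radially scaled
    # normalized coordinates, which curves straight lines as a lens would
    u, v = np.meshgrid(np.arange(w, dtype=np.float32),
                       np.arange(h, dtype=np.float32))
    x, y = (u - cx) / fx, (v - cy) / fy   # normalized image coordinates
    r2 = x * x + y * y
    scale = 1.0 + k1 * r2                 # radial scaling factor
    map_x = (x * scale * fx + cx).astype(np.float32)
    map_y = (y * scale * fy + cy).astype(np.float32)
    return cv2.remap(img, map_x, map_y, interpolation=cv2.INTER_LINEAR)
```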