TRI-ML / vidar


Questions about SESC #61

Closed · qsisi closed this 9 months ago

qsisi commented 9 months ago

Thanks for open-sourcing this repo. I have several questions for @TakayukiKanai.

In the SESC implementation, the RelativePoseNet contains each camera's extrinsics as parameters for every scene of the DDAD dataset (named such as `extrinsicsID:152`). Where were these extrinsics directly obtained from? Then the extrinsics were trained with gt_extrinsic using the pose_consistency_loss. This looks very strange to me, because the extrinsics are not output by any pose or depth network but come directly from a dictionary (self._param_dict) in the RelativePoseNet. So what is the relationship between the "learned extrinsics" and depth & pose?

Looking forward to your reply, @TakayukiKanai!

TakayukiKanai commented 9 months ago

Hi, thanks for your interest! Happy to answer the question!

> the RelativePoseNet contains ...

Yes, that's true. Here "Extrinsics" are defined as the relative pose from the main camera; thus they are just 6DoF (translation + Euler angles), are provided by each sequence and each camera, and are learned through backpropagation.
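As a rough, minimal sketch of what such per-sequence, per-camera learnable extrinsics could look like (illustrative names and structure only, not the actual vidar code):

```python
import torch
import torch.nn as nn

class LearnableExtrinsics(nn.Module):
    """Illustrative sketch (not the actual vidar implementation): one learnable
    6DoF vector (tx, ty, tz, roll, pitch, yaw) per (sequence, camera) pair."""

    def __init__(self, sequence_ids, camera_ids):
        super().__init__()
        # One 6DoF parameter per sequence/camera, relative to the main camera.
        self._param_dict = nn.ParameterDict({
            f"{seq}_{cam}": nn.Parameter(torch.zeros(6))
            for seq in sequence_ids for cam in camera_ids
        })

    def forward(self, seq, cam):
        # Looked up by key, not predicted from images; gradients from the
        # self-supervised losses update these values directly.
        return self._param_dict[f"{seq}_{cam}"]
```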

> Then the extrinsics were trained with gt_extrinsic using ...

"Trained with gt_extrinsic" is not correct.

Each "relative pose from the main camera" (= self._param_dict), the ego-motion per camera (a ResNet18-based NN, generally called `pose` in the codebase), and the depth prediction (also modeled by a ResNet18-based NN) are used for training the whole pipeline.

So, long story short: computed_pose provides the ego-motion of each camera, ext_pred_forward is the relative pose from the front-mounted camera, and combining them gives every camera's location relative to the front-mounted camera at t=0.
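In code, that composition could look roughly like the following (a hedged sketch: the 6DoF layout, Euler convention, and frame directions are assumptions here, not the actual vidar implementation):

```python
import math
import torch

def to_transform(params6d):
    """6DoF (tx, ty, tz, roll, pitch, yaw) -> 4x4 homogeneous transform.
    Uses one common Euler convention (Rz @ Ry @ Rx); the codebase may differ."""
    tx, ty, tz, rx, ry, rz = [float(v) for v in params6d]
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    Rx = torch.tensor([[1., 0., 0.], [0., cx, -sx], [0., sx, cx]])
    Ry = torch.tensor([[cy, 0., sy], [0., 1., 0.], [-sy, 0., cy]])
    Rz = torch.tensor([[cz, -sz, 0.], [sz, cz, 0.], [0., 0., 1.]])
    T = torch.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx
    T[:3, 3] = torch.tensor([tx, ty, tz])
    return T

# Hypothetical composition (frame conventions are an assumption): the pose of
# camera i at time t, expressed in the frame of the front camera at t=0,
# combines the learned extrinsic (front -> camera i) with camera i's
# predicted ego-motion (t=0 -> t).
def camera_pose_at_t(ext_front_to_cam_i, ego_cam_i_t0_to_t):
    return to_transform(ext_front_to_cam_i) @ to_transform(ego_cam_i_t0_to_t)
```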

qsisi commented 9 months ago

Thanks for your prompt reply. I'm still a bit confused :)

> Here "Extrinsics" ... are provided by each sequence and each camera ...

In my understanding, "extrinsics" refers to the geometric transformation between cameras, which should be the same throughout the whole DDAD dataset, i.e. for DDAD with 6 cameras there should be just 5 4x4 matrices (cam-to-main-cam) to learn. However, self._param_dict contains 1004 extrinsic parameters, which looks like 5 (cameras) x 200 (sequences) + 5 (maybe a shift?). Does it mean each sequence of DDAD has a different extrinsics setting? So SESC learns extrinsics for every sequence of DDAD instead of a single generic set of extrinsics across the 150 DDAD training sequences?

TakayukiKanai commented 9 months ago

> does it mean each sequence of DDAD has a different extrinsics setting?

Yes. Even though they are not completely independent, the camera configurations differ from location to location where each sequence was recorded (e.g. SF, ANN, Japan, etc.).

Therefore, this study treats all extrinsics as independent (but similar) and learns each of them together with averaged parameters for each camera. So 150 sequences x 5 cameras x 6DoF + 5 cameras x 6DoF parameters are learned, and the results are demonstrated in the paper.
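For intuition, the per-camera averaged parameters could be computed along these lines (illustrative only; how the averaged parameters actually enter the training objective may differ in the codebase and paper):

```python
import torch

def average_extrinsics_per_camera(param_dict, camera_ids):
    """Given per-sequence 6DoF extrinsics stored under keys like "{seq}_{cam}"
    (an assumed key layout), return one averaged 6DoF vector per camera."""
    averaged = {}
    for cam in camera_ids:
        per_seq = [p for key, p in param_dict.items() if key.endswith(f"_{cam}")]
        averaged[cam] = torch.stack(per_seq).mean(dim=0)
    return averaged
```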


BTW, the extrinsics for the val split (000150~000199) are just defined but not used anywhere, so please ignore them.

qsisi commented 9 months ago

Thanks for your reply. My focus is extrinsics learning, and I realize I may have misunderstood the meaning of "self-calibration"...

So, in summary, "self-calibration" means the extrinsics are not directly predicted by the network; instead, they are learned along with depth and pose during training. At first I thought the network could directly predict the extrinsics given image data (such as a DDAD sequence), but it can't: the network first has to perform self-supervised training on that given sequence, and only after training are the extrinsics of that sequence obtained. Do I understand correctly?

TakayukiKanai commented 9 months ago

Yes, completely correct.

Ego-motion and depth are predicted by neural networks (so they can be obtained from RGB), but the extrinsics are just a 6D tensor assigned to each sequence. So our learned extrinsics are not generalizable to new domains from RGB input alone.
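To make that concrete, here is a tiny self-contained sketch (dummy stand-in modules, not the vidar models) of why the extrinsics do not transfer: depth and ego-motion are functions of the image, while the extrinsics are parameters keyed by sequence id, so an unseen sequence simply has no entry:

```python
import torch
import torch.nn as nn

depth_net = nn.Conv2d(3, 1, 3, padding=1)      # stand-in for the depth network
pose_net = nn.Linear(6 * 64 * 64, 6)           # stand-in for the ego-motion network
extrinsics = nn.ParameterDict({                # learned 6DoF per (sequence, camera), assumed key layout
    "000000_CAMERA_05": nn.Parameter(torch.zeros(6)),
})

rgb = torch.rand(1, 3, 64, 64)
depth = depth_net(rgb)                         # works for any input image
ego = pose_net(torch.rand(1, 6 * 64 * 64))     # works for any input image pair
ext = extrinsics["000000_CAMERA_05"]           # works: this sequence was trained on
# extrinsics["000150_CAMERA_05"]               # KeyError: unseen sequence, no calibration available
```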

qsisi commented 9 months ago

Thanks again for your prompt reply!