facebookresearch / PoseDiffusion

[ICCV 2023] PoseDiffusion: Solving Pose Estimation via Diffusion-aided Bundle Adjustment

Several questions about paper #26

Closed amokame closed 11 months ago

amokame commented 1 year ago

Thanks for your excellent work. I have several questions about the paper. (1) Diffusion model: are the images the condition, like the text prompt in a T2I diffusion model? Do you start with a random camera xt, then denoise to get x0?

(2) How do you calculate p(I|xt)? Camera poses 1 and 2 are known, so do you compute keypoint 1's position under camera 2 and take the difference between it and keypoint 2?

(3) In Fig. 3, there is a p(y|I) in the geometric guidance part. What is this? I thought you were using the gradient of p(I|x) to guide the sampling.

jytime commented 11 months ago

Hi,

(1) May I ask what T2I means? Text-to-image? To be honest, I am not an expert in text-to-image diffusion. We do start from random noise xt (noisy cameras) and gradually denoise them to clean cameras x0. The usage of images can be seen here and here.
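To make the "start from noisy cameras xt and denoise to x0" description concrete, here is a minimal sketch of a generic DDPM-style reverse loop conditioned on image features. It is not the repository's code: `make_schedule`, `sample_cameras`, the linear beta schedule, the flat-list camera encoding, and the assumption that the denoiser predicts the clean sample x0 directly are all hypothetical choices made for illustration.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and cumulative products of (1 - beta). Assumed, not from the paper."""
    span = max(num_steps - 1, 1)
    betas = [beta_start + (beta_end - beta_start) * t / span for t in range(num_steps)]
    alphas_cumprod, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alphas_cumprod.append(prod)
    return betas, alphas_cumprod


def sample_cameras(denoiser, image_features, dim, num_steps=100, rng=None):
    """Denoise random camera parameters x_T down to x_0, conditioned on image features.

    `denoiser(x_t, t, image_features)` is a hypothetical callable that returns
    an estimate of the clean cameras x_0 as a list of `dim` floats.
    """
    rng = rng or random.Random(0)
    betas, acp = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # x_T: pure noise
    for t in reversed(range(num_steps)):
        x0_hat = denoiser(x, t, image_features)  # the images act as the condition
        if t == 0:
            return x0_hat  # last step: keep the clean estimate, add no noise
        # Mean and variance of the Gaussian posterior q(x_{t-1} | x_t, x_0).
        coef_x0 = math.sqrt(acp[t - 1]) * betas[t] / (1.0 - acp[t])
        coef_xt = math.sqrt(1.0 - betas[t]) * (1.0 - acp[t - 1]) / (1.0 - acp[t])
        sigma = math.sqrt(betas[t] * (1.0 - acp[t - 1]) / (1.0 - acp[t]))
        x = [coef_x0 * x0_i + coef_xt * x_i + sigma * rng.gauss(0.0, 1.0)
             for x0_i, x_i in zip(x0_hat, x)]
    return x
```

In this sketch the images play the same role as the text prompt in a text-to-image model: they are passed to the denoiser at every step and never noised themselves. Geometric guidance, as described in the paper, would additionally shift each step using the gradient of log p(I|xt); that part is left out here.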

(2) Please review the "Sampson Epipolar Error" section in 3.3. We have camera poses 1 and 2, as well as keypoints 1 and 2 (from an off-the-shelf detector). Mathematically, they should conform to the epipolar constraint. In general, the extent to which they conform to the epipolar constraint serves as a metric for evaluating the accuracy of xt relative to I. Their distribution, p(I|xt), provides the direction (gradient) for optimizing the camera poses.

(3) Sorry, I think it is a typo. I will double-check it and update the paper if so.
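As an illustration of the epipolar check described in (2), the following is a minimal sketch of the standard Sampson epipolar error for one keypoint match. It assumes normalized image coordinates (intrinsics already removed) and the convention X2 = R X1 + t for the relative pose; the helper names are hypothetical and this is not the repository's implementation.

```python
def skew(t):
    """Cross-product matrix [t]_x of a 3-vector."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]


def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def matvec3(m, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def transpose3(m):
    return [[m[j][i] for j in range(3)] for i in range(3)]


def essential_from_pose(R, t):
    """Essential matrix E = [t]_x R for the relative pose X2 = R X1 + t."""
    return matmul3(skew(t), R)


def sampson_error(E, x1, x2):
    """First-order approximation of the squared distance to the epipolar constraint.

    x1 and x2 are matched keypoints (u, v) in normalized coordinates of
    camera 1 and camera 2. The result is zero when x2^T E x1 = 0 holds exactly.
    """
    x1h = [x1[0], x1[1], 1.0]
    x2h = [x2[0], x2[1], 1.0]
    Ex1 = matvec3(E, x1h)
    Etx2 = matvec3(transpose3(E), x2h)
    numerator = sum(a * b for a, b in zip(x2h, Ex1)) ** 2
    denominator = Ex1[0] ** 2 + Ex1[1] ** 2 + Etx2[0] ** 2 + Etx2[1] ** 2
    return numerator / denominator


# Usage: camera 2 is camera 1 shifted along x, with no rotation.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
E = essential_from_pose(identity, [1.0, 0.0, 0.0])
consistent = sampson_error(E, (0.125, 0.05), (0.375, 0.05))   # same 3D point in both views
perturbed = sampson_error(E, (0.125, 0.05), (0.375, 0.15))    # keypoint 2 moved off the epipolar line
```

A small error over many matches means the sampled poses agree with the images, so the error over all matches can serve as the quantity whose gradient steers the sampling.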

Best, Jianyuan

amokame commented 11 months ago

@jytime Thank you so much for your reply. (1) T2I stands for text-to-image. "start from random noise xt (noisy cameras) and gradually denoise them to clean cameras x0." I overthought this part; I get it now.

(2) Thank you for your explanation! I get the general picture now, and I will check the paper and the code again.

I have tried your method, and it works very well for the images I took around an object. But the results for images taken outdoors seem to be wrong; I think it relates to issue. Looking forward to trying the RealEstate10k ckpt.

sungh66 commented 10 months ago

@jytime Thank you for your amazing work! I would like to know whether the ground-truth camera pose used as input is an absolute pose or a relative pose. I want to build my own CO3D-style data, and I want to confirm this.

jytime commented 10 months ago

@sungh66 You can choose any camera pose system for training, as long as it is consistent. But it is always better to normalize the poses relative to a pivot camera, which may be viewed as using relative poses.
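For anyone preparing their own data, normalizing relative to a pivot could be sketched as follows. This assumes world-to-camera extrinsics (X_cam = R X_world + t) and uses the first camera as the pivot; the function name and the pose representation are hypothetical, and the repository's own preprocessing may differ (for example, it may also normalize scale).

```python
def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def matvec3(m, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def transpose3(m):
    return [[m[j][i] for j in range(3)] for i in range(3)]


def normalize_to_pivot(poses, pivot_index=0):
    """Re-express world-to-camera poses in the pivot camera's coordinate frame.

    `poses` is a list of (R, t) pairs with X_cam = R X_world + t. After
    normalization the pivot has R = identity and t = 0, and every other pose
    maps pivot-camera coordinates to that camera's coordinates.
    """
    R0, t0 = poses[pivot_index]
    R0_inv = transpose3(R0)  # the inverse of a rotation is its transpose
    normalized = []
    for R, t in poses:
        R_rel = matmul3(R, R0_inv)       # R_i R_0^T
        shift = matvec3(R_rel, t0)       # R_i R_0^T t_0
        t_rel = [t[k] - shift[k] for k in range(3)]
        normalized.append((R_rel, t_rel))
    return normalized


# Usage: the pivot is rotated 90 degrees about z; the second camera is unrotated.
rot_z90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
normalized = normalize_to_pivot([(rot_z90, [1.0, 2.0, 3.0]), (identity, [0.0, 0.0, 1.0])])
```

Because every pose is multiplied by the same inverse pivot transform, the relative geometry between cameras is unchanged; only the arbitrary choice of world frame is removed.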