facebookresearch / PoseDiffusion

[ICCV 2023] PoseDiffusion: Solving Pose Estimation via Diffusion-aided Bundle Adjustment

About Sampson error clamping and min_matches #34

Closed jhq1234 closed 5 months ago

jhq1234 commented 7 months ago

Thank you for sharing your code! I really appreciate it.

I have a question about the Sampson error clamping.

Before clamping, the average Sampson error was around 22844 on the apple sample dataset. However, I am wondering why a threshold of 10 was chosen: most errors are much larger than 10, so is such a small threshold okay? I am also worried about how to find a proper clamping threshold when I apply PoseDiffusion to my own custom data.

I am also curious about the criteria used to set min_matches in the code. Should I find an appropriate value myself when using it on custom data? The custom data could be images from a domain PoseDiffusion has not been trained on.

jytime commented 7 months ago

Hi Jangho,

We pick a relatively strict threshold to ensure that only the correct matches affect the gradients. The assumption here is that 2D matches from off-the-shelf tools (such as SP+SG, i.e., SuperPoint + SuperGlue) are noisy. Inaccurate matches would have a huge loss value, and hence a huge gradient (e.g., over 20000 as you mentioned), which may pull the guided sampling in a wrong direction. For custom data, as in our trials, a value of 10 usually works well.
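
For illustration, here is the clamping idea as a minimal PyTorch sketch (not the repository's exact implementation; the function name and the numerical guard are my own):

```python
import torch

def clamped_sampson_loss(F, x1, x2, clamp=10.0):
    """Sampson epipolar error, clamped so that gross outlier matches
    cannot dominate the gradients of the guided sampling step.

    F:     (3, 3) fundamental matrix
    x1:    (N, 3) homogeneous keypoints in image 1
    x2:    (N, 3) homogeneous keypoints in image 2
    clamp: errors above this value are truncated (10 per the discussion)
    """
    Fx1 = x1 @ F.T   # row i is F @ x1[i]
    Ftx2 = x2 @ F    # row i is F^T @ x2[i]
    num = (x2 * Fx1).sum(dim=1) ** 2  # (x2^T F x1)^2 per match
    den = Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2 + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2
    err = num / den.clamp(min=1e-8)
    # Truncating the loss means matches beyond the threshold contribute
    # a constant value, and therefore zero gradient.
    return err.clamp(max=clamp).mean()
```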

If you want to search for a proper clamping threshold yourself, I would recommend starting from an even stricter value (e.g., 1) and gradually increasing it toward 500. In my experience, terms with a value over 500 are not very beneficial. To avoid setting the threshold too small, another check is the final number of inlier matches: at the last step of GGS, the number of inlier matches per frame (valid_match_per_frame) should ideally be more than 50.
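
A quick diagnostic along those lines could look like this (a minimal sketch, not the repository's code; the helper name and `num_frames` argument are illustrative):

```python
import torch

def count_inliers_per_frame(sampson_err, frame_idx, num_frames, thresh=10.0):
    """Count the matches per frame whose unclamped Sampson error falls
    below the candidate threshold, evaluated at the final GGS step.

    sampson_err: (M,) unclamped Sampson errors for all matches
    frame_idx:   (M,) long tensor giving the frame each match belongs to
    """
    inlier_idx = frame_idx[sampson_err < thresh]
    # Entry k is the inlier count for frame k; ideally it stays above ~50.
    return torch.bincount(inlier_idx, minlength=num_frames)
```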

It should be fine to keep min_matches at its default. If you want to further ensure that "only valid matches would be considered", you can increase its value, but I guess this is not necessary.
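
Conceptually, min_matches just gates which image pairs participate in the guidance, along the lines of this hypothetical sketch (the data layout and default value here are illustrative, not the repo's):

```python
def filter_pairs(pair_matches, min_matches=10):
    """Keep only image pairs with enough correspondences to be reliable.

    pair_matches: dict mapping an (i, j) image-pair key to an (N, 4) array
                  of matched keypoint coordinates (hypothetical layout)
    min_matches:  pairs with fewer matches than this are skipped during GGS
    """
    return {pair: m for pair, m in pair_matches.items() if len(m) >= min_matches}
```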

Btw, we have a stronger method called VGGSfM (https://vggsfm.github.io/). It is more robust to unseen data. We are going to release its code soon.

Best, Jianyuan

jhq1234 commented 7 months ago

@jytime Thank you for the quick response!

I really appreciate the good advice you've given; that question has been resolved. I have one more question, though. When you mention that training used a V100 GPU with 162 images, does this refer to 162 distinct scenes? For example, if we call multiple images of the same object an 'image bundle', does it mean that training was conducted with 162 different image bundles?

jytime commented 7 months ago

It means 162 images in total, which is usually around 10 scenes, i.e., in your words, image bundles. You can check the `max_images` setting in the code here:

https://github.com/facebookresearch/PoseDiffusion/blob/f1360a48b5172a5622f2925ea09fbeb45ccb7b2d/pose_diffusion/util/train_util.py#L28
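
For intuition, the batching idea behind `max_images` could be sketched like this (a hypothetical helper, not the actual training code; see `train_util.py` above for the real logic):

```python
import random

def sample_iteration_batch(scenes, max_images=162):
    """Accumulate whole scenes (image bundles) into one iteration's batch
    until adding another scene would exceed the max_images budget.

    scenes: list of lists, each inner list holding one scene's image paths
    """
    batch, total = [], 0
    for scene in random.sample(scenes, len(scenes)):  # shuffled copy
        if total + len(scene) > max_images:
            break
        batch.append(scene)
        total += len(scene)
    return batch  # typically ~10 scenes totaling at most 162 images
```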

jhq1234 commented 7 months ago

Thank you for your kind response! May I ask one more question? You mentioned that the diffusion model was trained on 10 scenes. Does this mean that PoseDiffusion generalizes well from sparse datasets? It's amazing that a diffusion model trained on only 10 scenes works so well. To sample good poses with PoseDiffusion, is it unnecessary to train on large datasets, as with Stable Diffusion? I'm curious about your opinion!

jytime commented 7 months ago

Hi, I am a bit confused by the question. The model was trained on the Co3D dataset, which contains more than 30K videos (scenes). The "10 scenes" mentioned above is the number of scenes sampled for each training iteration, not the size of the training set.