facebookresearch / silk

SiLK (Simple Learned Keypoint) is a self-supervised deep learning keypoint model.
GNU General Public License v3.0

Fine-tuning SiLK for 360 equirectangular images #29

Closed 3dvisionstudent closed 1 year ago

3dvisionstudent commented 1 year ago

Hello. I am amazed by your work. Thank you for sharing it.

Q: I am developing an SfM pipeline using 360 images captured outdoors from a car driving on roads. I was using SuperPoint to extract features, but I found your great work, so I am trying to apply SiLK to my pipeline. I tested SiLK on 360 images and got good results, but I want to improve them by retraining on 360 images starting from your checkpoint. I looked over your code to understand it, but it is difficult to find where I should make changes to retrain on 360 images. What I want to do is add a new augmentation function that rotates a 360 image on the unit sphere, and remove keypoints on the sky and vehicles (I have segmentation results). Your comments would help me a lot. Thank you.

gleize commented 1 year ago

Hi @3dvisionstudent,

Sure, let me give you some pointers:

That should be enough for what you're trying to do. Don't hesitate to ask if you have further questions.

3dvisionstudent commented 1 year ago

@gleize Thank you for the great advice. I am editing the parts you mentioned, and I will share the results once I am done.

3dvisionstudent commented 1 year ago

Thank you @gleize. Thanks to your advice, I was able to complete most of the fine-tuning tasks for my custom dataset.

However, I am stuck on a problem. I want to extract keypoints only outside a masked region of the image (I have masks marking the sky and cars, and I don't want keypoints there). So, in the training pipeline, I am trying to remove those keypoints in the _init_loss_flow function via Flow.Constant(("normalized_descriptors", "logits")) with masks. I found that the "logits" and "normalized_descriptors" outputs are not in image coordinates, so I can't mask the logits and descriptors because I was unable to find the image coordinates of these outputs. Could you give me a hint about this problem?

gleize commented 1 year ago

Hi @3dvisionstudent,

In the same class where _init_loss_flow is defined, you can figure out the linear mapping between an input ("images" node) and an output ("logits" node) by running these lines.

linear_mapping = self.model.coordinate_mapping_composer.get("images", "logits")
print(linear_mapping)

If you're using the default backbone (VGG-4), this should output something like this.

x <- tensor([1., 1.]) x + tensor([-9., -9.])

This indicates that "images" and "logits" operate at the same scale (factor of 1.0) and that "logits" is shifted by 9 pixels (e.g. position (0, 0) in "images" corresponds to position (-9, -9) in "logits", which is out of bounds and thus not visible).

You can use the linear mapping to figure out how to reduce your mask's spatial dimensions to match those of the "logits" and "descriptors" (they are the same).

xy_shift = linear_mapping.reverse(0).int()

This will give you the margin you need to remove from your mask (top and left only). Once that is removed, you can crop the right and bottom margins to match the size of the logits.
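For instance, here is a minimal sketch of that cropping, assuming a (H, W) mask aligned with the "images" node and the (-9, -9) shift printed above (the function name and arguments are illustrative, not part of SiLK's API):

```python
import torch

# Illustrative sketch (not SiLK's API): crop a mask aligned with the "images"
# node so its spatial size matches the "logits" node, assuming the mapping
# printed above (scale 1.0, shift of -9 pixels on the top/left side).
def crop_mask_to_logits(mask: torch.Tensor, logits_h: int, logits_w: int) -> torch.Tensor:
    margin = 9  # from linear_mapping.reverse(0), i.e. the (-9, -9) shift
    cropped = mask[margin:, margin:]      # drop the top/left margin
    return cropped[:logits_h, :logits_w]  # then trim bottom/right to fit
```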

This should solve your problem.

3dvisionstudent commented 1 year ago

Hi @gleize. I have got it working. Thank you @gleize, I couldn't have done it without your help.

What I did is apply the masks (marking sky and cars) to the correspondence maps, both forward and backward. This is an example for the forward case.

I am now training SiLK on my custom road-view dataset. After it finishes, I will share the results. Thank you, your comments have been really helpful!

gleize commented 1 year ago

Hi @3dvisionstudent,

Glad I could help. This looks great. Looking forward to your results.

3dvisionstudent commented 1 year ago

Hi @gleize. I have tested my custom learning strategy, and my idea doesn't work. The mean matching accuracy using my custom road-view evaluator is 97.3% (versus 97.7% for the pretrained model), so there is not much of a gap between them. However, features are still extracted on the sky and cars even though I removed the GT correspondences on the sky and cars during training. The bigger problem is that after fine-tuning with this strategy, the number of features on the sky and cars increased considerably. I think the number of features increases because the backbone can't learn semantic information about the sky and car areas this way...

I will run more experiments with SiLK to apply it to my project. Thank you for teaching me about your great work.

1. The output of the pretrained model. (screenshot)

2. The output of the custom model trained with the custom learning strategy. (screenshot)

3dvisionstudent commented 1 year ago

I also ran additional experiments to reduce the descriptor size, because I need to save memory. I trained SiLK on the COCO dataset while reducing the descriptor dimension 128 -> 64 -> 32, and found that the mean matching accuracy, homography accuracy, and AUC on the HPatches evaluation do not drop significantly as the descriptor dimension shrinks. That was an amazing result. Thank you for sharing this nice work. @gleize
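For context, shrinking the descriptor dimension essentially amounts to narrowing the final projection of the descriptor head. A rough PyTorch sketch of the idea (the class and names below are hypothetical, not SiLK's actual head):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch (not SiLK's actual head): a 1x1-conv descriptor head
# whose output dimension can be reduced from 128 to 64 or 32 to save memory.
class DescriptorHead(nn.Module):
    def __init__(self, in_channels: int = 128, descriptor_dim: int = 64):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, descriptor_dim, kernel_size=1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # L2-normalize along the channel axis so descriptors are unit vectors
        return F.normalize(self.proj(features), dim=1)
```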

gleize commented 1 year ago

Thanks for sharing your results @3dvisionstudent.

I think the number of features increases because the backbone can't learn semantic information about the sky and car areas this way...

Unless it's actively learned, it seems unlikely, yes. Have you tried actively learning the keypoint score from the mask instead? (i.e. change the binary cross-entropy loss to suppress the keypoint score when positioned on the mask, and revert to the standard behavior outside the mask).
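A hedged sketch of what that loss change could look like (the function and tensor names are illustrative, not SiLK's API):

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of the suggestion above (not SiLK's API): push the
# keypoint logit down wherever the mask marks sky/cars, and keep the usual
# correct/incorrect binary cross-entropy behavior everywhere else.
def masked_keypoint_bce(logits, correct_mask, suppress_mask):
    # standard term: softplus(-x) for correct matches, softplus(+x) otherwise
    standard = torch.where(correct_mask, F.softplus(-logits), F.softplus(logits))
    # suppression term: treat masked positions as forced negatives
    return torch.where(suppress_mask, F.softplus(logits), standard).mean()
```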

The output of the custom model trained with the custom learning strategy.

This second image seems to have a lot of keypoints. Are you selecting the top-k keypoints? Or are you thresholding the keypoint scores with a constant value?
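For reference, the two selection strategies behave very differently when scores shift (a generic sketch, not SiLK's selection code):

```python
import torch

# Generic sketch of the two keypoint selection strategies (not SiLK's code):
scores = torch.rand(164 * 164)                   # dummy flattened score map
topk_idx = scores.topk(k=5000).indices           # fixed keypoint budget
thresh_idx = (scores > 0.9).nonzero().flatten()  # fixed cutoff, variable count
```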

52THANOS commented 1 year ago

Can you share the code?

3dvisionstudent commented 1 year ago

Hi @gleize. Thank you for the kind advice. I have been thinking about your suggestion, and this is my opinion.

Unless it's actively learned, it seems unlikely, yes. Have you tried actively learning the keypoint score from the mask instead? (i.e. change the binary cross-entropy loss to suppress the keypoint score when positioned on the mask, and revert to the standard behavior outside the mask).

I set the GT values outside the valid mask region to -1 in order to suppress keypoint extraction, and:

    # lib/losses/info_nce/loss.py, corr_matching_binary_cross_entropy function

    # correct matches
    correct_mask_0 = corr_0 == best_idx_0
    correct_mask_1 = corr_1 == best_idx_1

    loss_0 = correct_mask_0 * jax.nn.softplus(-logits_0) + (
        ~correct_mask_0
    ) * jax.nn.softplus(+logits_0)
    loss_1 = correct_mask_1 * jax.nn.softplus(-logits_1) + (
        ~correct_mask_1
    ) * jax.nn.softplus(+logits_1) 

This is the loss function. What I did is set the corr_0 and corr_1 GT values to -1 outside the valid mask region. The reasoning is that logits with a GT value of -1 should decrease to negative values in order to decrease the loss (softplus(x) = -log(sigmoid(-x)), with beta = 1). Since the logits are converted into probabilities to decide whether a position is a keypoint, I think this has the same effect of suppressing keypoints outside the valid mask region. Doesn't it...?
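As a quick sanity check of the softplus identity used here (a standalone PyTorch snippet; the same identity holds for jax.nn.softplus):

```python
import torch
import torch.nn.functional as F

# Numeric check of the identity used above: softplus(x) == -log(sigmoid(-x))
x = torch.linspace(-5.0, 5.0, steps=11)
assert torch.allclose(F.softplus(x), -torch.log(torch.sigmoid(-x)), atol=1e-6)
```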

This second image seems to have a lot of keypoints. Are you selecting the top-k keypoints? Or are you thresholding the keypoint scores with a constant value?

I tested what you mentioned, and I found a surprising issue. I observed the keypoints while increasing the score threshold. I expected keypoints on the sky and cars to have low scores after training with the custom strategy, but the result was contrary to my expectations: keypoints on the sky and cars have much higher scores... I don't know the exact reason; maybe there is a human error in my code, or the approach I tried is wrong.

This is the result after applying the score threshold; keypoints on the sky have high score values. (screenshot)

3dvisionstudent commented 1 year ago

Hi @52THANOS. Thank you for your interest. Sure, I can share the code. I modified the SiLKRandomHomographies class in lib/models/silk.py:

def _apply_mask_to_corr(self, corr_forward, corr_backward, warped_masks):
    # label_mask has 255 on valid regions and 0 on invalid regions (sky, cars).
    # label_mask has the same size as the cropped image.
    # label_mask is in the original coordinates, before warping by the homography.
    label_mask = warped_masks[0].squeeze(0)
    label_mask = label_mask[9:-9, 9:-9]  # remove the (9, 9) border margin
    label_mask = label_mask.expand(1, -1, -1)
    label_mask = label_mask.reshape(1, -1)

    # transformed_mask has 255 on valid regions and 0 on invalid regions (sky, cars).
    # transformed_mask has the same size as the warped image.
    # transformed_mask is in the warped coordinates, after warping by the homography.
    transformed_mask = warped_masks[1].squeeze(0)
    transformed_mask = transformed_mask[9:-9, 9:-9]
    transformed_mask = transformed_mask.expand(1, -1, -1)
    transformed_mask = transformed_mask.reshape(1, -1)

    # invalidate GT correspondences on masked (sky, car) regions
    corr_forward[label_mask == 0] = -1
    corr_backward[transformed_mask == 0] = -1

    # cross-check correspondences
    corr_forward, corr_backward = keep_mutual_correspondences_only(
        corr_forward, corr_backward
    )

    return corr_forward, corr_backward

label_mask is the original mask, transformed_mask is the mask warped by the homography, corr_forward holds the correspondences from the label to the warped coordinates, and corr_backward holds the correspondences from the warped to the label coordinates. I set the GT correspondence values to -1 outside the valid mask region, hoping my model would learn where it should extract keypoints. I call this function in _init_loss_flow, right after _get_corr.

gleize commented 1 year ago

Hi @3dvisionstudent,

It's been a while since I've had my head in that code, but I think you're missing something. best_idx_0 will also return -1 when no mutual match has been found (cf. the function asym_keep_mutual_correspondences_only). So if you set the masked part of corr_0 to -1,

correct_mask_0 = corr_0 == best_idx_0

this will consider any keypoint on your mask (i.e. corr_0[i] = -1) that has no mutual match (i.e. best_idx_0[i] = -1) to be a correct match. This explains the visual results you obtained, where the model actively detects the parts of the sky that are difficult to match (i.e. uniform areas).

Setting corr_0[mask] = -2 instead should work, since -2 can never collide with the -1 that marks unmatched keypoints.
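A toy demonstration of the collision (standalone tensors, not SiLK's data):

```python
import torch

# The masked position (index 1) collides with the "no mutual match" code -1:
corr_0 = torch.tensor([5, -1, 2])        # -1 here means: position is masked
best_idx_0 = torch.tensor([5, -1, 3])    # -1 here means: no mutual match
print(corr_0 == best_idx_0)              # tensor([ True,  True, False])
# -> the masked position is wrongly rewarded as a correct match.
# Using -2 for masked entries removes the ambiguity:
corr_0 = torch.tensor([5, -2, 2])
print(corr_0 == best_idx_0)              # tensor([ True, False, False])
```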

3dvisionstudent commented 1 year ago

Hi @gleize. Thank you for showing me another way to train SiLK on my custom dataset.

I followed your advice and set the ground truth to -2 on the mask:

corr_forward[label_mask == 0] = -2
corr_backward[transformed_mask == 0] = -2

but the problem of keypoints being extracted in the masked area didn't disappear. So I decided to train on my custom data without the mask and use that model. I will use the trained model to reconstruct a 3D map outdoors. Thank you for helping me use your great work.

Picture 1: the result from the model trained with the mask. (screenshot)