CrossmodalGroup / LAPS

Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment, CVPR, 2024

The inference about LAPS #1

Open XpracticeYSKM opened 2 months ago

XpracticeYSKM commented 2 months ago

Thanks for your awesome work!

I wonder whether patch selection and calibration are also conducted during inference?

In other words, is inference consistent with the training process, where we need to apply LAPS to the vision encoder, split the sentence into textual words, and then conduct patch-word alignment rather than image-sentence alignment?

darkpromise98 commented 2 months ago

Thanks for your interest. I am happy to answer your questions.

Question 1: Yes, patch selection and calibration are also conducted in the inference stage.

Question 2: Yes, the inference process is consistent with the training process: we need to compute the patch-word alignment rather than the image-sentence alignment. This is because LAPS is a fine-grained alignment framework, unlike previous coarse-grained work (e.g., CLIP and its follow-up work).
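
To make the distinction concrete, here is a minimal sketch of what a patch-word (fine-grained) score can look like. It is not the repository's code: the function names, the toy 2-D features, and the aggregation rule (best-matching patch per word, averaged over words) are assumptions chosen for illustration, and the actual LAPS scoring may differ. Plain Python is used for readability; a real implementation would use batched tensor operations.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def patch_word_score(patches, words):
    """Fine-grained score (illustrative assumption, not the LAPS formula):
    for each word keep its best-matching patch, then average over words."""
    best_per_word = [max(cosine(w, p) for p in patches) for w in words]
    return sum(best_per_word) / len(best_per_word)


# Toy features: two selected patches and two words, each a 2-D vector.
patches = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 0.0], [0.0, 2.0]]

# Every word has a perfectly aligned patch, so the score is 1.0.
fine_score = patch_word_score(patches, words)

# A word lying between the two patches matches neither one exactly.
partial_score = patch_word_score(patches, [[1.0, 1.0]])
```

Note that the cost of this kind of score grows with (number of patches) x (number of words) for every image-sentence pair, which is why reducing the number of patches matters for fine-grained alignment.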

Besides, you could use LAPS to compute the patch-word alignment during training and then compute only the image-sentence alignment at inference. In this way, the performance will drop slightly, but the computational efficiency will be greatly improved.
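
A minimal sketch of this cheaper inference path, under stated assumptions: patch and word features are mean-pooled into one vector per image and one per sentence, and a single cosine similarity is computed per pair. The mean pooling and the function names are hypothetical choices for illustration; the thread does not say how the global image and sentence embeddings would be obtained.

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length vectors into one global vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def image_sentence_score(patches, words):
    """Coarse-grained score: one similarity between pooled embeddings.
    Mean pooling is an assumption made for this sketch."""
    return cosine(mean_pool(patches), mean_pool(words))


# Same kind of toy features as a fine-grained score would use.
patches = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 0.0], [0.0, 2.0]]

coarse_score = image_sentence_score(patches, words)
```

The pooled embeddings can be computed once per image and once per sentence and then reused, so each image-sentence pair costs a single dot product instead of a full patch-by-word similarity matrix. The pooling also discards the local correspondences, which is consistent with the slight performance drop mentioned above.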