The inference about LAPS

CrossmodalGroup / LAPS

Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment, CVPR, 2024


Thanks for your interest. I am happy to answer your questions.

Question 1: Yes, patch selection and calibration are conducted in the inference stage.

Question 2: Yes, the inference process is consistent with the training process: we compute the patch-word alignment rather than the image-sentence alignment, because LAPS is a fine-grained alignment framework, unlike previous coarse-grained work (e.g., CLIP and its follow-up work).
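To make the distinction concrete, here is a minimal NumPy sketch of a patch-word similarity score. It is not the LAPS code: the function names and the max-then-mean aggregation are assumptions chosen for illustration, and the actual aggregation in the repository may differ.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def patch_word_similarity(patches, words):
    """Fine-grained score between one image and one sentence.

    patches: (P, D) array of patch features (hypothetical input).
    words:   (W, D) array of word features (hypothetical input).
    """
    sim = l2_normalize(patches) @ l2_normalize(words).T  # (P, W) cosine matrix
    # Assumed aggregation: each word takes its best-matching patch and
    # each patch takes its best-matching word; the two means are averaged.
    word_to_patch = sim.max(axis=0).mean()
    patch_to_word = sim.max(axis=1).mean()
    return 0.5 * (word_to_patch + patch_to_word)
```

The point of the sketch is that the score depends on the full (P, W) matrix, so it has to be recomputed for every image-sentence pair at inference.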

Alternatively, you could use LAPS to compute the patch-word alignment at the training stage and then compute the image-sentence alignment at the inference stage. In this way, the performance will drop slightly, but the computational efficiency will be greatly improved.
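A rough sketch of what image-sentence scoring at inference could look like is given below. Mean pooling and the function names are assumptions for illustration, not something the repository is known to provide. The efficiency gain comes from pooling each image and each sentence once, after which all pairs are scored with a single matrix product.

```python
import numpy as np

def pooled_embedding(tokens):
    """Mean-pool (T, D) token features into one unit-length (D,) vector.

    Mean pooling is an assumed choice; any global pooling would fit here.
    """
    v = tokens.mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-8)

def image_sentence_scores(image_patches, sentence_words):
    """Coarse-grained scores for N images and M sentences.

    image_patches:  list of N arrays, each (P_i, D).
    sentence_words: list of M arrays, each (W_j, D).
    Returns an (N, M) matrix of cosine similarities.
    """
    img = np.stack([pooled_embedding(p) for p in image_patches])   # (N, D)
    txt = np.stack([pooled_embedding(w) for w in sentence_words])  # (M, D)
    return img @ txt.T  # one matmul replaces N*M patch-word matrices
```

With this setup the image and sentence embeddings can be computed and cached independently, which is not possible when the score needs the patch-word matrix of each pair.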
