Open XpracticeYSKM opened 4 months ago
Thanks for your attention, I am happy to answer your questions.
Question 1: Yes, patch selection and calibration are also conducted at the inference stage.
Question 2: Yes, the inference process is consistent with the training process: we compute patch-word alignment rather than image-sentence alignment, because LAPS is a fine-grained alignment framework, unlike previous coarse-grained work (e.g., CLIP and its follow-up work).
Alternatively, you could use LAPS to compute patch-word alignment at the training stage and then compute image-sentence alignment at the inference stage. In that case the performance will drop slightly, but the computational efficiency will be greatly improved.
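
To make the two scoring modes concrete, here is a minimal pure-Python sketch. It is not the LAPS implementation: the function names, the max-over-patches aggregation, and the mean pooling are all assumptions chosen only to illustrate the difference between fine-grained (patch-word) and coarse-grained (image-sentence) similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def mean_pool(vectors):
    """Average a list of equal-length vectors into a single vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def patch_word_score(patches, words):
    """Fine-grained score (illustrative aggregation, an assumption):
    for each word take its best-matching patch, then average over words."""
    return sum(max(cosine(p, w) for p in patches) for w in words) / len(words)

def image_sentence_score(patches, words):
    """Coarse-grained score: pool each side to one vector, then compare."""
    return cosine(mean_pool(patches), mean_pool(words))

# Toy features: two patches and one word, in a 2-D embedding space.
patches = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 0.0]]

fine = patch_word_score(patches, words)
coarse = image_sentence_score(patches, words)
print(f"patch-word: {fine:.4f}, image-sentence: {coarse:.4f}")
```

The coarse score needs only one pooled vector per image and per sentence, so it is cheaper to compute and cache for retrieval, while the fine score compares every patch with every word. That is the efficiency trade-off described above.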
Thanks for your awesome work!
I wonder whether patch selection and calibration are also conducted during inference?
In other words, is inference consistent with the training process, where we need to apply LAPS to the vision encoder, split the sentence into textual words, and then conduct patch-word alignment rather than image-sentence alignment?