pspdada opened 1 day ago
Sure, thanks for the question! Although we set beam_size to 1 in our experiments for more efficient decoding, beam_size=1 can still search effectively for an optimal candidate token among the 2m tokens obtained from the bidirectional pairwise-contrasted logits. By using BLIP to pick, from these 2m candidates, the token that maximizes image-text alignment, we can ensure global textual fluency.
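To make the selection step above concrete, here is a minimal sketch of what greedy (beam_size=1) candidate selection could look like. The function names `blip_itm_score` and `select_next_token`, and the stubbed scoring logic, are hypothetical illustrations, not the repository's actual implementation:

```python
def blip_itm_score(image, text):
    # Placeholder for a BLIP image-text matching score.
    # Here we fake a score (preferring texts near 20 characters)
    # so the sketch runs end to end without model weights.
    return -abs(len(text) - 20)

def select_next_token(image, prefix, candidate_tokens):
    """Pick the candidate whose continuation best aligns with the image.

    With beam_size=1 this reduces to a greedy argmax over the 2m
    candidate tokens produced by the pairwise-contrasted logits.
    """
    return max(candidate_tokens,
               key=lambda tok: blip_itm_score(image, prefix + tok))

best = select_next_token(None, "a photo of a ", ["dog", "couch", "banana"])
print(best)
```

The point is that even with a single beam, the ranking happens over all 2m contrasted candidates at each step, not over a single token.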
In your paper, I found the following:
The code also confirms that it uses
beam_size = num_beams = 1
when using the halc
decoding method. Could you please explain how the BLIP model functions when this value is set to 1, and how it ensures global textual fluency?