pspdada opened 1 day ago
Sure, thanks for the question! Although we set beam_size to 1 in our experiments for more efficient decoding, beam_size=1 can still search effectively for an optimal candidate token among the 2m tokens obtained from the bidirectional pairwise-contrasted logits. By using BLIP to pick, from these 2m candidates, the token that maximizes image-text alignment, we can ensure global textual fluency.
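To make the selection step above concrete, here is a minimal sketch of what greedy (beam_size=1) candidate selection could look like. The function names `blip_itm_score` and `select_next_token`, and the stubbed scoring logic, are hypothetical illustrations, not the repository's actual implementation:

```python
def blip_itm_score(image, text):
    # Placeholder for a BLIP image-text matching score.
    # Here we fake a score (preferring texts near 20 characters)
    # so the sketch runs end to end without model weights.
    return -abs(len(text) - 20)

def select_next_token(image, prefix, candidate_tokens):
    """Pick the candidate whose continuation best aligns with the image.

    With beam_size=1 this reduces to a greedy argmax over the 2m
    candidate tokens produced by the pairwise-contrasted logits.
    """
    return max(candidate_tokens,
               key=lambda tok: blip_itm_score(image, prefix + tok))

best = select_next_token(None, "a photo of a ", ["dog", "couch", "banana"])
print(best)
```

The point is that even with a single beam, the ranking happens over all 2m contrasted candidates at each step, not over a single token.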
In your paper, I found the following:
The code also confirms that it uses
beam_size = num_beams = 1
when using the halc
decoding method. Could you please explain how the BLIP model functions when this value is set to 1, and how it ensures global textual fluency?