BillChan226 / HALC

[ICML 2024] Official implementation for "HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding"
https://billchan226.github.io/HALC
MIT License

Question about beam size #18

Open pspdada opened 1 day ago

pspdada commented 1 day ago

In your paper, I found the following:

Beam Size k: The beam size k is set to adjust the diversity and range for HALC to search for the best candidate captions. Essentially, the global visual matching score module selects the top k diverse captions from 2m · k text sequence candidates passed from the local adaptive visual grounding module. While a larger k involves a larger search space and hopefully a better generation, the runtime cost also increases linearly with respect to k. HALC adopts Bootstrapping Language-Image Pre-training (BLIP) (Li et al., 2022a) for both text and image encoding when computing their cosine similarity scores. Notably, given the global search capability of our visual matching score module, HALC seeks to preserve a more diverse set of captions within the beam buffer.

The code also confirms that beam_size = num_beams = 1 is used with the halc decoding method. Could you please explain how the BLIP model functions when this value is set to 1? And how does it ensure global textual fluency?

BillChan226 commented 21 hours ago

Sure, thanks for the question! Even though we set beam_size to 1 in our experiments for more efficient decoding, beam_size=1 is still effective in searching for an optimal candidate token among the 2m tokens obtained from the bidirectional pair-wise contrasted logits. By selecting, from those 2m candidates, the token that maximizes image-text alignment via BLIP, we can still ensure global textual fluency.
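
To illustrate the idea (not the exact HALC implementation): with beam_size = 1, the global matching step reduces to scoring each of the 2m candidate continuations against the image with BLIP and keeping the single highest-scoring one. The sketch below assumes the Hugging Face transformers BLIP interface (BlipProcessor, BlipModel.get_image_features / get_text_features) and a captioning checkpoint; the repo's actual BLIP wrapper and checkpoint may differ, and the helper name best_candidate is hypothetical.

```python
import torch
from PIL import Image
from transformers import BlipProcessor, BlipModel

# Illustrative sketch only -- not the HALC code path. Checkpoint choice is an
# assumption; any BLIP checkpoint with image/text projection heads would do.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipModel.from_pretrained("Salesforce/blip-image-captioning-base")
model.eval()


def best_candidate(image: Image.Image, candidates: list[str]) -> str:
    """Return the candidate text with the highest BLIP image-text cosine similarity.

    With beam_size = 1, this is the whole "global visual matching" step:
    pick the single best of the 2m candidate continuations.
    """
    with torch.no_grad():
        image_inputs = processor(images=image, return_tensors="pt")
        image_emb = model.get_image_features(**image_inputs)        # (1, d)

        text_inputs = processor(text=candidates, return_tensors="pt", padding=True)
        text_emb = model.get_text_features(**text_inputs)           # (2m, d)

        # Cosine similarity between the image embedding and each candidate.
        image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        scores = text_emb @ image_emb.T                             # (2m, 1)

    return candidates[scores.argmax().item()]
```

With a larger beam_size k, the same scoring would instead keep the top k of the 2m·k candidates in the beam buffer, at a runtime cost that grows roughly linearly in k, as described in the paper.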