I'm noticing that inference shows different scores depending on the batch size. Here's an example:
The output is this:
I wouldn't expect this to be the case. Is there a bug that's mixing data between batch entries?
Note that this depends on the image: https://raw.githubusercontent.com/Q-Future/Q-Align/main/fig/singapore_flyer.jpg does not show the problem, but most of the images I'm using do.

Yes, this is due to a known problem with LLM batch inference (see https://www.reddit.com/r/LocalLLaMA/comments/19dn2to/inconsistencies_in_llm_outputs_single_vs_batched/), and it seems unavoidable right now.

I ran some comparisons on a larger set of data and found that the single vs. batch scores stay very close, within 0.3% of the full 1-to-5 score range. So my sense is it's not a big deal.
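For what it's worth, here's a minimal sketch of the underlying effect, separate from Q-Align itself. It uses gpt2 from Hugging Face transformers as a stand-in model (an assumption for illustration; Q-Align's LLM backbone should behave similarly) and compares the logits the same prompt gets when run alone versus inside a right-padded batch of two.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# gpt2 is a stand-in; any causal LM shows the same effect.
tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Rate the quality of this photo on a scale of 1 to 5."
filler = "A second, much longer entry whose extra tokens force padding in the batch."

with torch.no_grad():
    # The prompt on its own (batch size 1).
    single = model(**tok(prompt, return_tensors="pt")).logits[0, -1]

    # The same prompt inside a right-padded batch of 2.
    batch = tok([prompt, filler], return_tensors="pt", padding=True)
    logits = model(**batch).logits
    last = batch["attention_mask"][0].sum() - 1  # final real (non-pad) token of the prompt
    batched = logits[0, last]

# Mathematically identical; numerically they usually differ slightly.
print("max |logit delta|:", (single - batched).abs().max().item())
```

The delta is typically small but nonzero (it can be exactly zero on some CPU backends), because batched matmuls take different floating-point reduction paths than batch-size-1 ones. Over a long generation those tiny deltas can compound and occasionally flip a token, which is consistent with scores that wobble slightly with batch size but stay within a fraction of a percent of each other.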