Q-Future / Q-Align

[ICML 2024] [IQA, IAA, VQA] All-in-one foundation model for visual scoring; can be efficiently fine-tuned on downstream datasets.
https://q-align.github.io

Batch inference does not yield consistent results #22

Closed: theicfire closed this issue 5 months ago

theicfire commented 5 months ago

I'm noticing that inference shows different scores depending on the batch size. Here's an example:

import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# Load the OneAlign scorer in fp16 on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    "q-future/one-align", trust_remote_code=True, torch_dtype=torch.float16, device_map="auto"
)

# Fetch a test image from the Q-Align repository.
image = Image.open(
    requests.get(
        "https://raw.githubusercontent.com/Q-Future/Q-Align/main/fig/structure.png",
        stream=True,
    ).raw
)
print(image.size)

# Score the same image with batch sizes 1 and 2.
for i in range(1, 3):
    score = model.score(
        [image] * i,
        task_="quality",  # task_: quality | aesthetics
        input_="image",   # input_: image | video
    )
    print("score with batch size:", i, score.tolist())

The output is this:

score with batch size: 1 [4.04296875]
score with batch size: 2 [4.046875, 4.046875]

I wouldn't expect this to be the case. Is there a bug that's mixing data between batch entries?

Note that this depends on the image: https://raw.githubusercontent.com/Q-Future/Q-Align/main/fig/singapore_flyer.jpg does not show the issue, but most images I'm using do.

theicfire commented 5 months ago

I ran some comparisons on a larger set of data and found that the single-image and batched scores stay very close, within 0.3% of the full 1-to-5 score range. So my sense is it's not a big deal.
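For reference, a minimal sketch of the kind of comparison described above, assuming `model` has already been loaded as in the earlier snippet; the `test_images/*.jpg` directory is a hypothetical local test set, not part of the repository:

# Sketch: compare single-image vs. batched scores over a set of images.
# Assumes `model` is the q-future/one-align model loaded as above; the
# image paths below are placeholders, not files from the repository.
import glob

from PIL import Image

image_paths = glob.glob("test_images/*.jpg")  # hypothetical local test set
images = [Image.open(p).convert("RGB") for p in image_paths]

# Scores with batch size 1 (one call per image).
single_scores = [model.score([img], task_="quality", input_="image").item() for img in images]

# Scores with all images in a single batch.
batched_scores = model.score(images, task_="quality", input_="image").tolist()

# Report the largest discrepancy relative to the 1-to-5 score range (width 4).
max_diff = max(abs(s - b) for s, b in zip(single_scores, batched_scores))
print(f"max single-vs-batched difference: {max_diff:.4f} ({max_diff / 4 * 100:.2f}% of range)")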

teowu commented 5 months ago

Yes, this is due to a known problem with batched LLM inference (see https://www.reddit.com/r/LocalLLaMA/comments/19dn2to/inconsistencies_in_llm_outputs_single_vs_batched/), and it seems unavoidable right now.
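If exact reproducibility matters more than throughput, one possible workaround (a sketch, not an official recommendation from the repo) is to always score images one at a time, so each result is independent of how the inputs happen to be batched:

# Sketch: avoid batch-dependent differences by scoring one image per call.
# `model` and `images` are assumed to be set up as in the snippets above.
def score_individually(model, images, task_="quality", input_="image"):
    """Score each image with batch size 1 so results do not depend on batching."""
    return [model.score([img], task_=task_, input_=input_).item() for img in images]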