dvlab-research / LLaMA-VID

Official Implementation for LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Apache License 2.0
622 stars 39 forks source link

Logic error in code: img_in_text and img_token not in sentence["value"] #50

Closed dragen1860 closed 5 months ago

dragen1860 commented 5 months ago
image

the code

            if img_token in sentence["value"]:
                img_in_text = True
            # add image token to all sentence if multimoal input
            if role == conv.roles[0] and img_token in sentence["value"] and img_token not in sentence["value"]:

img_token in sentence["value"] and img_token not in sentence["value"] has logic error. I guess maybe some typos?

yanwei-li commented 5 months ago

Hi, it's not an error. This is used to convert LLaVA-based conversation to our format. In LLaVA format only one <image> token is given at the beginning of the multi-turn conversation. That means img_in_text set to true if img_token appear in this multi-turn conversation. We attach this token img_token at each sub-conversation if img_token not exist in other turns. You can try to debug step-by-step to find the logic.

dragen1860 commented 5 months ago

but i still think the logic if will never triggered. please check it mg_token in sentence["value"] and img_token not in sentence["value"] twice. @yanwei-li

valencebond commented 5 months ago

Hi, it's not an error. This is used to convert LLaVA-based conversation to our format. In LLaVA format only one <image> token is given at the beginning of the multi-turn conversation. That means img_in_text set to true if img_token appear in this multi-turn conversation. We attach this token img_token at each sub-conversation if img_token not exist in other turns. You can try to debug step-by-step to find the logic.

Does this way improve performance?