GAIR-NLP / anole

Anole: An Open, Autoregressive, and Native Multimodal Model for Interleaved Image-Text Generation
https://huggingface.co/spaces/ethanchern/Anole

question about the image understanding #25

Open df2046df opened 1 month ago

df2046df commented 1 month ago

Does this model support multiple image inputs?

JoyBoy-Su commented 1 month ago

Hi, thanks for your interest! Anole supports multiple input images. You can do this by adjusting the structure of input.json and then following the instructions to run inference. Here's an example:

[
    {
        "type": "image",
        "content": "image1.png"
    },
    {
        "type": "image",
        "content": "image2.png"
    },
    {
        "type": "text",
        "content": "your instruction"
    }
]

Note that Anole's performance with multiple input images depends on the task, so it may perform differently across tasks.
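
For convenience, here is a minimal sketch of generating such an input.json from Python (the file name and entry structure follow the example above; see the repo's instructions for how to actually run inference on it):

import json

# Interleaved input: two images followed by a text instruction,
# mirroring the example above.
entries = [
    {"type": "image", "content": "image1.png"},
    {"type": "image", "content": "image2.png"},
    {"type": "text", "content": "your instruction"},
]

with open("input.json", "w") as f:
    json.dump(entries, f, indent=4)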

df2046df commented 1 month ago

Thank you for your reply! But I ran into a problem when inputting multiple images: when there are four or more input images, the following error occurs:

Traceback (most recent call last):
  File "/opt/data/private/code/anole/inference.py", line 133, in <module>
    main(args)
  File "/opt/data/private/code/anole/inference.py", line 107, in main
    segments = split_token_sequence(tokens, boi, eoi)
  File "/opt/data/private/code/anole/inference.py", line 32, in split_token_sequence
    batch_size, _ = tokens.shape
ValueError: not enough values to unpack (expected 2, got 1)

I printed the shape of tokens and found that it is torch.Size([0]). What could be the reason for this?
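
For context, that error is what you get when a 1-D (here, empty) tensor is unpacked as if it were 2-D; this is a standalone sketch, not the repo's code:

import torch

# An empty 1-D tensor, matching the torch.Size([0]) reported above.
tokens = torch.empty(0, dtype=torch.long)
print(tokens.shape)  # torch.Size([0])

# split_token_sequence expects a 2-D (batch, seq_len) tensor, so unpacking
# two values from a one-element shape raises the same ValueError:
batch_size, seq_len = tokens.shape  # ValueError: not enough values to unpack (expected 2, got 1)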

JoyBoy-Su commented 1 month ago

This is probably because the default Anole context length is 4096 tokens and each image takes 1026 tokens (1024 image tokens plus the boi and eoi tokens), so the model cannot work properly once there are four or more input images.
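
A quick back-of-the-envelope check of that budget, using only the numbers from the comment above (4096-token context, 1026 tokens per image):

context_length = 4096        # default Anole context length
tokens_per_image = 1024 + 2  # 1024 image tokens + boi + eoi = 1026

for n_images in range(1, 6):
    used = n_images * tokens_per_image
    print(f"{n_images} image(s): {used} tokens, {context_length - used} left for text and generation")

# 4 images already cost 4 * 1026 = 4104 tokens, exceeding the 4096-token context,
# which is why the generated token tensor ends up empty with four or more images.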

YiFang99 commented 1 month ago

Is the number of tokens per image a parameter that the user can set, or is it fixed?

JoyBoy-Su commented 1 month ago

I'm sorry, it's fixed.
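
Since the per-image cost is fixed, a simple workaround is to check input.json before running inference. Below is a hypothetical helper (the reserve_for_text_and_output budget is a rough placeholder, not a value from the repo):

import json

TOKENS_PER_IMAGE = 1026   # fixed: 1024 image tokens + boi + eoi
CONTEXT_LENGTH = 4096     # default Anole context length

def fits_in_context(input_path, reserve_for_text_and_output=1024):
    """Return True if the images in input.json leave room for the text prompt and generation.

    reserve_for_text_and_output is a rough placeholder budget, not a value from the repo.
    """
    with open(input_path) as f:
        entries = json.load(f)
    n_images = sum(1 for entry in entries if entry.get("type") == "image")
    return n_images * TOKENS_PER_IMAGE + reserve_for_text_and_output <= CONTEXT_LENGTH

print(fits_in_context("input.json"))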

df2046df commented 1 month ago

I have another question. When I use the model for batch image understanding, the output is empty. [screenshot: Snipaste_2024-07-18_17-02-18] What could be the reason for this?