beichenzbc / Long-CLIP

[ECCV 2024] official code for "Long-CLIP: Unlocking the Long-Text Capability of CLIP"
445 stars 22 forks source link

<image> tag in json file #28

Closed Quest2GM closed 1 month ago

Quest2GM commented 1 month ago

Hi

I am looking to create my own dataset to finetune this model. In the share-captioner_coco_lcs_sam_1246k_1107.json file, I noticed that there were these "image" tags at the beginning or at the end of the question? Are these image tags important? If so, how do you know when it should be placed before the prompt or after the prompt?

Like this:

{
    "id": "sa_201161",
    "image": "sam/images/sa_201161.jpg",
    "conversations": [
        {
            "from": "human",
            "value": "**<image>**\nCan you describe the main features of this image for me?"
        },
        {
            "from": "gpt",
            "value": "The image captures a scene in a desert-like setting. Dominating the frame is a tall, rusted metal structure, reminiscent of a lighthouse or a windmill. The structure is composed of a series of metal rings and poles, giving it a skeletal appearance. At the top of the structure, a red metal rooster is perched, adding a touch of color to the otherwise monochrome scene. The rooster is facing towards the right side of the image, as if gazing out into the distance.\n\nThe structure is situated on a dirt path, surrounded by a scattering of rocks and shrubs. These elements of nature contrast with the man-made structure, creating a blend of the natural and the artificial. The sky above is a clear blue, devoid of any clouds, suggesting a bright, sunny day.\n\nDespite the detailed description, the image does not contain any discernible text or specific actions being performed by objects. The relative positions of the objects remain constant, with the rooster at the top of the structure, the rocks and shrubs surrounding the path, and the blue sky above. The image is a still life, with no indication of movement or change. It's a snapshot of a moment, frozen in time."
        }
    ]
},
{
    "id": "sa_463802",
    "image": "sam/images/sa_463802.jpg",
    "conversations": [
        {
            "from": "human",
            "value": "Describe the following image.\n**<image>**"
        },
        {
            "from": "gpt",
            "value": "The image captures a lively scene in a park. Two individuals are the main focus of the image, both dressed in vibrant tiger costumes. The person on the left is pushing a stroller, while the person on the right is holding a stuffed tiger toy. They are walking on a concrete path, surrounded by other park-goers who are seated on the grass. The park is adorned with trees and a fountain, adding to the serene atmosphere. The photo is taken from a low angle, giving a unique perspective of the scene. The colors in the image are vivid, with the tiger costumes standing out against the more muted tones of the park. The image does not provide any specific information about the landmark sa_17489."
        }
    ]
},

Thanks for the great work, and would appreciate a quick response.

beichenzbc commented 1 month ago

We only use the caption (i.e. "from": "gpt" The image captures a scene in a desert-like setting ...) in training. If you want to also use the questions, you may ask the author of ShareGPT4V for further details.

Quest2GM commented 1 month ago

I was referring to the image tag in this line here:

"value": " \<image>\nCan you describe the main features of this image for me?"

Why is it this \<image> used here? sometimes it is placed before the question sometimes placed after the question

Quest2GM commented 1 month ago

Oh sorry, I misunderstood what you said. I see that you said that you don't use the question. Thanks for the quick response.