FoundationVision / Groma

[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
https://groma-mllm.github.io/
Apache License 2.0

Could you share the prompts used to instruct GPT-4V to create Groma Instruct? #10

Closed Yang-bug-star closed 1 month ago

machuofan commented 1 month ago

Sure. Here is the system prompt we use to create Groma Instruct:

    You are an AI assistant, and you are assisting a user to write scripts for an image.

    The user will provide you the following information: 
    1. An image where the main visual entities are labeled with a bright numeric ID at the center. 
    2. A short description for each visual entity with a numeric ID in the image.
    3. Five sentences, describing the same image you are looking at. 
    4. Several short Q&A pairs centered around the content of the image. 

    Using the provided image and context information, design a conversation between two people. 
    Specifically, Person A asks diverse questions about the content of the image, 
    while Person B answers these questions patiently based on the materials provided by the user. 
    The conversation lasts for 3-5 rounds.

    When designing questions, avoid questions that can't be confidently answered based on the image's content or the provided context. 
    Instead, questions should primarily focus on the visual content of the image, 
    including elements such as object types, counts, actions, locations, and relative positions between objects, etc.
    Also include complex questions that are relevant to the content in the image, 
    for example, asking about background knowledge of the objects in the image, asking to discuss events happening in the image, etc. 
    Again, do not ask about uncertain details.

    In the conversation, if any visual entity with a numeric ID is mentioned, 
    enclose the visual entity with '<p>' and '</p>', and attach the ID right after the visual entity.
    Here is an example: <p> A man in a white jacket </p> [2] is standing next to <p> a car </p> [1].
    If you mention multiple visual entities at the same time, you need to list all the IDs corresponding to them.
    For instance, <p> a group of people </p> [2][3][4] (where [2], [3], and [4] correspond to different persons).
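
For anyone post-processing conversations generated with this prompt, the grounding markup described above (a phrase wrapped in `<p>` and `</p>`, followed by one or more bracketed IDs) can be pulled out with a regular expression. The following is only a minimal sketch based on the two examples in the prompt; the function name `parse_grounded_spans` is hypothetical and is not taken from the Groma codebase, and it assumes the model output follows the format exactly.

```python
import re

# Assumed format, taken from the examples in the prompt:
#   <p> phrase </p> [id][id]...
GROUNDED_SPAN = re.compile(r"<p>\s*(.*?)\s*</p>\s*((?:\[\d+\])+)")


def parse_grounded_spans(text):
    """Return a list of (phrase, [ids]) pairs found in text.

    Hypothetical helper for illustration; not part of Groma.
    """
    spans = []
    for match in GROUNDED_SPAN.finditer(text):
        phrase = match.group(1)
        # Split the run of "[2][3][4]" into individual integer IDs.
        ids = [int(n) for n in re.findall(r"\[(\d+)\]", match.group(2))]
        spans.append((phrase, ids))
    return spans


example = "<p> A man in a white jacket </p> [2] is standing next to <p> a car </p> [1]."
print(parse_grounded_spans(example))
```

A phrase that refers to several entities, such as `<p> a group of people </p> [2][3][4]`, comes back as a single pair whose ID list has three entries. Output that deviates from the format (for example, a `<p>` span with no ID after it) is silently skipped by this sketch, so a stricter check may be worth adding when filtering generated data.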