Can you provide the training and instruction finetune data format?

For 2D data, our data format is the same as LLaVA, such as:

  {
    "id": "011e7e629fb9ae7b",
    "image": "textvqa/train_images/011e7e629fb9ae7b.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nProvide a one-sentence caption for the provided image.\nReference OCR token: LESS, IN, LESS, IN, SE, TENSE, ZERO, ALHC, ZERO, ALDL, LESS, INTENSE, $S, INTENSE, COHOL, RO, ALCOHOL, ZERO, ALCOHOL, LISTER, ERINE, ZER, LISTER, ZER, LISTERINE, ZERO, STERINE, ERO, MOUTHW, ProventoKill, MOUTHW, ZERO, MOUTH, Millions, Proven, illMillions, Contact, Germs, MOUTHWAS, ermstha, Cause, Breathon, Proven, Kill, Millions, Germs, CLEANMINT, Breath, Contact, CLEANMINT"
      },
      {
        "from": "gpt",
        "value": "Five Listerine Zero mouthwash bottles on a store shelf."
      }
    ]
  }

For 3D data part, we show some examples here:

For 3D VG data

{
    "id": 0,
    "video": "scannet/scene0000_00",
    "target": {
        "boxes": [
            [
                -3.150150775909424,
                -2.9052600860595703,
                0.9401445984840393,
                0.60125732421875,
                1.2178735733032227,
                1.9381133317947388
            ]
        ]
    },
    "conversations": [
        {
            "from": "human",
            "value": "<video>\n\"a white cabinet in the corner of the room. in the direction from the door and from the inside . it will be on the left, there is a small brown table on the left side of the cabinet and a smaller table on the right side of the cabinet\" Which object best matches the given description? Please provide its coordinates."
        },
        {
            "from": "gpt",
            "value": "Answer: <boxes>.",
            "boxes_seq": [
                [
                    0
                ]
            ]
        }
    ]
}

3D QA data

    {
        "id": 0,
        "video": "scannet/scene0000_00",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nWhat is in the right corner of room by curtains? Answer the question using one word or one phrase."
            },
            {
                "from": "gpt",
                "value": "Brown cabinet with tv sitting in it."
            }
        ]
    }

    {
        "id": 734462,
        "video": "3rscan/3rscan0915",
        "target": {
            "boxes": [
                [
                    1.727578392514063,
                    -0.03385602167273519,
                    0.9929349945574621,
                    0.5039399879590645,
                    0.7433579285874168,
                    0.5599999600648886,
                    1.579922432389775,
                    2.1913002243709272e-16,
                    2.5082981892794183e-16
                ]
            ]
        },
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe related object is located at <boxes>. What is material of this object?",
                "boxes_seq": [
                    [
                        0
                    ]
                ]
            },
            {
                "from": "gpt",
                "value": "Natural fibers such as willow branches or reeds."
            }
        ]
    }

The main difference between 3D and 2D data is that: 1) we change the key from image to video 2) for 3D coordinatest or bounding boxes information, we use boxes_seq and <boxes> for representation.

ZCMax / LLaVA-3D

Can you provide the training and instruction finetune data format? #5