ZCMax / LLaVA-3D

A Simple yet Effective Pathway to Empowering LLaVA to Understand and Interact with 3D World

Can you provide the training and instruction fine-tuning data format? #5

Open goodstudent9 opened 4 days ago

ZCMax commented 3 days ago

For 2D data, our format is the same as LLaVA's, for example:

  {
    "id": "011e7e629fb9ae7b",
    "image": "textvqa/train_images/011e7e629fb9ae7b.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nProvide a one-sentence caption for the provided image.\nReference OCR token: LESS, IN, LESS, IN, SE, TENSE, ZERO, ALHC, ZERO, ALDL, LESS, INTENSE, $S, INTENSE, COHOL, RO, ALCOHOL, ZERO, ALCOHOL, LISTER, ERINE, ZER, LISTER, ZER, LISTERINE, ZERO, STERINE, ERO, MOUTHW, ProventoKill, MOUTHW, ZERO, MOUTH, Millions, Proven, illMillions, Contact, Germs, MOUTHWAS, ermstha, Cause, Breathon, Proven, Kill, Millions, Germs, CLEANMINT, Breath, Contact, CLEANMINT"
      },
      {
        "from": "gpt",
        "value": "Five Listerine Zero mouthwash bottles on a store shelf."
      }
    ]
  }
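
If it helps, here is a minimal Python sketch for iterating over such records (the file name `llava_2d_train.json` is just a placeholder, not an actual file in the repo):

    import json

    # Placeholder file name: a JSON list of records in the format shown above.
    with open("llava_2d_train.json") as f:
        records = json.load(f)

    for rec in records:
        print(rec["id"], rec["image"])           # image path relative to the data root
        for turn in rec["conversations"]:        # alternating "human" / "gpt" turns
            print(f'  {turn["from"]}: {turn["value"][:60]}')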

For the 3D data, we show some examples here (a parsing sketch follows them):

  1. For 3D VG (visual grounding) data:

    {
        "id": 0,
        "video": "scannet/scene0000_00",
        "target": {
            "boxes": [
                [
                    -3.150150775909424,
                    -2.9052600860595703,
                    0.9401445984840393,
                    0.60125732421875,
                    1.2178735733032227,
                    1.9381133317947388
                ]
            ]
        },
        "conversations": [
            {
                "from": "human",
                "value": "<video>\n\"a white cabinet in the corner of the room. in the direction from the door and from the inside . it will be on the left, there is a small brown table on the left side of the cabinet and a smaller table on the right side of the cabinet\" Which object best matches the given description? Please provide its coordinates."
            },
            {
                "from": "gpt",
                "value": "Answer: <boxes>.",
                "boxes_seq": [
                    [
                        0
                    ]
                ]
            }
        ]
    }
  2. For 3D QA data (the second example also grounds the question on an input box):

    {
        "id": 0,
        "video": "scannet/scene0000_00",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nWhat is in the right corner of room by curtains? Answer the question using one word or one phrase."
            },
            {
                "from": "gpt",
                "value": "Brown cabinet with tv sitting in it."
            }
        ]
    }
    {
        "id": 734462,
        "video": "3rscan/3rscan0915",
        "target": {
            "boxes": [
                [
                    1.727578392514063,
                    -0.03385602167273519,
                    0.9929349945574621,
                    0.5039399879590645,
                    0.7433579285874168,
                    0.5599999600648886,
                    1.579922432389775,
                    2.1913002243709272e-16,
                    2.5082981892794183e-16
                ]
            ]
        },
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nThe related object is located at <boxes>. What is material of this object?",
                "boxes_seq": [
                    [
                        0
                    ]
                ]
            },
            {
                "from": "gpt",
                "value": "Natural fibers such as willow branches or reeds."
            }
        ]
    }
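
As a reading aid, here is a minimal sketch of how these records could be parsed. The box layouts are my inference from the examples, not something stated in the thread: the 6-value boxes look like center (cx, cy, cz) plus size (dx, dy, dz), and the 9-value boxes look like the same six values followed by three rotation angles (the near-zero trailing values, ~2e-16, are consistent with that reading). The `boxes_seq` field appears to map each `<boxes>` placeholder in a turn, in order, to indices into `target["boxes"]`:

    # Sketch only: box layouts are inferred from the examples, not documented.
    def parse_box(box):
        if len(box) == 6:                      # assumed: center + size
            cx, cy, cz, dx, dy, dz = box
            rx = ry = rz = 0.0
        elif len(box) == 9:                    # assumed: center + size + rotation angles
            cx, cy, cz, dx, dy, dz, rx, ry, rz = box
        else:
            raise ValueError(f"unexpected box length: {len(box)}")
        return {"center": (cx, cy, cz), "size": (dx, dy, dz), "rotation": (rx, ry, rz)}

    def resolve_boxes(record):
        """Yield, per <boxes> placeholder, the group of boxes it refers to."""
        all_boxes = record["target"]["boxes"]
        for turn in record["conversations"]:
            # assumed: boxes_seq[i] holds the indices for the i-th <boxes> in this turn
            for group in turn.get("boxes_seq", []):
                yield [parse_box(all_boxes[i]) for i in group]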

The main differences between the 3D and 2D data are: 1) we change the key from image to video; 2) for 3D coordinates or bounding-box information, we use boxes_seq and <boxes> for the representation.
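
To make those two differences concrete, here is an illustrative sketch (the helper name and the single-box assumption are mine, not part of the repo) that turns a 2D-style record into the 3D format:

    # Illustrative only: rename the media key, swap the placeholder token,
    # and attach the grounding boxes with a boxes_seq mapping.
    def to_3d_record(rec, scene, boxes):
        out = dict(rec)
        out.pop("image", None)
        out["video"] = scene                   # e.g. "scannet/scene0000_00"
        out["target"] = {"boxes": boxes}
        out["conversations"] = []
        for turn in rec["conversations"]:
            turn = dict(turn)
            turn["value"] = turn["value"].replace("<image>", "<video>")
            if "<boxes>" in turn["value"]:
                turn["boxes_seq"] = [[0]]      # assumed: one placeholder -> first box
            out["conversations"].append(turn)
        return out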

goodstudent9 commented 3 days ago

Thank you so much! Please keep the issue open a little longer; I will give it a shot!

Best wishes