For 2D data, our data format is the same as LLaVA's. For example:
{
"id": "011e7e629fb9ae7b",
"image": "textvqa/train_images/011e7e629fb9ae7b.jpg",
"conversations": [
{
"from": "human",
"value": "<image>\nProvide a one-sentence caption for the provided image.\nReference OCR token: LESS, IN, LESS, IN, SE, TENSE, ZERO, ALHC, ZERO, ALDL, LESS, INTENSE, $S, INTENSE, COHOL, RO, ALCOHOL, ZERO, ALCOHOL, LISTER, ERINE, ZER, LISTER, ZER, LISTERINE, ZERO, STERINE, ERO, MOUTHW, ProventoKill, MOUTHW, ZERO, MOUTH, Millions, Proven, illMillions, Contact, Germs, MOUTHWAS, ermstha, Cause, Breathon, Proven, Kill, Millions, Germs, CLEANMINT, Breath, Contact, CLEANMINT"
},
{
"from": "gpt",
"value": "Five Listerine Zero mouthwash bottles on a store shelf."
}
]
}
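As a rough illustration of the layout above, the following sketch checks that a 2D sample carries the fields shown in the example. The helper name check_llava_sample is hypothetical and not part of this repository, and the check covers only the keys visible in the example (id, image, conversations, from, value). The OCR reference text is abridged here.

```python
# Hypothetical helper (not part of the repository): it checks only the
# fields that appear in the 2D example above.
def check_llava_sample(sample: dict) -> list[str]:
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    for key in ("id", "image", "conversations"):
        if key not in sample:
            problems.append(f"missing key: {key}")
    for i, turn in enumerate(sample.get("conversations", [])):
        # The example uses "human" and "gpt" as the two speaker labels.
        if turn.get("from") not in ("human", "gpt"):
            problems.append(f"turn {i}: unexpected speaker {turn.get('from')!r}")
        if not isinstance(turn.get("value"), str):
            problems.append(f"turn {i}: 'value' should be a string")
    return problems


sample = {
    "id": "011e7e629fb9ae7b",
    "image": "textvqa/train_images/011e7e629fb9ae7b.jpg",
    "conversations": [
        {"from": "human",
         "value": "<image>\nProvide a one-sentence caption for the provided image."},
        {"from": "gpt",
         "value": "Five Listerine Zero mouthwash bottles on a store shelf."},
    ],
}
print(check_llava_sample(sample))
```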
For 3D data, we show some examples here:
For 3D visual grounding (VG) data:
{
"id": 0,
"video": "scannet/scene0000_00",
"target": {
"boxes": [
[
-3.150150775909424,
-2.9052600860595703,
0.9401445984840393,
0.60125732421875,
1.2178735733032227,
1.9381133317947388
]
]
},
"conversations": [
{
"from": "human",
"value": "<video>\n\"a white cabinet in the corner of the room. in the direction from the door and from the inside . it will be on the left, there is a small brown table on the left side of the cabinet and a smaller table on the right side of the cabinet\" Which object best matches the given description? Please provide its coordinates."
},
{
"from": "gpt",
"value": "Answer: <boxes>.",
"boxes_seq": [
[
0
]
]
}
]
}
For 3D QA data:
{
"id": 0,
"video": "scannet/scene0000_00",
"conversations": [
{
"from": "human",
"value": "<video>\nWhat is in the right corner of room by curtains? Answer the question using one word or one phrase."
},
{
"from": "gpt",
"value": "Brown cabinet with tv sitting in it."
}
]
}
{
"id": 734462,
"video": "3rscan/3rscan0915",
"target": {
"boxes": [
[
1.727578392514063,
-0.03385602167273519,
0.9929349945574621,
0.5039399879590645,
0.7433579285874168,
0.5599999600648886,
1.579922432389775,
2.1913002243709272e-16,
2.5082981892794183e-16
]
]
},
"conversations": [
{
"from": "human",
"value": "<video>\nThe related object is located at <boxes>. What is material of this object?",
"boxes_seq": [
[
0
]
]
},
{
"from": "gpt",
"value": "Natural fibers such as willow branches or reeds."
}
]
}
The main differences between 3D and 2D data are: 1) the "image" key is renamed to "video"; 2) 3D coordinates or bounding boxes are represented with boxes_seq and the <boxes> placeholder, with the box values themselves stored under target.boxes.
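To make the role of boxes_seq and <boxes> more concrete, here is a minimal sketch that substitutes each <boxes> placeholder with the coordinates it appears to point to. It rests on an assumption the examples suggest but do not state: the k-th <boxes> in a turn pairs with the k-th entry of that turn's boxes_seq, and each integer in that entry indexes into target.boxes. The function name expand_boxes is hypothetical, and the human prompt is abridged.

```python
# Hypothetical helper (not part of the repository).
def expand_boxes(sample: dict) -> list[str]:
    """Return each turn's text with <boxes> replaced by rounded coordinates."""
    boxes = sample.get("target", {}).get("boxes", [])
    rendered = []
    for turn in sample["conversations"]:
        text = turn["value"]
        # Assumption: the k-th <boxes> pairs with the k-th boxes_seq entry,
        # and every integer in that entry is an index into target.boxes.
        for indices in turn.get("boxes_seq", []):
            coords = "; ".join(
                "[" + ", ".join(f"{v:.2f}" for v in boxes[i]) + "]"
                for i in indices
            )
            text = text.replace("<boxes>", coords, 1)
        rendered.append(text)
    return rendered


sample = {
    "id": 0,
    "video": "scannet/scene0000_00",
    "target": {"boxes": [[-3.150150775909424, -2.9052600860595703,
                          0.9401445984840393, 0.60125732421875,
                          1.2178735733032227, 1.9381133317947388]]},
    "conversations": [
        {"from": "human",
         "value": "<video>\nWhich object best matches the given description? "
                  "Please provide its coordinates."},
        {"from": "gpt", "value": "Answer: <boxes>.", "boxes_seq": [[0]]},
    ],
}
print(expand_boxes(sample)[1])
```

Turns without a boxes_seq key are returned unchanged, so the same function can be run over the 3D QA samples that carry no boxes.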