Open3DA / LL3DA

[CVPR 2024] "LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning"; an interactive Large Language 3D Assistant.
https://ll3da.github.io/
MIT License

Qualitative result on scene description task #23

Open jkstyle2 opened 1 month ago

jkstyle2 commented 1 month ago

I've been conducting an evaluation test on unified_3dllm_scene_description dataset with the pretrained generalist ckpt ll3da-opt-1.3b.pth.

An example result for scene0612_00 is shown below:

"scene0612_00": {

        "pred": [
            "this room is surrounded by walls, each with a different design. there are two beds in the room, 
            one located near a door and another towards the center. there are also two nightstands in the room. 
            on one of the walls, there is a picture hanging on a hook. the room also includes a lamp and a chair. 
            additionally, there are two curtains on one of those walls. additionally, on the floor, there is another picture hanging. 
            on the opposite side, there is food standing beside a table. additionally, a wardrobe dresser is present in the room as well. 
            there is also a door and a window in the room.. the room features a nightlight on another wall."
        ],

        "gt": [
            "this room has a vent located near the center. there is a shower curtain hanging from a rod, 
            along with shower walls surrounding the shower area. a towel is hanging on a towel rack near the shower. 
            the room also contains a wall, a floor, a toilet, and a soap dish. three more towels are scattered throughout the room. 
            there is a toilet paper holder mounted on a wall. a picture is hanging on another wall. the room has a ceiling, 
            and there are two more walls. a light switch is located near the entrance, and there is a doorframe leading to the room. 
            finally, there is a bar attached to the wall and a bathtub next to it.",

            "this room is a bathroom with a vent situated at one corner. in the center of the room, 
            there is a bathtub with a shower curtain and a shower curtain rod attached to the walls. 
            the walls are positioned around the room, enclosing the space. 
            there is a door with a doorframe located at one side of the room, opposite the bathtub. 
            adjacent to the door, there is a light switch on one of the walls. 
            the ceiling is above, covering the entire room. on the opposite side of the bathtub, 
            there is a towel hanging from a bar mounted on the wall. in the corner next to the towel, there is a soap dish. 
            near the soap dish, there is a towel. 
            in another corner of the room, there is a towel and a towel holder. 
            additionally, there is a toilet in the room with a toilet paper holder attached to one side. 
            finally, there is a picture hanging on the wall, completing the decor of the room."
        ],

        "score": {
            "bleu-1": 0.6134234738542225,
            "bleu-2": 0.44446629419655276,
            "bleu-3": 0.2870047490209411,
            "bleu-4": 0.16898075047368497,
            "CiDEr": 0.1474318082229148,
            "rouge": 0.3330959164292497,
            "meteor": 0.18474151255536048
        }

    },

As shown, the pred result is quite different from the gt annotations. Is this an expected result for the generalist model, or am I missing something? I wonder if it could be much better with fine-tuning on the unified_3dllm_scene_description dataset.

I also noticed the annotations for the scene description dataset are quite different from the original 3D-LLM annotations. For instance, the annotation for the above scene0612_00 is:

["The room is a bathroom with a shower curtain, sink, paper towel dispenser, crate, toilet, soap dispenser, 
bathroom counter, toilet paper, and trash bins. The walls are present, as well as a mirror and a door. 
There is also a light switch and a bar. The room has a window."]},

Could you tell me why they are different, and how you processed each annotation for the scene description task?

ch3cook-fdu commented 1 month ago

When we were working on this paper, we filtered and gathered the v1.0 version of the 3D-LLM dataset. Since the authors of 3D-LLM have updated their annotations on ScanNet, you may want to use their updated dataset for better performance.

Since we only train on limited data, there will be hallucinations.

jkstyle2 commented 1 month ago

Thanks for the information.

Can I enhance the model for the scene captioning task by fine-tuning only on unified_3dllm_scene_description? I understand the training data is limited, but the qualitative results shown in the paper look much better.

I might also try building on the LL3DA generalist with other scene captioning datasets, such as the recently released LEO and SceneVerse.

ch3cook-fdu commented 1 month ago

Yes, training on the latter datasets might help.

jkstyle2 commented 1 month ago

Can you tell me which model checkpoint the above qualitative results are from? Was it fine-tuned on the unified_3dllm_scene_description dataset?

I'm trying to fine-tune the model, but there is no script for it. I referred to tuning.scanqa.sh and modified it as below. Is this correct?

python main.py \
    --use_color --use_normal \
    --detector detector_Vote2Cap_DETR \
    --captioner ll3da \
    --pretrained_weights ./ckpts/opt-1.3b/ll3da-generalist/checkpoint.pth \
    --warm_lr_epochs 0 \
    --dataset unified_3dllm_scene_description \
    --vocab facebook/opt-1.3b \
    --qformer_vocab bert-base-embedding \
    --checkpoint_dir ./ckpts/opt-1.3b/ll3da-scene-description-tuned \
    --max_epoch 24 \
    --dist_url tcp://localhost:222 \
    --eval_every_iteration 4000 \
    --start_eval_after -1 \
    --save_every 10000 \
    --criterion 'CiDEr' \
    --freeze_detector --freeze_llm \
    --batchsize_per_gpu 8 --ngpus 2 --base_lr 1e-6 --final_lr 1e-6 \
    --max_des_len 512 \
    --max_prompt 1 --use_beam_search

jkstyle2 commented 1 month ago

@ch3cook-fdu I think the overall annotation quality of 3dllm-scene-description is not very good, and it may be one of the main reasons for the performance degradation. The annotations already contain various hallucinations. For instance, here is the result for scene0001_00:

"scene0001_00": {
        "pred": [
            "this room has a floor located in the central area. there is a whiteboard mounted on one side of the room. 
    on the other side of the floor, there are tables, a lamp, a monitor, and an office chair. in addition, there are two cabinets, 
    an office chair, and a lamp. in addition to the furniture, there is a window, a whiteboard, and a ceiling in the room."
        ],
        "gt": [
            "this room is a cozy space with a bed placed at the center. on one side of the bed, there is a nightstand, 
    and on the other side, there is a stool. adjacent to the stool, there is a cabinet for storing items. above the bed,
    there is a picture hanging on the wall. there is also a sign and a curtain in the room. a clothing item is placed on 
    a ledge next to the bed. the floor is carpeted and there is a purse on it. on the other side of the room, there is 
    a lamp and another sign. the room is illuminated by multiple pillows placed on the bed. there is a cart and an organizer 
    in the room as well. the door to exit the room is located next to the cart. there is another set of clothing items and a storage bin near the wall.",

            "in this room, there is a curtain hanging from the ceiling. next to the curtain is a nightstand. 
    there is also a stool nearby. against one wall, there is a cabinet and a bed. on another wall, there is a sign. 
    there is some clothing on a ledge. on the floor, there is a purse. throughout the room, there are various 
    pillows and a cart. on one wall, there is a lamp and more clothing. on another wall, there is an organizer. 
    near the entrance, there is a door. finally, there are storage bins scattered around the room."
        ],

Could you provide the code you used for parsing the annotations from 3D-LLM scene description v1.0? I want to try the updated v2.0 annotations with LL3DA.

jkstyle2 commented 1 month ago

I found it here. Thanks for your work!

jkstyle2 commented 1 month ago

@ch3cook-fdu I first trained the generalist model, and then fine-tuned it only on unified_3dllm_scene_description. The evaluation result from the generalist on the scene_description task is:

[BLEU-1] Mean: 0.4812, Max: 0.6358, Min: 0.0532
[BLEU-2] Mean: 0.3096, Max: 0.4618, Min: 0.0340
[BLEU-3] Mean: 0.1895, Max: 0.3342, Min: 0.0220
[BLEU-4] Mean: 0.1125, Max: 0.2350, Min: 0.0000
[CIDEr] Mean: 0.0149, Max: 0.2575, Min: 0.0000
[ROUGE-L] Mean: 0.2683, Max: 0.3709, Min: 0.1685
[METEOR] Mean: 0.1640, Max: 0.2576, Min: 0.0680

The evaluation result after fine-tuning on the scene_description task is:

[BLEU-1] Mean: 0.3740, Max: 0.6601, Min: 0.0046
[BLEU-2] Mean: 0.2268, Max: 0.4111, Min: 0.0026
[BLEU-3] Mean: 0.1330, Max: 0.3123, Min: 0.0000
[BLEU-4] Mean: 0.0759, Max: 0.2412, Min: 0.0000
[CIDEr] Mean: 0.0298, Max: 0.5861, Min: 0.0000
[ROUGE-L] Mean: 0.2426, Max: 0.4206, Min: 0.1162
[METEOR] Mean: 0.1498, Max: 0.2973, Min: 0.0445

I set "CiDEr" as criterion while fine-tuning, so CIDEr gets slightly improved. But the others got much lower. Is this normal or am I doing something wrong? Can you please give me an advice how to improve the performance for scene description task? I'm thinking of modifying opt1.3b to opt6.7b and train it from the scratch. Do you think it helpful?

ch3cook-fdu commented 1 month ago

This is normal. Please dig into the definitions of these metrics (a rough scoring sketch follows the list):

  1. BLEU-4 measures the precision of 4-grams.
  2. CIDEr measures TF-IDF-weighted n-gram similarity.
  3. ROUGE-L measures the longest common subsequence.
  4. METEOR is a more complex metric that also takes synonyms into consideration.
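For reference, a minimal sketch of how these scores can be computed with the pycocoevalcap package (a common scorer in captioning codebases; whether LL3DA's evaluation calls it exactly this way is an assumption, and the toy prediction/reference strings below are made up):

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.meteor.meteor import Meteor

# toy example: one scene, one prediction vs. two references (dicts map id -> list of strings)
preds = {"scene0612_00": ["this room has two beds and two nightstands ."]}
gts = {"scene0612_00": [
    "this room has a vent located near the center .",
    "this room is a bathroom with a vent situated at one corner .",
]}

scorers = [
    (Bleu(4), "BLEU-1..4"),   # n-gram precision; returns a list of 4 scores
    (Cider(), "CIDEr"),       # TF-IDF-weighted n-gram similarity
    (Rouge(), "ROUGE-L"),     # longest common subsequence
    (Meteor(), "METEOR"),     # alignment-based, with synonym matching
]
for scorer, name in scorers:
    score, _ = scorer.compute_score(gts, preds)
    print(name, score)
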
jkstyle2 commented 1 month ago

Which metric would you recommend improving for the scene description task? I want to improve accuracy on a specific task, namely scene description. How would you recommend training your model? For instance, I would first train the generalist model, then fine-tune on the scene description data. I might also collect more data relevant to scene description and change the LLM backbone to opt-6.7b.

Can you suggest some other approaches for it?

ch3cook-fdu commented 1 month ago

  1. I think none of the traditional metrics (BLEU, CIDEr, METEOR, BERTScore, ROUGE-L) are good enough. They cannot efficiently align with human preference.
  2. And yes, with proper pre-training and higher-quality data collection, you should expect better results, while larger backbones might not always lead to better results.

jkstyle2 commented 1 month ago

Thanks for your advice! I'm looking through the datasets built from 3D-LLM v1, v2, and v3. I found that the annotation quality goes v1 << v2 < v3, so I'd like to train the model on each dataset and compare their performance. Can you tell me how you preprocessed (split and filtered) each dataset for 3d_llm_embodied_dialougue_filtered_train/val.json, 3d_llm_embodied_planning_filtered_train/val.json, 3d_llm_embodied_question_answer_filtered_train/val.json, and 3d_llm_scene_description_train/val.json? Does each dataset share the same file list for the train/val split?

ch3cook-fdu commented 1 month ago

We split the task data manually, and treat scene IDs larger than 600 as the validation set.
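A minimal sketch of that split rule, assuming ScanNet-style scene IDs (e.g. scene0612_00) and a flat JSON list with a "scene_id" field; the file names, the field name, and whether scene 600 itself lands in train or val are assumptions, not the repository's actual preprocessing code:

import json

def scene_number(scene_id):
    # "scene0612_00" -> 612
    return int(scene_id[len("scene"):].split("_")[0])

# hypothetical input file; the real 3D-LLM annotation layout may differ
with open("3d_llm_scene_description.json") as f:
    annotations = json.load(f)

# scenes numbered below 600 go to train, the rest to val (boundary inclusive/exclusive is an assumption)
train = [a for a in annotations if scene_number(a["scene_id"]) < 600]
val = [a for a in annotations if scene_number(a["scene_id"]) >= 600]

with open("3d_llm_scene_description_train.json", "w") as f:
    json.dump(train, f)
with open("3d_llm_scene_description_val.json", "w") as f:
    json.dump(val, f)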

jkstyle2 commented 1 month ago

Thanks for the information. I've created 4 different types of annotation files: 3d_llm_embodied_dialougue_filtered_train/val.json, 3d_llm_embodied_planning_filtered_train/val.json, 3d_llm_embodied_question_answer_filtered_train/val.json, and 3d_llm_scene_description_train/val.json.

However, in the default script train.generalist.sh, there is no argument for the dataset 3d_llm_embodied_question_answer_filtered_train/val.json. Is this intentional because it is similar to the ScanQA annotations?