ZCMax / LLaVA-3D

A Simple yet Effective Pathway to Empowering LLaVA to Understand and Interact with 3D World

JSON format for alignment phase #16

Open xjj1999 opened 2 weeks ago

xjj1999 commented 2 weeks ago

Hello, could you provide an example of the training data format for the alignment phase? I have collected the scene captions from SceneVerse, along with the associated camera parameters and images. Thank you again for doing such a great job!

ZCMax commented 2 weeks ago

Of course, here is a sample of the scene caption data:

    {
        "id": 0,
        "video": "frames/scannet/scene0442_00",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nDescribe the room concisely."
            },
            {
                "from": "gpt",
                "value": "In the opulent living room, adorned with four chairs, four tables, and five armchairs, a symphony of elegance unfolds. The chairs, positioned in front of the tables, create an inviting space for conversation and relaxation. The tables, in turn, stand proudly behind the chairs, offering a surface for books, drinks, or cherished mementos. The armchairs, scattered throughout the room, beckon weary souls to sink into their plush embrace. This living room exudes comfort and sophistication, a sanctuary for both solitary contemplation and convivial gatherings."
            }
        ]
    }
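In case it helps, here is a minimal sketch of how collected captions could be converted into this layout; the input field names scan_id and caption are placeholders for however the SceneVerse captions were stored, not names from this repo:

    # Sketch: turn (scan_id, caption) records into the alignment JSON above.
    # Field names "scan_id" and "caption" are assumed placeholders.
    import json

    def build_alignment_samples(caption_records, frame_root="frames/scannet"):
        samples = []
        for idx, rec in enumerate(caption_records):
            samples.append({
                "id": idx,
                "video": f"{frame_root}/{rec['scan_id']}",
                "conversations": [
                    {"from": "human", "value": "<video>\nDescribe the room concisely."},
                    {"from": "gpt", "value": rec["caption"]},
                ],
            })
        return samples

    records = [{"scan_id": "scene0442_00", "caption": "In the opulent living room, ..."}]
    with open("scene_caption_align.json", "w") as f:
        json.dump(build_alignment_samples(records), f, indent=4)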
xjj1999 commented 2 weeks ago

Thanks! Now I've managed to run through the training script. One more detail to confirm: do the 3D scenes for the first training stage come only from 3RScan, ScanNet, and Matterport3D, and not from datasets such as ARKitScenes? I ask because the provided camera parameter file only contains these three types of 3D data.

ZCMax commented 2 weeks ago

Yes, we did not use the ARKitScenes dataset during the training stage.

xjj1999 commented 2 weeks ago

Hi, I also noticed that the Matterport3D data in the camera parameter JSON is divided into folders by region. How should I process the raw Matterport3D data to match that structure?

xjj1999 commented 2 weeks ago

Hi, I printed the names of the trainable parameters in the pre-training phase and found that only the mm_projector is involved in training. The paper states: “We freeze the vision encoder and LLM parameters, and only train the projection layer and 3D position embedding layer, encouraging efficient alignment between 3D patch features and text space.” Does this mean the video tower should also be involved in training?
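(For reference, the check described above is just a standard PyTorch loop over named parameters; "model" here stands for the loaded LLaVA-3D model, whatever it is called in the training script:)

    # List the parameters that will actually receive gradients during
    # pre-training; "model" is assumed to be the already-built model object.
    def print_trainable_params(model):
        total, trainable = 0, 0
        for name, param in model.named_parameters():
            total += param.numel()
            if param.requires_grad:
                trainable += param.numel()
                print(name)
        print(f"trainable params: {trainable} / {total}")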

ZCMax commented 2 weeks ago

Actually, yes, I think it's a bug in the current code on GitHub. Thanks for the reminder~ We will update the code and release more data and documentation after the CVPR deadline~

xjj1999 commented 2 weeks ago

tune_video_tower is set to False by default; changing it to True should fix the problem. Thanks for your reply, I'm looking forward to reproducing your excellent work on the full dataset!
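(Roughly, what flipping that flag amounts to is something like the sketch below; the accessors get_model(), mm_projector, and get_video_tower() are assumptions based on LLaVA-style codebases, so the exact names in this repo may differ:)

    # Rough sketch of the intended pre-training setup: everything frozen
    # except the projector and, when tune_video_tower=True, the video tower
    # (assumed to contain the 3D position embedding layer).
    # Module/accessor names below are assumptions, not verified repo APIs.
    def configure_trainable_modules(model, tune_video_tower=True):
        for param in model.parameters():
            param.requires_grad = False            # freeze vision encoder + LLM
        for param in model.get_model().mm_projector.parameters():
            param.requires_grad = True             # train the projection layer
        if tune_video_tower:
            for param in model.get_model().get_video_tower().parameters():
                param.requires_grad = True         # train the 3D position embedding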

xjj1999 commented 1 week ago

Hi, I have completed the alignment training stage. The resulting model shows some scene captioning capability on OpenScan data. I'd like to try the second training phase. May I ask whether there is an expected timeline for the 3D VG (visual grounding) module?

ZCMax commented 1 week ago

Hello, we recently updated the grounding module to a simpler architecture with higher performance for the CVPR submission. We'll release the related code after the CVPR supplementary deadline. Stay tuned!