Of course, here is a sample for scene caption data:
```json
{
    "id": 0,
    "video": "frames/scannet/scene0442_00",
    "conversations": [
        {
            "from": "human",
            "value": "<video>\nDescribe the room concisely."
        },
        {
            "from": "gpt",
            "value": "In the opulent living room, adorned with four chairs, four tables, and five armchairs, a symphony of elegance unfolds. The chairs, positioned in front of the tables, create an inviting space for conversation and relaxation. The tables, in turn, stand proudly behind the chairs, offering a surface for books, drinks, or cherished mementos. The armchairs, scattered throughout the room, beckon weary souls to sink into their plush embrace. This living room exudes comfort and sophistication, a sanctuary for both solitary contemplation and convivial gatherings."
        }
    ]
}
```
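In case it helps, here is a minimal sketch of how a record in this format could be read and paired with its frame directory. The function name, file path handling, and the flattened output fields are assumptions for illustration, not the repo's actual data loader.

```python
import json
import os

# Minimal sketch of reading scene-caption records in the format above.
# The JSON file name and the data_root layout are assumptions, not the
# repository's actual paths.
def load_scene_caption_records(json_path, data_root):
    with open(json_path, "r") as f:
        records = json.load(f)
    samples = []
    for rec in records:
        # e.g. <data_root>/frames/scannet/scene0442_00
        frame_dir = os.path.join(data_root, rec["video"])
        human_turns = [t["value"] for t in rec["conversations"] if t["from"] == "human"]
        gpt_turns = [t["value"] for t in rec["conversations"] if t["from"] == "gpt"]
        samples.append({
            "id": rec["id"],
            "frame_dir": frame_dir,
            "prompt": human_turns[0],
            "caption": gpt_turns[0],
        })
    return samples
```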
Thanks! Now I've managed to run through the training script. One more detail to confirm: does the 3D scene data for the first training stage contain only 3RScan, ScanNet, and Matterport3D, and not scenes such as ARKitScenes? I ask because I noticed that the provided camera parameter file only covers these three datasets.
Yes, we did not use the ARKitScenes dataset during the training stage.
Hi, I also noticed that the Matterport3D entries in the camera parameter JSON are divided into folders by region. What should I do with the raw Matterport3D data?
Hi, I printed the names of the parameters that are trainable in the pre-training phase and found that only mm_project is involved in training. The paper states, "We freeze the vision encoder and LLM parameters, and only train the projection layer and 3D position embedding layer, encouraging efficient alignment between 3D patch features and text space." Does this mean that the video tower should also be involved in training?
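(For reference, I checked this with a plain PyTorch loop like the one below; this is my own snippet, not code from this repo.)

```python
# Generic PyTorch sketch for listing trainable parameters; not repo code.
def print_trainable_parameters(model):
    trainable, total = 0, 0
    for name, param in model.named_parameters():
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
            print(name)  # with everything else frozen, only mm_project.* appeared here
    print(f"trainable: {trainable} / {total} ({100 * trainable / total:.2f}%)")
```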
Actually yes, I think it's a bug in the current code on GitHub. Thanks for the reminder~ We will update our code and release more data and documentation after the CVPR deadline~
Tune_video_tower is set to False by default; changing it to True should fix the problem.

Thanks for your reply, I'm looking forward to reproducing your excellent work on the full dataset!
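For anyone hitting the same issue, here is a rough PyTorch-style sketch of what toggling such a flag amounts to. The attribute names (video_tower, mm_projector) and the exact split between the vision encoder and the 3D position embedding are assumptions, not necessarily how this repository organizes its modules.

```python
# Hypothetical sketch of how a tune_video_tower flag might control which
# modules train during the alignment stage; module names are assumptions.
def set_trainable(model, tune_video_tower: bool = True):
    # Freeze everything first (LLM and vision encoder stay frozen).
    for param in model.parameters():
        param.requires_grad = False
    # The projection layer is always trained in this stage.
    for param in model.mm_projector.parameters():
        param.requires_grad = True
    # Unfreezing the video tower is what the default False setting skipped.
    if tune_video_tower:
        for param in model.video_tower.parameters():
            param.requires_grad = True
```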
Hi, I have completed the alignment training stage. The resulting model shows some scene caption capability on openscan data. I'd like to try the second training phase. May I ask if there is any expected timeline for the 3D VG module?
Hello, we recently updated the grounding module with a simpler architecture and higher performance for the CVPR submission. We'll release the related code after the CVPR supplementary deadline. Stay tuned!
Hello, could you provide a training-format example for the alignment phase? I have collected the scene captions from SceneVerse, the associated camera parameters, and the images. Thank you again for doing such a great job!
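In the meantime, a rough sketch of how collected captions could be packed into the same JSON structure as the sample at the top of this thread. The input field names ("scene_id", "caption") describe a hypothetical intermediate file of my own, not SceneVerse's official schema, and the frames_prefix is an assumption.

```python
import json

# Hypothetical converter from collected scene captions to the conversation
# format shown at the top of this thread; field names are assumptions.
def to_alignment_format(caption_items, frames_prefix="frames/scannet"):
    records = []
    for idx, item in enumerate(caption_items):
        records.append({
            "id": idx,
            "video": f"{frames_prefix}/{item['scene_id']}",
            "conversations": [
                {"from": "human", "value": "<video>\nDescribe the room concisely."},
                {"from": "gpt", "value": item["caption"]},
            ],
        })
    return records

# Example usage:
# with open("collected_captions.json") as f:
#     records = to_alignment_format(json.load(f))
# with open("scene_caption_align.json", "w") as f:
#     json.dump(records, f, indent=2)
```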