Chat-3D / Chat-Scene

A multi-modal large language model for 3D scene understanding, excelling in tasks such as 3D grounding, captioning, and question answering.

Training for downstream tasks and Identifier-rich Scene dataset #27

Closed · jkstyle2 closed this 4 months ago

jkstyle2 commented 5 months ago

Hello! The paper says that Chat-3D-v2 can handle various downstream tasks such as 3D QA, 3D visual grounding, 3D dense captioning, and 3D scene captioning. Does this mean the model handles every task with a single set of weights, or should we train the model for each task and save the weights separately?

Also, the paper introduces an identifier-rich scene captioning dataset. Is it publicly available, or do we need to create the dataset ourselves?

Thanks in advance for your reply :)

ZzZZCHS commented 5 months ago

With the current code, the model has to be fine-tuned on each downstream task separately. I'm working on a refined version that handles every task with a single set of weights. Hopefully it will be released within the next week.

We previously released the "identifier-rich scene captioning dataset" on Google Drive. However, we have abandoned it in recent experiments since its quality is not very good. I'm working on another project that generates high-quality grounded scene captions; I expect it to be released within a month.

jkstyle2 commented 4 months ago

Hello, I see the code has just been updated for v2.1. Does it handle every task with a single set of weights, as you mentioned above? It seems the pretrained checkpoint has not been updated yet, although the code, annotations, and LLM weights have been. Will you update it soon?

Regarding the quality of the "identifier-rich scene captioning dataset", do you mean that the annotations created by GPT-4 were poorer than expected? I wonder how you manage to generate high-quality grounded scene captions.

ZzZZCHS commented 4 months ago

Yes, it can now handle every task with a single set of weights. I uploaded the pretrained checkpoint a few hours ago but forgot to update the link in the inference guide... You can check it now.

Simply feeding object information (positions / captions) to GPT-4 doesn't produce good results. I think it's hard for GPT-4 to understand the complexity of the spatial relationships between 3D objects, so the generated captions usually contain wrong relationships that would harm the model's spatial understanding. To generate higher-quality captions, we constrain GPT-4 to only generate simple relations (like "on top of"), while using rule-based methods to generate more complex relations such as "in between". Also, instead of generating captions for a whole scene, we focus on specific regions that contain a moderate number of objects. In this way, we are able to control the quality of the generated captions.
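For illustration, a rule like "in between" can be checked directly from object bounding-box centers. A minimal sketch, assuming axis-aligned boxes; the function name and thresholds are made up for illustration and are not the actual pipeline code:

```python
import numpy as np

def is_between(center_a, center_b, center_c, max_offset=0.3):
    """Rule-based check: is object A roughly between objects B and C?

    center_* are (x, y, z) bounding-box centers; max_offset is the allowed
    perpendicular distance (in meters) from the B-C segment. Illustrative only.
    """
    a, b, c = map(np.asarray, (center_a, center_b, center_c))
    bc = c - b
    seg_len = np.linalg.norm(bc)
    if seg_len < 1e-6:
        return False
    # Project A onto the B->C segment and require it to fall strictly inside.
    t = np.dot(a - b, bc) / (seg_len ** 2)
    if not (0.1 < t < 0.9):
        return False
    # A must also lie close to the segment itself.
    closest = b + t * bc
    return np.linalg.norm(a - closest) < max_offset
```

Such checks produce relations that are deterministic and verifiable, unlike free-form GPT-4 output, which is why rule-based generation is used for the harder spatial relations.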

jkstyle2 commented 4 months ago

In terms of generating annotations for a whole scene, what do you think of the method introduced in LEO: An Embodied Generalist Agent in 3D World? They leverage scene-graph-based annotations and prompt LLMs to produce a total of ~20K captions. I'm not certain of the quality, but it seems reasonable to some extent.

ZzZZCHS commented 4 months ago

I think generating captions from scene graphs is promising, but given the limited number of annotated scene graphs, it's hard to scale up the dataset. SceneVerse proposed a pipeline that first constructs scene graphs automatically and then generates scene captions from them, producing a large number of scene-text pairs. I think these datasets can be used to further enhance the model's captioning ability.
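As a toy illustration of the idea, scene-graph triples can be turned into caption sentences with simple templates before (or instead of) LLM rewriting. The graph format and templates below are hypothetical, not SceneVerse's actual pipeline:

```python
# Hypothetical scene graph: (subject instance, relation, object instance) triples.
scene_graph = [
    ("chair_03", "next to", "table_01"),
    ("lamp_02", "on top of", "table_01"),
    ("sofa_01", "facing", "tv_04"),
]

def triple_to_sentence(subj, relation, obj):
    # Strip instance suffixes ("chair_03" -> "chair") for readable text.
    subj_name = subj.rsplit("_", 1)[0]
    obj_name = obj.rsplit("_", 1)[0]
    return f"The {subj_name} is {relation} the {obj_name}."

caption = " ".join(triple_to_sentence(*t) for t in scene_graph)
print(caption)
# The chair is next to the table. The lamp is on top of the table. The sofa is facing the tv.
```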

However, neither LEO nor SceneVerse considers the model's grounding ability. We think the trend is to generate "grounded captions", in which each object mentioned in the caption is labeled with a bbox or ID, so that the model can learn object-caption correspondences from them.
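For illustration, a grounded caption record could look something like this; the tagging format and field names are only a hypothetical example, not a fixed format from either work:

```python
# Hypothetical "grounded caption" record: each object phrase carries an instance ID,
# and every ID maps to a label and a 3D bbox (center + size), keeping text and objects aligned.
grounded_caption = {
    "caption": "A [lamp <OBJ02>] stands on the [table <OBJ01>] near the [sofa <OBJ05>].",
    "objects": {
        "OBJ01": {"label": "table", "bbox": [1.2, 0.4, 0.5, 0.9, 0.9, 0.7]},
        "OBJ02": {"label": "lamp",  "bbox": [1.2, 0.4, 1.1, 0.2, 0.2, 0.5]},
        "OBJ05": {"label": "sofa",  "bbox": [2.5, 1.0, 0.4, 2.0, 0.9, 0.8]},
    },
}
```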