ZzZZCHS / Chat-Scene

Code for "Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers" (NeurIPS 2024)
MIT License

About stage 3 #20

Closed: Chunmian-art closed this issue 7 months ago

Chunmian-art commented 8 months ago

In the paper, the authors say the model is fine-tuned end-to-end in stage 3.

But the code only provides up to stage 2, and in stage 2 the LLaMA is not activated for fine-tuning.

Could you provide instructions on how to run stage 3 with the code?

ZzZZCHS commented 8 months ago

The "stage 3" in code was inherited from Chat-3D v1, which was used for instruction tuning on multi-turn conversations. In Chat-3D v2, we did not conduct experiments on multi-turn conversations, so you can just ignore the "stage 3" code.

For fine-tuning with LLaMA activated, we simply unfreeze the last few transformer layers for tuning; you can refer to this code snippet. You can try the difference between activating and deactivating LLaMA. It might be better to fine-tune it with LoRA. In my experience, the performance has not improved with LLaMA activated so far. I think this is because the amount of paired 3D-scene data is far from enough to achieve good alignment before tuning LLaMA.
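
For readers looking for a starting point, here is a minimal sketch of this kind of selective unfreezing, assuming a Hugging Face `LlamaForCausalLM` (the checkpoint path and the number of unfrozen layers below are illustrative, not the repo's exact settings):

```python
from transformers import LlamaForCausalLM

# Hypothetical checkpoint path; substitute your own LLaMA/Vicuna weights.
llama = LlamaForCausalLM.from_pretrained("path/to/llama-checkpoint")

# Freeze all LLaMA parameters first.
for param in llama.parameters():
    param.requires_grad = False

# Unfreeze only the last N decoder layers (N = 2 here is illustrative).
num_unfrozen = 2
for layer in llama.model.layers[-num_unfrozen:]:
    for param in layer.parameters():
        param.requires_grad = True

trainable = sum(p.numel() for p in llama.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable}")
```

And if you want to try the LoRA route instead, a sketch using the `peft` library (the configuration values are illustrative):

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # LLaMA attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
llama = get_peft_model(llama, lora_config)
llama.print_trainable_parameters()  # only the LoRA adapter weights remain trainable
```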