Chunmian-art closed this issue 7 months ago
The "stage 3" in the code was inherited from Chat-3D v1, where it was used for instruction tuning on multi-turn conversations. In Chat-3D v2 we did not run experiments on multi-turn conversations, so you can simply ignore the "stage 3" code.
For finetuning with LLaMA activated, we simply unfreeze the last few transformer layers for tuning; see this code snippet. You can compare the results with LLaMA activated versus deactivated. It may be better to finetune it with LoRA. In my experience, performance does not improve with LLaMA activated so far; I think this is because the amount of paired 3D-scene data is far from enough to achieve good alignment before tuning LLaMA.
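The "unfreeze the last few transformer layers" idea can be sketched as below. This is not the repository's actual snippet — it is a minimal illustration that assumes a Hugging Face-style LLaMA layout where the transformer blocks live in `model.model.layers`; the `ToyLlama` class here is a hypothetical stand-in so the sketch runs without downloading weights.

```python
import torch.nn as nn


class ToyLlama(nn.Module):
    """Hypothetical stand-in mimicking the HF LLaMA layout:
    transformer blocks live in `self.model.layers` (a ModuleList)."""

    def __init__(self, num_blocks: int = 4, dim: int = 8):
        super().__init__()
        inner = nn.Module()
        inner.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_blocks))
        self.model = inner
        self.lm_head = nn.Linear(dim, dim)


def activate_last_layers(model: nn.Module, num_layers: int = 2) -> None:
    """Freeze all parameters, then re-enable gradients only for the
    last `num_layers` transformer blocks (the 'LLaMA activated' setting)."""
    for p in model.parameters():
        p.requires_grad = False
    for block in model.model.layers[-num_layers:]:
        for p in block.parameters():
            p.requires_grad = True


model = ToyLlama(num_blocks=4)
activate_last_layers(model, num_layers=2)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)
```

With a real checkpoint you would pass the loaded `LlamaForCausalLM` instead of `ToyLlama`; only the parameters with `requires_grad=True` should then be handed to the optimizer.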
In the paper, the authors say the model is finetuned end to end in stage three.
But the released code only provides stage 2, and stage 2 is not finetuned with LLaMA activated.
Could you provide instructions on how to use the stage 3 code?