Hi, the first language model we used to build MotionGPT was LLaMA-13B. However, it showed insufficient performance and low training efficiency. We assume the reason is the limited size of our motion dataset compared to LLaMA's large parameter count and language training data.
We therefore chose T5-770M, a smaller but widely used language model, as our final backbone. Many previous vision-language multimodal works, such as Unified-IO and BLIP, adopted T5's encoder-decoder architecture, which has shown strong capability on multimodal tasks. In addition, decoder-only models mainly shine in self-supervised training without paired data; since we train on paired motion-text data, that advantage is greatly weakened. We are still collecting a larger motion dataset for larger motion-language models.
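To make the encoder-decoder choice concrete, here is a minimal sketch of the general idea, assuming discrete motion codes from a VQ-VAE are added to the T5 vocabulary so motion and text share one token space. The codebook size and token names are illustrative, not the exact repository code:

```python
# Minimal sketch: extend a T5 backbone with discrete motion tokens so that
# motion and text share one vocabulary (illustrative, not the repo code).
from transformers import T5Tokenizer, T5ForConditionalGeneration

NUM_MOTION_TOKENS = 512  # hypothetical VQ-VAE codebook size

tokenizer = T5Tokenizer.from_pretrained("t5-large")           # ~770M parameters
model = T5ForConditionalGeneration.from_pretrained("t5-large")

# Register motion codes as extra tokens, e.g. "<motion_id_0>" ... "<motion_id_511>".
motion_tokens = [f"<motion_id_{i}>" for i in range(NUM_MOTION_TOKENS)]
tokenizer.add_tokens(motion_tokens)
model.resize_token_embeddings(len(tokenizer))

# A text-to-motion example then becomes an ordinary seq2seq pair:
prompt = "Generate motion: a person walks forward and waves."
target = " ".join(f"<motion_id_{i}>" for i in [12, 7, 301, 44])  # dummy codes

inputs = tokenizer(prompt, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss
```

With this setup, the encoder reads the text condition while the decoder generates motion codes, which is where the encoder-decoder design fits paired data naturally.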
We have evaluated MotionGPT with GPT-2 and are working on LLaMA-2 + LoRA. Please refer to the details below.
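This is not the referenced result, but as a rough sketch of what a LLaMA-2 + LoRA adaptation can look like with the PEFT library; the checkpoint name, rank, and target modules here are assumptions, not the configuration used in the paper:

```python
# Rough sketch of adapting LLaMA-2 with LoRA via PEFT (illustrative only).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint; requires access approval
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_config = LoraConfig(
    r=16,                                 # low-rank dimension (hypothetical)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trained
```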
Did you only do fine-tuning, or did you also perform pre-training?
Hello @ChangeNext
We employ both pre-training and fine-tuning for the T5 and GPT-2 models to ensure they are well adapted to our specific tasks.
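For reference, here is a minimal sketch of what such a two-stage schedule can look like with a T5 backbone; the data, hyperparameters, and motion codes are placeholders, not the released training pipeline:

```python
# Two-stage sketch: stage 1 pre-trains on raw motion-text pairs, stage 2
# fine-tunes with instruction-style prompts, both with the same seq2seq loss.
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")
# Assume motion codes were added to the vocabulary as in the earlier sketch.
tokenizer.add_tokens([f"<motion_id_{i}>" for i in range(512)])
model.resize_token_embeddings(len(tokenizer))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def run_stage(pairs, epochs):
    """Plain seq2seq loop over (source text, target text) pairs."""
    model.train()
    for _ in range(epochs):
        for src, tgt in pairs:
            batch = tokenizer(src, return_tensors="pt")
            labels = tokenizer(tgt, return_tensors="pt").input_ids
            loss = model(**batch, labels=labels).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

# Stage 1: motion-language pre-training on paired data (dummy example).
run_stage([("a person jumps in place", "<motion_id_3> <motion_id_77>")], epochs=1)

# Stage 2: instruction fine-tuning with task prompts (dummy example).
run_stage([("Generate motion: a person jumps in place",
            "<motion_id_3> <motion_id_77>")], epochs=1)
```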
It seems decoder-only GPT-style models like LLaMA-2 are more popular, but the paper still uses T5. Compared to GPT, does T5 have any special advantages?