RenShuhuai-Andy / TimeChat

[CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
https://arxiv.org/abs/2312.02051
BSD 3-Clause "New" or "Revised" License

Data type not aligned #39

Open KKKLeon opened 2 months ago

KKKLeon commented 2 months ago

Dear author, during training the image embeddings are in float16, but the input dtype for LLaMA is bfloat16. I am wondering whether this misalignment leads to precision loss during training, since float16 carries more mantissa bits than bfloat16.

RenShuhuai-Andy commented 2 months ago

Hi, thanks for your interest.

I think it's okay to use fp16 for the image embeddings since it reduces GPU memory, and it's common practice in other MLLMs such as Video-LLaMA. That said, it's possible that using bf16 throughout would improve performance.
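To make the precision question concrete: fp16 has a 10-bit mantissa while bf16 has only 7, so a value that fp16 stores exactly can be rounded away when it passes through bf16. The sketch below is not from the TimeChat codebase; it emulates bf16 in numpy by truncating a float32 bit pattern (real hardware rounds to nearest, but the effect on this example is the same).

```python
import numpy as np

def to_bf16(x):
    """Emulate bfloat16 by zeroing the low 16 bits of the float32
    bit pattern, leaving a 7-bit mantissa (truncation, not rounding)."""
    a = np.asarray(x, dtype=np.float32)
    bits = a.view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

x = np.float32(1.0009765625)  # 1 + 2**-10, exactly representable in fp16
fp16 = np.float16(x)          # kept exactly: fp16 has a 10-bit mantissa
bf16 = to_bf16(x)             # lost: bf16 keeps only 7 mantissa bits

print(float(fp16))  # 1.0009765625
print(float(bf16))  # 1.0
```

The flip side is that bf16 keeps float32's 8-bit exponent, so it tolerates much larger magnitudes without overflow, which is why it is often preferred for LLM training despite the coarser mantissa.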