OpenGVLab / Ask-Anything

[CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And many more supported LMs such as miniGPT4, StableLM, and MOSS.
https://vchat.opengvlab.com/
MIT License

Question about training stage3 in videochat2 #152

Open bexxnaz opened 3 months ago

bexxnaz commented 3 months ago

Hello! First of all, thank you for your great work on the videochat2 model.

I have a question about the training in stage 3, specifically line 274 of the videochat2_it.py file. In that line, it seems that the final targets include the "###Human: " prefix tokens. I'm wondering why this prefix is included in the final labels (targets). I assumed that only the tokens following "###Assistant: " should be supervised.

Currently, the labels sequence looks like this: `[-100, ..., -100, 835, 29950, 7889, 29901, -100, ..., -100, 835, 7900, ...]`

Could you please provide some clarification on this matter? I would greatly appreciate it.

Thank you!

Andy1621 commented 3 months ago

Good question! Actually, both ways work. We include the ###Human prefix because, in an early version of Vicuna, we found that it used ###Human. However, we have also tried removing it, and that works too.
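For readers following along, the two label-masking variants being discussed can be sketched as below. This is a minimal illustration, not the repository's actual code; the token ids and span indices are made up to mirror the example sequence from the question (`835, 29950, ...` standing in for the `###Human:` / `###Assistant:` pieces).

```python
IGNORE_INDEX = -100  # tokens with this label are excluded from the cross-entropy loss


def mask_labels(input_ids, supervised_spans):
    """Return labels equal to input_ids inside supervised_spans and
    IGNORE_INDEX everywhere else. Spans are half-open (start, end) ranges."""
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in supervised_spans:
        labels[start:end] = input_ids[start:end]
    return labels


# Toy token ids; indices are illustrative, not from the real tokenizer output.
ids = [1, 2, 835, 29950, 7889, 29901, 9, 9, 835, 7900, 10, 11]

# Variant A (as in the issue): supervise the '###Human:' prefix AND the answer.
labels_a = mask_labels(ids, [(2, 6), (8, 12)])

# Variant B: supervise only the tokens from '###Assistant:' onward.
labels_b = mask_labels(ids, [(8, 12)])
```

With variant A, `labels_a` reproduces the pattern shown in the question (the `###Human:` prefix tokens unmasked, the user's text masked, then the assistant tokens unmasked); variant B masks everything before the assistant span.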

bexxnaz commented 3 months ago

Thanks a lot for your reply! I am working on a project where I intend to replace the decoder of the model with mT0-xl (a multilingual model). However, I have some concerns about whether the ###Human / ###Assistant prefixes are compatible with this change. Any help would be greatly appreciated.

Andy1621 commented 3 months ago

I do not have enough experience with that, which is why I follow the design used in LLM training. I suggest you run some ablations, or check how other MLLMs handle it ~