Training Details

Hi @loveofguoke @longqian-zju,

I have the following queries on training the model.

Do we train the model with image-text or video-text pairs? If so, which datasets are used?
How do we train with GESM data? The converted data does not have a `conversations` key, but the source code expects one. After conversion, the only keys present are `id`, `data_type`, and `data`, as written by convert_data_gesm.py.
I assume we first need to train with GESM data and then perform supervised instruction fine-tuning with Moment-10M. Please confirm.
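
To make the key mismatch in the second question easy to reproduce, here is a minimal sketch that lists the top-level keys in the converted records and reports which expected ones are missing. It is not taken from the repository: the helper names (`summarize_keys`, `missing_keys`, `load_records`) are hypothetical, and it assumes the converted file is a single JSON list of dicts, which may not match the actual output of convert_data_gesm.py.

```python
import json
from collections import Counter

# Key the training code reportedly expects (taken from the question above).
EXPECTED_KEYS = {"conversations"}


def summarize_keys(records):
    """Count how often each top-level key appears across the records."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return counts


def missing_keys(records, expected=EXPECTED_KEYS):
    """Return the expected keys that are absent from at least one record."""
    return sorted(k for k in expected if any(k not in r for r in records))


def load_records(path):
    """Load records. Assumption: the file holds one JSON list of dicts."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    # Stand-in record with only the keys named in the question;
    # the values are placeholders, not real GESM data.
    sample = [{"id": 0, "data_type": "placeholder", "data": {}}]
    print("keys found:", dict(summarize_keys(sample)))
    print("missing:", missing_keys(sample))
```

To run it on real data, replace `sample` with `load_records(...)` pointed at the converted file.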

It would be really helpful if you could share the information needed to train the model, along with the expected dataset format.

Thanks in advance!

DCDmllm / Momentor