OpenMOSS / AnyGPT

Code for "AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling"

data format in mmichat_*.jsonl #34

Open hchc007 opened 1 month ago

hchc007 commented 1 month ago

Thank you for the great work!

Could you please share the data format or provide an example row from the mmichat_speech.jsonl file? In anygpt/src/train/stage2_sft.py, the preprocess function maps raw_datasets to tokenized_datasets. However, I'm a bit confused about how this processing works.

It would be very helpful if you could provide a short example or sample in JSONL format.

JunZhan2000 commented 1 month ago

Hello, we provide some training data samples and related descriptions; please refer to https://github.com/OpenMOSS/AnyGPT?tab=readme-ov-file#pretraining-and-sft
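For readers landing here from search: the confirmed schema lives in the README samples linked above. As a rough illustration only, the sketch below shows the general pattern of reading a chat-style JSONL file and mapping each row to token ids, the way a `preprocess` function typically works. The field names (`messages`, `role`, `content`), the `<sosp>`/`<eosp>` placeholder tokens, and the toy whitespace tokenizer are all assumptions for illustration, not the actual AnyGPT format or tokenizer.

```python
import json
from io import StringIO

# Hypothetical JSONL row (NOT the confirmed AnyGPT schema) standing in for
# one line of a mmichat_*.jsonl file.
sample_jsonl = StringIO(
    '{"messages": [{"role": "user", "content": "<sosp> speech tokens <eosp>"}, '
    '{"role": "assistant", "content": "Sure, here is the transcript."}]}\n'
)

def toy_tokenize(text):
    # Stand-in for the real tokenizer: whitespace split mapped to integer ids.
    vocab = {}
    return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]

def preprocess(row):
    # Flatten the conversation turns into one training string, then tokenize.
    # A real SFT preprocess would also build labels/attention masks.
    text = "\n".join(f"{m['role']}: {m['content']}" for m in row["messages"])
    return {"input_ids": toy_tokenize(text)}

rows = [json.loads(line) for line in sample_jsonl]
tokenized = [preprocess(r) for r in rows]
print(len(tokenized), len(tokenized[0]["input_ids"]))
```

In the actual script, `raw_datasets.map(preprocess, ...)` from the Hugging Face `datasets` library plays the role of the list comprehension here, applying the same per-row transformation in batches.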