Could you please share the data format or provide an example row from the mmichat_speech.jsonl file?
In anygpt/src/train/stage2_sft.py, the preprocess function maps raw_datasets to tokenized_datasets. However, I'm a bit confused about how this processing works.
It would be very helpful if you could provide a short sample in JSONL format.
Thank you for the great work!