FunAudioLLM / CosyVoice

Multi-lingual large voice generation model, providing inference, training and deployment full-stack ability.
https://funaudiollm.github.io/
Apache License 2.0

MultiLanguage and Single Language Train Cross Lingual #483

Open osmankrblt opened 1 month ago

osmankrblt commented 1 month ago

Hi. I am training the LLM model on Turkish and have a question: how does the LLM model know which language my dataset is in? I have a text file in my converted dataset folder, as the training stages require. Should I add a <|language|> code at the start of every text, or edit the whisper tokenizer language in cosyvoice.yaml?

If I have to edit cosyvoice.yaml, that means I can train only one language per training run. But I want the model to cover many languages, so I want to train on several languages in a single run. What should I do? For example, this is the text file in my dataset:

uuid1 textexttext
uuid2 textexttext2

That is my text file. Should it look like this instead?

uuid1 <|tr|> textexttext
uuid2 <|tr|> textexttext2

And if that format is fine, can I also mix languages in one file, like this?

uuid1 <|tr|> textexttext
uuid2 <|it|> textexttext2
uuid3 <|fr|> textexttext3
uuid4 <|de|> textexttext4

aluminumbox commented 1 month ago

Both are OK.
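
For anyone preparing such a file, here is a minimal sketch of how the tags could be prepended. It is not taken from the CosyVoice code base. It assumes each line of the text file is an utterance ID followed by whitespace and the transcript, as in the examples above. The helper names `tag_line` and `tag_lines` and the `lang_of` lookup are hypothetical. Whether the tokenizer configured in cosyvoice.yaml treats `<|tr|>` as a single special token is also an assumption you should check against your own setup.

```python
import re

# Matches a leading tag such as <|tr|> or <|de|> at the start of a transcript.
TAG_RE = re.compile(r"^<\|[a-z]{2,3}\|>")


def tag_line(line: str, lang: str) -> str:
    """Insert <|lang|> between the utterance ID and the transcript.

    Hypothetical helper: assumes "utt_id<whitespace>transcript" per line.
    """
    parts = line.strip().split(maxsplit=1)
    utt_id = parts[0]
    transcript = parts[1] if len(parts) > 1 else ""
    if TAG_RE.match(transcript):
        # Already tagged: leave the line as it is.
        return f"{utt_id} {transcript}"
    return f"{utt_id} <|{lang}|> {transcript}"


def tag_lines(lines, lang_of):
    """Tag every non-empty line; lang_of maps utterance ID -> language code."""
    tagged = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        utt_id = line.split(maxsplit=1)[0]
        tagged.append(tag_line(line, lang_of[utt_id]))
    return tagged


if __name__ == "__main__":
    lines = ["uuid1 textexttext", "uuid2 textexttext2"]
    lang_of = {"uuid1": "tr", "uuid2": "it"}  # illustrative mapping
    for out in tag_lines(lines, lang_of):
        print(out)
```

The per-utterance lookup is what allows several languages to share one file. For a single-language dataset the same language code can be passed for every ID.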

github-actions[bot] commented 1 week ago

This issue is stale because it has been open for 30 days with no activity.