Hi. I am training the LLM model on Turkish data and I have a question: how does the LLM model know which language my dataset is in? My converted dataset folder contains the text file that the training stages require. Should I add a <|language|> code at the start of every text entry, or should I edit the whisper tokenizer language in cosyvoice.yaml?
If I have to edit cosyvoice.yaml, that means I can only train one language per training run. But I want the model to support many languages, so I want to train several languages in a single run. What should I do? For example, this is the text file in my dataset folder:
uuid1 textexttext
uuid2 textexttext2
That is my current text file. Should it look like this instead?
uuid1 <|tr|> textexttext
uuid2 <|tr|> textexttext2
And if that format is correct, can I mix several languages in one file like this?
uuid1 <|tr|> textexttext
uuid2 <|it|> textexttext2
uuid3 <|fr|> textexttext3
uuid4 <|de|> textexttext4
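To make the proposed format concrete, here is a minimal sketch of how a tagged text file like the last example could be generated from an untagged one. It assumes one "utterance-id transcript" entry per line. The function name tag_text_lines and the lang_by_utt mapping are hypothetical helpers for illustration and are not part of CosyVoice. Whether the training stages actually read these <|xx|> tags is exactly the open question here.

```python
# Hypothetical helper, not part of CosyVoice: prefix each transcript
# with a <|lang|> tag, assuming one "utt_id transcript" entry per line.

def tag_text_lines(lines, lang_by_utt, default_lang=None):
    """Return the entries with a language tag inserted after the utterance id."""
    tagged = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split only on the first space: the id comes first, the rest is text.
        utt_id, _, transcript = line.partition(" ")
        lang = lang_by_utt.get(utt_id, default_lang)
        if lang is None:
            raise ValueError(f"no language known for utterance {utt_id}")
        tagged.append(f"{utt_id} <|{lang}|> {transcript.strip()}")
    return tagged


if __name__ == "__main__":
    untagged = [
        "uuid1 textexttext",
        "uuid2 textexttext2",
        "uuid3 textexttext3",
        "uuid4 textexttext4",
    ]
    # Assumed mapping from utterance id to language code.
    languages = {"uuid1": "tr", "uuid2": "it", "uuid3": "fr", "uuid4": "de"}
    for entry in tag_text_lines(untagged, languages):
        print(entry)
```

Passing default_lang="tr" with an empty mapping would produce the single-language variant, where every entry gets the same tag.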