Open cseti007 opened 4 months ago
I've had success training on Spanish with ~70 hours of data, but proper nouns aren't pronounced correctly, and the pronunciation in general isn't always ideal.
Of course you can. Check the Whisper tokenizer and add <|your language|> at the start of each sentence.
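A minimal sketch of the tagging step described above, assuming the Spanish tag is `<|es|>` as in Whisper's language-token list (verify this against the tokenizer configured in cosyvoice.yaml). The helper name `add_language_tag` is hypothetical and not part of CosyVoice:

```python
import re

# Matches a Whisper-style language tag such as <|es|> at the start of a string.
LANG_TAG = re.compile(r"^<\|[a-z]{2,3}\|>")

def add_language_tag(text: str, lang: str) -> str:
    """Prepend <|lang|> to a sentence unless it already starts with a tag.

    Hypothetical helper for preparing training text; `lang` is assumed to be
    a language code the tokenizer knows as a special token (e.g. "es").
    """
    text = text.strip()
    if LANG_TAG.match(text):
        return text  # already tagged, leave unchanged
    return f"<|{lang}|>{text}"

tagged = add_language_tag("Hola, me llamo Rodrigo.", "es")
print(tagged)
```

Applying this to every transcript line before tokenization keeps the tag placement consistent between training and inference.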
@aluminumbox I'm getting a weird issue in Spanish where proper nouns and uncommon words aren't pronounced properly; I think it might be a tokenizer issue. Do you have any idea how the BPE tokenizer would react to a new language, and why it would struggle with proper nouns and uncommon words?
We use the Whisper tokenizer; check cosyvoice.yaml. We also don't have much experience with Spanish tokenization.
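One way to investigate the tokenizer question is to count how many tokens each word is split into, since subword tokenizers generally break rare words into more pieces than common ones. The sketch below is an illustration only: `pieces_per_word` accepts any `encode` callable, and `toy_encode` is a hypothetical stand-in that uses greedy longest-match over a tiny made-up vocabulary. It is not real BPE and not the Whisper tokenizer; in practice you would pass in the encode function of the tokenizer named in cosyvoice.yaml.

```python
from typing import Callable, Dict, List

def pieces_per_word(words: List[str],
                    encode: Callable[[str], List[int]]) -> Dict[str, int]:
    """Return how many token ids `encode` produces for each word."""
    return {w: len(encode(w)) for w in words}

# Made-up vocabulary for illustration (an assumption, not a real tokenizer).
TOY_VOCAB = ["hola", "casa", "ca", "sa", "ho", "la"]

def toy_encode(word: str) -> List[int]:
    """Greedy longest-match encoder with a per-character fallback."""
    ids: List[int] = []
    i = 0
    word = word.lower()
    while i < len(word):
        # Try the longest vocabulary pieces first.
        for piece in sorted(TOY_VOCAB, key=len, reverse=True):
            if word.startswith(piece, i):
                ids.append(TOY_VOCAB.index(piece))
                i += len(piece)
                break
        else:
            # No piece matched: emit one id for the single character.
            ids.append(1000 + ord(word[i]))
            i += 1
    return ids

counts = pieces_per_word(["hola", "casa", "Xochimilco"], toy_encode)
print(counts)
```

If proper nouns in the Spanish data consistently produce far more tokens than ordinary words under the real tokenizer, that would be consistent with the tokenizer hypothesis, though it would not prove it.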
Hello @rlenain, are you training only the LLM or also the flow model? And how many GPU resources did you use for Spanish training?
Hi @aluminumbox, do you think it's better to train CosyVoice from scratch or to fine-tune the CosyVoice-300M base model if I want to train on a new language? Also, should I train both the LLM and the flow model if I fine-tune?
Thanks for your great work! I'm just wondering how large a dataset is recommended for training from scratch on other languages?
Thank you!