How can I add new language?

osmankrblt commented 1 week ago

Hello. I want to add Turkish to the CosyVoice model. How do I add a new language, and what steps are involved? I would like to add it by fine-tuning the model.

aluminumbox commented 1 week ago

Check the Whisper tokenizer, and add the language token at the start of each sentence.
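As a minimal sketch of that advice, the helper below prefixes a sentence with a Whisper-style language token such as `<|tr|>`. The function name `add_language_token` is hypothetical and not part of CosyVoice. Tagging the text only tells the model which language to expect; it does not by itself make the model able to speak a language it was not trained on.

```python
def add_language_token(text: str, lang: str) -> str:
    """Prefix text with a Whisper-style language token, e.g. <|tr|>."""
    tag = f"<|{lang}|>"
    text = text.strip()
    # Avoid tagging the same sentence twice
    if text.startswith(tag):
        return text
    return tag + text


tagged = add_language_token("Merhaba dünya.", "tr")
print(tagged)  # <|tr|>Merhaba dünya.
```

The same helper could be applied to every transcript line when preparing fine-tuning data.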

justinatbahasa commented 1 week ago

@aluminumbox should we also change the language parameter in the get_tokenizer config?


get_tokenizer: !name:whisper.tokenizer.get_tokenizer # change to !name:cosyvoice.tokenizer.tokenizer.get_tokenizer if you want to train with CosyVoice-300M-25Hz recipe
    multilingual: True
    num_languages: 100
    language: 'en'
    task: 'transcribe'

osmankrblt commented 1 week ago

For example, I want to use a Turkish speaker's voice to read text in a different language. Do I need to train the model for this? Should I train with Turkish speaker embeddings?

osmankrblt commented 1 week ago

The model produces meaningless sounds. The voice itself is reproduced successfully, but the words are not pronounced correctly. I could not tell which language it was speaking.

# cross_lingual usage (assumes `cosyvoice` is an already-loaded CosyVoice model)
import torchaudio
from cosyvoice.utils.file_utils import load_wav

text = '<|tr|> ' + "Her savaşta fırtınalar arasında sakinlik olur. İnancımızı kaybettiğimiz günler olacak. Müttefiklerimizin bize sırt çevirdiği günler… ama bu gezegeni ve halkını terk edeceğimiz gün asla gelmeyecek."
prompt_speech_16k = load_wav('../Custom Voices/ voice.wav', 16000)
for i, j in enumerate(cosyvoice.inference_cross_lingual(text, prompt_speech_16k, stream=False)):
    torchaudio.save('cross_lingual_{}.wav'.format(i), j['tts_speech'], 22050)

aluminumbox commented 1 week ago

@justinatbahasa No, you do not need to change the language parameter in the get_tokenizer config.

aluminumbox commented 1 week ago

@osmankrblt The output is unintelligible because we have no Turkish in our training data.

osmankrblt commented 1 week ago

How can I add Turkish data and train the model? For example, I could use the Common Voice dataset.
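One possible starting point is sketched below. It assumes the Kaldi-style data directory (`wav.scp`, `text`, `utt2spk`, `spk2utt`) that the repository's example recipe appears to use; that layout is an assumption here, not something confirmed in this thread. The function name, utterance IDs, speaker IDs, and paths are all hypothetical placeholders rather than real Common Voice entries. Speaker embeddings and speech tokens would still have to be extracted in later steps.

```python
import os
import tempfile


def write_kaldi_style_dir(rows, out_dir, lang="tr"):
    """Write wav.scp, text, utt2spk and spk2utt from (utt, spk, wav, sentence) rows."""
    os.makedirs(out_dir, exist_ok=True)
    spk2utt = {}
    paths = {name: os.path.join(out_dir, name) for name in ("wav.scp", "text", "utt2spk")}
    with open(paths["wav.scp"], "w", encoding="utf-8") as wav_scp, \
         open(paths["text"], "w", encoding="utf-8") as text_f, \
         open(paths["utt2spk"], "w", encoding="utf-8") as utt2spk:
        for utt, spk, wav, sentence in rows:
            wav_scp.write(f"{utt} {wav}\n")
            # Language token at the start of each sentence, as suggested earlier in the thread
            text_f.write(f"{utt} <|{lang}|>{sentence}\n")
            utt2spk.write(f"{utt} {spk}\n")
            spk2utt.setdefault(spk, []).append(utt)
    with open(os.path.join(out_dir, "spk2utt"), "w", encoding="utf-8") as f:
        for spk, utts in spk2utt.items():
            f.write(f"{spk} {' '.join(utts)}\n")
    return spk2utt


# Hypothetical rows; in practice these would come from the dataset's metadata file
rows = [
    ("utt_0001", "spk_a", "/data/tr/clips/utt_0001.wav", "Merhaba dünya."),
    ("utt_0002", "spk_a", "/data/tr/clips/utt_0002.wav", "Bugün hava güzel."),
    ("utt_0003", "spk_b", "/data/tr/clips/utt_0003.wav", "Teşekkür ederim."),
]
out_dir = tempfile.mkdtemp()
spk2utt = write_kaldi_style_dir(rows, out_dir)
```

Keeping the language token in the `text` file means every training sentence carries the tag without changing the tokenizer config.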

justinatbahasa commented 1 week ago

@aluminumbox Is 800-900 hours of audio enough to train a new language, or should I train from scratch?

aluminumbox commented 1 week ago

@justinatbahasa No need to train from scratch. I think 5k+ hours of audio is suitable to cover a new language.

justinatbahasa commented 1 week ago

Should I fine-tune both the LLM and the flow, or is fine-tuning the LLM alone enough?

osmankrblt commented 1 week ago

@aluminumbox

def forward(
            self,
            batch: dict,
            device: torch.device,
    ) -> Dict[str, Optional[torch.Tensor]]:
        """
        Args:
            text: (B, L, D)
            text_lengths: (B,)
            audio: (B, T, N) or (B, T)
            audio_lengths: (B,)
        """
        text_token = batch['text_token'].to(device)
        text_token_len = batch['text_token_len'].to(device)
        speech_token = batch['speech_token'].to(device)
        speech_token_len = batch['speech_token_len'].to(device)
        embedding = batch['embedding'].to(device)

For example, this is the LLM forward pass. Given data that consists of audio and its text, how do I convert it into this batch format? How do I feed it to the LLM? And how do I obtain the embedding used in embedding = batch['embedding'].to(device)?

osmankrblt commented 1 week ago

import torch

batch_size = 2  
max_text_token_len = 1000  
max_speech_token_len = 16000*10  
embedding_dim = 192  # Embedding size

# Create dummy text_token, speech_token and embedding data
dummy_batch = {
    'text_token': torch.randint(0, 1000, (batch_size, max_text_token_len)).to('cuda'),  # Text tokens (e.g. token IDs from 0 to 999)
    'text_token_len': torch.tensor([max_text_token_len, max_text_token_len]).to('cuda'),  # Text token lengths
    'speech_token': torch.randn(batch_size, max_speech_token_len).to('cuda').long(),  # Speech tokens
    'speech_token_len': torch.tensor([ max_speech_token_len, max_speech_token_len]).to('cuda'),  # Speech token lengths
    'embedding': torch.randn(batch_size, embedding_dim).to('cuda')  # Embedding data (batch_size, embedding_dim)
}

llm_model.forward(dummy_batch, "cpu")

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[63], line 1
----> 1 model.forward(dummy_batch, "cpu")

File cosyvoice/llm/llm.py:126, in TransformerLM.forward(self, batch, device)
    123 task_id_emb = self.llm_embedding.weight[self.task_id].reshape(1, 1, -1)
    125 # 4. encode speech_token
--> 126 speech_token = self.speech_embedding(speech_token)
    128 # 5. unpad and pad
    129 lm_input, lm_input_len = self.pad_unpad_sequence(sos_eos_emb, embedding, text_token, text_token_len,
    130                                                  task_id_emb, speech_token, speech_token_len)

File torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File torch/nn/modules/sparse.py:162, in Embedding.forward(self, input)
    161 def forward(self, input: Tensor) -> Tensor:
--> 162     return F.embedding(
    163         input, self.weight, self.padding_idx, self.max_norm,
    164         self.norm_type, self.scale_grad_by_freq, self.sparse)

File torch/nn/functional.py:2210, in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2204     # Note [embedding_renorm set_grad_enabled]
   2205     # XXX: equivalent to
   2206     # with torch.no_grad():
   2207     #   torch.embedding_renorm_
   2208     # remove once script supports set_grad_enabled
   2209     _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2210 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

IndexError: index out of range in self

I got this error. I think I have sorted out the embedding values. @aluminumbox
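Judging from the traceback, a likely cause is the dummy speech_token tensor. The call torch.randn(...).long() truncates normally distributed floats to integers such as -2, -1, 0 and 1, and an embedding lookup only accepts indices from 0 up to the table size minus one. Speech tokens are also discrete IDs, so their length is counted in tokens rather than audio samples. The sketch below builds a dummy batch with valid IDs and reproduces the error on a stand-in embedding table. The vocabulary sizes and the function name `make_dummy_batch` are assumptions for illustration; the real sizes should come from the model config.

```python
import torch

# Hypothetical sizes for illustration; take the real ones from your model config.
TEXT_VOCAB, SPEECH_VOCAB, EMB_DIM = 1000, 4096, 192


def make_dummy_batch(batch_size=2, text_len=20, speech_len=100):
    """Build a dummy batch whose token IDs are valid embedding indices."""
    return {
        # randint draws integers in [0, vocab), so every ID is a valid index
        'text_token': torch.randint(0, TEXT_VOCAB, (batch_size, text_len)),
        'text_token_len': torch.full((batch_size,), text_len, dtype=torch.int32),
        'speech_token': torch.randint(0, SPEECH_VOCAB, (batch_size, speech_len)),
        'speech_token_len': torch.full((batch_size,), speech_len, dtype=torch.int32),
        'embedding': torch.randn(batch_size, EMB_DIM),
    }


# Stand-in for the model's speech embedding table (output size 8 chosen arbitrarily)
speech_embedding = torch.nn.Embedding(SPEECH_VOCAB, 8)

batch = make_dummy_batch()
out = speech_embedding(batch['speech_token'])  # valid IDs: lookup succeeds

# A negative ID, as torch.randn(...).long() can produce, reproduces the error
try:
    speech_embedding(torch.tensor([[-1, 0, 1]]))
    raised = False
except IndexError:
    raised = True
```

If the error persists with valid speech tokens, checking the maximum text token ID against the text embedding size would be the next thing to try.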

osmankrblt commented 1 week ago

What should the LLM input shapes be, @aluminumbox?

FunAudioLLM / CosyVoice

How can I add new language? #466