Open osmankrblt opened 1 week ago
check whisper tokenizer, add language token at sentence start.
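As a rough illustration of "add language token at sentence start", here is a minimal sketch. The helper name `with_language_tag` is hypothetical and not part of CosyVoice or Whisper; it only builds the string, and whether the tag is then mapped to a single special token depends on the tokenizer in use.

```python
import re

# Matches a Whisper-style language tag such as <|tr|> at the start of the text.
LANG_TAG = re.compile(r'^\s*<\|[a-z]{2,3}\|>')


def with_language_tag(text: str, lang: str) -> str:
    """Prepend <|lang|> unless the text already starts with a language tag."""
    if LANG_TAG.match(text):
        return text
    return f'<|{lang}|>{text}'


tagged = with_language_tag('Merhaba', 'tr')
```

The tagged string can then be passed as the text argument, in the same way the cross-lingual snippet in this thread passes '<|tr|> ' + text.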
@aluminumbox should we also change the language parameter in the get_tokenizer config?
get_tokenizer: !name:whisper.tokenizer.get_tokenizer # change to !name:cosyvoice.tokenizer.tokenizer.get_tokenizer if you want to train with CosyVoice-300M-25Hz recipe
    multilingual: True
    num_languages: 100
    language: 'en'
    task: 'transcribe'
For example, I want to use the voice of a Turkish speaker while reading text in a different language. Do I need to train for this? Should I train with Turkish speaker embeddings?
The model produces meaningless sounds. The voice is reproduced successfully, but the words are not pronounced correctly. I couldn't tell what language it was speaking.
# cross_lingual usage (assumes `cosyvoice` is an already loaded CosyVoice model)
import torchaudio
from cosyvoice.utils.file_utils import load_wav

prompt_speech_16k = load_wav('../Custom Voices/ voice.wav', 16000)
for i, j in enumerate(cosyvoice.inference_cross_lingual('<|tr|> ' + "Her savaşta fırtınalar arasında sakinlik olur. İnancımızı kaybettiğimiz günler olacak. Müttefiklerimizin bize sırt çevirdiği günler… ama bu gezegeni ve halkını terk edeceğimiz gün asla gelmeyecek.", prompt_speech_16k, stream=False)):
    torchaudio.save('cross_lingual_{}.wav'.format(i), j['tts_speech'], 22050)
No, the language parameter in the get_tokenizer config does not need to change.
The speech comes out as meaningless sounds because we have no Turkish in our training data.
How can I add Turkish data and train? I could use the Common Voice dataset, for example.
@aluminumbox are 800-900 hours of audio enough to train a new language, or should I train from scratch?
No need to train from scratch. I think at least 5k+ hours is suitable to cover a new language.
Should I fine-tune both the LLM and the flow model, or is fine-tuning just the LLM enough?
@aluminumbox
def forward(
        self,
        batch: dict,
        device: torch.device,
) -> Dict[str, Optional[torch.Tensor]]:
    """
    Args:
        text: (B, L, D)
        text_lengths: (B,)
        audio: (B, T, N) or (B, T)
        audio_lengths: (B,)
    """
    text_token = batch['text_token'].to(device)
    text_token_len = batch['text_token_len'].to(device)
    speech_token = batch['speech_token'].to(device)
    speech_token_len = batch['speech_token_len'].to(device)
    embedding = batch['embedding'].to(device)
For example, this is the LLM forward pass. Given data that contains a voice recording and its text, how can I bring the data into this form? How do I feed it to the LLM? How can I get this embedding = batch['embedding'].to(device)?
import torch

batch_size = 2
max_text_token_len = 1000
max_speech_token_len = 16000*10
embedding_dim = 192  # embedding size

# Create dummy text_token, speech_token and embedding data
dummy_batch = {
    'text_token': torch.randint(0, 1000, (batch_size, max_text_token_len)).to('cuda'),  # text token data (e.g. tokens in the range 0-1000)
    'text_token_len': torch.tensor([max_text_token_len, max_text_token_len]).to('cuda'),  # text token lengths
    'speech_token': torch.randn(batch_size, max_speech_token_len).to('cuda').long(),  # speech token data
    'speech_token_len': torch.tensor([max_speech_token_len, max_speech_token_len]).to('cuda'),  # speech token lengths
    'embedding': torch.randn(batch_size, embedding_dim).to('cuda')  # embedding data (batch_size, embedding_dim)
}
llm_model.forward(dummy_batch, "cpu")
IndexError                                Traceback (most recent call last)
Cell In[63], line 1
----> 1 model.forward(dummy_batch, "cpu")

File cosyvoice/llm/llm.py:126, in TransformerLM.forward(self, batch, device)
    123 task_id_emb = self.llm_embedding.weight[self.task_id].reshape(1, 1, -1)
    125 # 4. encode speech_token
--> 126 speech_token = self.speech_embedding(speech_token)
    128 # 5. unpad and pad
    129 lm_input, lm_input_len = self.pad_unpad_sequence(sos_eos_emb, embedding, text_token, text_token_len,
    130                                                  task_id_emb, speech_token, speech_token_len)

File torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File torch/nn/modules/sparse.py:162, in Embedding.forward(self, input)
    161 def forward(self, input: Tensor) -> Tensor:
--> 162     return F.embedding(
    163         input, self.weight, self.padding_idx, self.max_norm,
    164         self.norm_type, self.scale_grad_by_freq, self.sparse)

File torch/nn/functional.py:2210, in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2204     # Note [embedding_renorm set_grad_enabled]
   2205     # XXX: equivalent to
   2206     # with torch.no_grad():
   2207     #   torch.embedding_renorm_
   2208     # remove once script supports set_grad_enabled
   2209     _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2210 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

IndexError: index out of range in self
I got this error. I think I have sorted out the embedding values. @aluminumbox
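A possible cause, offered as a guess rather than a confirmed diagnosis: the traceback stops at the speech_embedding lookup, and an nn.Embedding only accepts ids in the range [0, num_embeddings). Dummy speech tokens built with torch.randn(...).long() contain negative values, because the float-to-long cast truncates toward zero. The sketch below reproduces that on CPU; the speech_token_size value and the small embedding width are assumptions, so check them against your own config.

```python
import torch

speech_token_size = 4096  # assumed vocabulary size; read the real value from your config

# Casting floats to long truncates toward zero, so samples from
# torch.randn(...) turn into small integers, some of them negative.
bad_ids = torch.tensor([[-1.7, 0.3, 2.9]]).long()  # becomes [[-1, 0, 2]]

# Valid ids are drawn directly from [0, speech_token_size).
good_ids = torch.randint(0, speech_token_size, (2, 50))


def ids_in_range(ids: torch.Tensor, vocab_size: int) -> bool:
    """Return True if every id is a valid row index for an embedding table."""
    return bool(((ids >= 0) & (ids < vocab_size)).all())


embedding = torch.nn.Embedding(speech_token_size, 8)  # width 8 is arbitrary
embedding(good_ids)  # in-range ids look up fine

try:
    embedding(bad_ids)  # the negative id triggers the IndexError on CPU
    lookup_failed = False
except IndexError:
    lookup_failed = True
```

Running a check like ids_in_range on text_token and speech_token before calling forward makes this kind of problem visible before it reaches the embedding layer.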
What should the LLM input shapes be, @aluminumbox?
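On the shapes question, here is a minimal sketch of how per-utterance data could be collated into the batch dict that the forward method above reads. It is not the repository's own data pipeline. The `collate` function, the pad id of 0, the int32 length dtype and the example lengths are all assumptions for illustration; the key names and the 192-dimensional embedding come from the snippets in this thread.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def collate(samples, pad_id=0):
    """Pad variable-length token sequences and stack them into one batch dict."""
    text = [s['text_token'] for s in samples]
    speech = [s['speech_token'] for s in samples]
    return {
        # (B, max_text_len) integer ids, padded on the right
        'text_token': pad_sequence(text, batch_first=True, padding_value=pad_id),
        # (B,) true lengths before padding
        'text_token_len': torch.tensor([t.size(0) for t in text], dtype=torch.int32),
        # (B, max_speech_len) integer ids, padded on the right
        'speech_token': pad_sequence(speech, batch_first=True, padding_value=pad_id),
        'speech_token_len': torch.tensor([t.size(0) for t in speech], dtype=torch.int32),
        # (B, 192) one speaker embedding per utterance
        'embedding': torch.stack([s['embedding'] for s in samples]),
    }


# Two fake utterances with different lengths (values are placeholders).
samples = [
    {'text_token': torch.randint(0, 1000, (5,)),
     'speech_token': torch.randint(0, 4096, (12,)),
     'embedding': torch.randn(192)},
    {'text_token': torch.randint(0, 1000, (3,)),
     'speech_token': torch.randint(0, 4096, (9,)),
     'embedding': torch.randn(192)},
]
batch = collate(samples)
```

With this layout the token tensors are 2-D integer tensors of shape (B, L), the length tensors are 1-D of shape (B,), and the embedding is a float tensor of shape (B, 192).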
Hello. I want to add Turkish to the CosyVoice model. How do I add a new language? What should I do? I want to add a new language and use it by fine-tuning the model.