[Open] xuanhan863 opened this issue 4 days ago
You may want to explicitly set the language parameter to Chinese; maybe the language detector is having trouble detecting the language. Can you post your code?
Very simple test code:
```python
from auralis import TTS, TTSRequest

tts = TTS().from_pretrained('AstraMindAI/xttsv2')
request = TTSRequest(
    text="你好, 很高兴认识你.",
    language="zh-cn",
    speaker_files=['reference.wav'],
)
tts.generate_speech(request)
```
I've also confirmed that the language check outputs zh-cn when the language parameter is set to auto:

`langid.classify(text)[0].strip()`
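For anyone reproducing the detection step, here's a minimal, self-contained check (assuming the `langid` package; `classify()` returns a `(language_code, score)` tuple):

```python
import langid  # pip install langid

text = "你好, 很高兴认识你."
# classify() returns a (language_code, score) tuple; take the code
detected = langid.classify(text)[0].strip()
print(detected)  # a Chinese language code is expected for this input
```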
Did you try it with the default xttsv2 implementation from coqui as a sanity check? If it sounds OK with theirs, I'll dig deeper.
Any update on this issue? I'm hitting the same problem: the default xttsv2 implementation sounds fine, but this project's Chinese output is worse with the same reference speaker.
I tried to change the code in tokenizer.py:
```python
# # Preprocess each text in the batch with its corresponding language
# processed_texts = []
# for text, text_lang in zip(batch_text_or_text_pairs, lang):
#     if isinstance(text, str):
#         # Check length and preprocess
#         # self.check_input_length(text, text_lang)
#         processed_text = self.preprocess_text(text, text_lang)
#         # Format text with language tag and spaces
#         base_lang = text_lang.split("-")[0]
#         lang_code = "zh-cn" if base_lang == "zh" else base_lang
#         processed_text = f"[{lang_code}]{processed_text}"
#         processed_text = processed_text.replace(" ", "[SPACE]")
#         processed_texts.append(processed_text)
#     else:
#         processed_texts.append(text)

# Call the parent class's encoding method with the raw texts
# (the per-text preprocessing above is commented out)
return super()._batch_encode_plus(
    # processed_texts,
    batch_text_or_text_pairs,
    add_special_tokens=add_special_tokens,
    padding_strategy=padding_strategy,
    truncation_strategy=truncation_strategy,
    max_length=max_length,
    stride=stride,
    is_split_into_words=is_split_into_words,
    pad_to_multiple_of=pad_to_multiple_of,
    return_tensors=return_tensors,
    return_token_type_ids=return_token_type_ids,
    return_attention_mask=return_attention_mask,
    return_overflowing_tokens=return_overflowing_tokens,
    return_special_tokens_mask=return_special_tokens_mask,
    return_offsets_mapping=return_offsets_mapping,
    return_length=return_length,
    verbose=verbose,
    **kwargs
)
```
It works for me.
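If I'm reading the change right, the commented-out block re-applies formatting (the `[lang]` tag and the `[SPACE]` substitution) to text that has presumably already been preprocessed earlier in the pipeline, so running it a second time would double the markers. A toy illustration of that guess, using a made-up already-processed string (not Auralis code):

```python
# Hypothetical illustration of double preprocessing (assumption, not Auralis code):
# if the text reaching _batch_encode_plus was already tagged and space-substituted
# upstream, re-applying the same formatting doubles the markers.
already_processed = "[zh-cn]ni3hao3,[SPACE]hen3gao1xing4ren4shi2ni3"
re_processed = f"[zh-cn]{already_processed}".replace(" ", "[SPACE]")
print(re_processed)
# [zh-cn][zh-cn]ni3hao3,[SPACE]hen3gao1xing4ren4shi2ni3
```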
@wjddd can you open a PR?
Sure.
> I tried to change the code in tokenizer.py: […] It works for me.
It works for me too, thanks a lot. I hit another issue, though: when I generate speech for a long Chinese text using a Chinese male voice as the reference audio, this project produces a voice that sounds like a foreigner speaking Chinese. With the default xttsv2 there is no such issue.
There seems to be some problem with Chinese generation: with a chunk sequence like ['ni3hao3, hen3gao1xing4ren4shi2ni3'], the output audio is unintelligible.
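For what it's worth, that chunk looks like tone-numbered pinyin of the test sentence, which is the kind of transliteration the upstream XTTS tokenizer applies to zh-cn text via pypinyin. A minimal sketch that reproduces the chunk (assuming pypinyin is installed):

```python
from pypinyin import Style, lazy_pinyin  # pip install pypinyin

# Tone-numbered (TONE3) pinyin, as used for zh-cn in the XTTS tokenizer
text = "你好, 很高兴认识你"
print("".join(lazy_pinyin(text, style=Style.TONE3)))
# ni3hao3, hen3gao1xing4ren4shi2ni3
```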