[Open] xuanhan863 opened this issue 4 days ago
You may want to explicitly set the language parameter to Chinese; maybe the language detector is having trouble detecting the language. Can you post your code?
Very simple test code:
```python
from auralis import TTS, TTSRequest

tts = TTS().from_pretrained('AstraMindAI/xttsv2')
request = TTSRequest(
    text="你好, 很高兴认识你.",
    language="zh-cn",
    speaker_files=['reference.wav'],
)
tts.generate_speech(request)
```
I've also confirmed that the language check outputs zh-cn when the language parameter is set to auto:

`langid.classify(text)[0].strip()`
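For anyone reproducing the detection step, here's a minimal, self-contained check (assuming the `langid` package; `classify()` returns a `(language_code, score)` tuple):

```python
import langid  # pip install langid

text = "你好, 很高兴认识你."
# classify() returns a (language_code, score) tuple; take the code
detected = langid.classify(text)[0].strip()
print(detected)  # a Chinese language code is expected for this input
```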
Did you try it with the default xttsv2 implementation from coqui as a sanity check? If it sounds OK with theirs, I'll dig deeper.
Any update on this issue? I'm hitting the same problem: the default xttsv2 implementation sounds fine, but this project's Chinese output is worse with the same reference speaker.
I tried to change the code in tokenizer.py:
```python
# # Preprocess each text in the batch with its corresponding language
# processed_texts = []
# for text, text_lang in zip(batch_text_or_text_pairs, lang):
#     if isinstance(text, str):
#         # Check length and preprocess
#         # self.check_input_length(text, text_lang)
#         processed_text = self.preprocess_text(text, text_lang)
#         # Format text with language tag and spaces
#         base_lang = text_lang.split("-")[0]
#         lang_code = "zh-cn" if base_lang == "zh" else base_lang
#         processed_text = f"[{lang_code}]{processed_text}"
#         processed_text = processed_text.replace(" ", "[SPACE]")
#         processed_texts.append(processed_text)
#     else:
#         processed_texts.append(text)

# Call the parent class's encoding method with the raw texts
# (the per-text preprocessing above is commented out)
return super()._batch_encode_plus(
    # processed_texts,
    batch_text_or_text_pairs,
    add_special_tokens=add_special_tokens,
    padding_strategy=padding_strategy,
    truncation_strategy=truncation_strategy,
    max_length=max_length,
    stride=stride,
    is_split_into_words=is_split_into_words,
    pad_to_multiple_of=pad_to_multiple_of,
    return_tensors=return_tensors,
    return_token_type_ids=return_token_type_ids,
    return_attention_mask=return_attention_mask,
    return_overflowing_tokens=return_overflowing_tokens,
    return_special_tokens_mask=return_special_tokens_mask,
    return_offsets_mapping=return_offsets_mapping,
    return_length=return_length,
    verbose=verbose,
    **kwargs
)
```
It works for me.
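If I'm reading the change right, the commented-out block re-applies formatting (the `[lang]` tag and the `[SPACE]` substitution) to text that has presumably already been preprocessed earlier in the pipeline, so running it a second time would double the markers. A toy illustration of that guess, using a made-up already-processed string (not Auralis code):

```python
# Hypothetical illustration of double preprocessing (assumption, not Auralis code):
# if the text reaching _batch_encode_plus was already tagged and space-substituted
# upstream, re-applying the same formatting doubles the markers.
already_processed = "[zh-cn]ni3hao3,[SPACE]hen3gao1xing4ren4shi2ni3"
re_processed = f"[zh-cn]{already_processed}".replace(" ", "[SPACE]")
print(re_processed)
# [zh-cn][zh-cn]ni3hao3,[SPACE]hen3gao1xing4ren4shi2ni3
```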
@wjddd can you open a PR?
Sure.
> I tried to change the code in tokenizer.py: […] It works for me.
It works for me too, thanks a lot. I hit another issue, though: when I generate speech for a long Chinese text using a Chinese male voice as the reference audio, this project produces a voice that sounds like a foreigner speaking Chinese. With the default xttsv2 there is no such issue.
There seems to be some problem with Chinese generation: with a chunk sequence like ['ni3hao3, hen3gao1xing4ren4shi2ni3'], the output audio is unintelligible.
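For what it's worth, that chunk looks like tone-numbered pinyin of the test sentence, which is the kind of transliteration the upstream XTTS tokenizer applies to zh-cn text via pypinyin. A minimal sketch that reproduces the chunk (assuming pypinyin is installed):

```python
from pypinyin import Style, lazy_pinyin  # pip install pypinyin

# Tone-numbered (TONE3) pinyin, as used for zh-cn in the XTTS tokenizer
text = "你好, 很高兴认识你"
print("".join(lazy_pinyin(text, style=Style.TONE3)))
# ni3hao3, hen3gao1xing4ren4shi2ni3
```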