Open BBC-Esq opened 7 months ago
Yeah, that would be a nice idea, although as you correctly pointed out, language switching is going to be challenging.
We could try to train a model that would detect the language of each input token but I am not sure how well it would work in practice.
A bit related: there is a different "API" in the Gradio demo where you can specify the language inside the text string with html-like tags. Have you seen it?
I have another idea then... What about changing the default to "auto" so a user doesn't have to (but can) specify a language? For example, within pipeline.py it states:

```python
def generate_to_file(self, fname, text, speaker=None, lang='en', cps=15, step_callback=None):
    self.vocoder.decode_to_file(fname, self.generate_atoks(text, speaker, lang=lang, cps=cps, step_callback=None))
```
Could we set the default to lang='auto' instead? Then we'd simply modify the source code to use langdetect on the input text to get the language identifier whenever a language isn't specified, since "auto" would be the default. This would save users from having to look up the language codes themselves and specify one each time. We'd still keep the ability for a user to specify the language explicitly; auto-detect would simply be the default.
For example, this would let users rely on auto-detect for sentences in a single language, which langdetect would have no problem identifying, while still keeping the ability to specify multiple languages when the text string is multilingual.
Yeah, that sounds nice. I’d like to move away from the lang= parameter, but we could use this auto-detection when there are no tags in the text.
Sounds good. It would require modifying the source code somewhat, and I might be able to take that on, but I haven't had the time to analyze the code base further. If you're willing, could you explain briefly how the language parameter operates? I see the language script, but could you explain, for example...
1) lang= is passed to script A
2) then it's passed to script B
3) then the languages.py script is consulted...
4) and so on...
I only ask because this is a hobby of mine and I'm not a programmer by trade, and a summary of the program's flow would save me a lot of time. For example, my basic understanding so far (using generate_to_file as an example) is:
1) runs generate_atoks
2) which runs t2s.generate
3) and runs s2a.generate
4) returning back to pipeline.py, which then runs vocoder.decode_to_file using what it obtained from generate_atoks
As an amateur this took me hours to understand, so any help would be much appreciated since I'd like to contribute more efficiently!
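The four-step flow above can be restated as a runnable sketch. The stub classes are hypothetical stand-ins so the flow can execute; they are not the real WhisperSpeech t2s/s2a/vocoder models, and the method signatures are copied from the thread's summary rather than verified against the source:

```python
class _StubModel:
    """Hypothetical stand-in for the t2s and s2a models."""
    def __init__(self, output):
        self.output = output
        self.calls = []
    def generate(self, *args, **kwargs):
        self.calls.append((args, kwargs))
        return self.output

class _StubVocoder:
    """Hypothetical stand-in for the vocoder."""
    def __init__(self):
        self.calls = []
    def decode_to_file(self, fname, atoks):
        self.calls.append((fname, atoks))

class Pipeline:
    def __init__(self):
        self.t2s = _StubModel("semantic-tokens")   # text -> semantic tokens
        self.s2a = _StubModel("acoustic-tokens")   # semantic -> acoustic tokens
        self.vocoder = _StubVocoder()

    def generate_atoks(self, text, speaker=None, lang='en', cps=15, step_callback=None):
        stoks = self.t2s.generate(text, cps=cps, lang=lang)  # steps 1-2: t2s.generate
        return self.s2a.generate(stoks, speaker)             # step 3: s2a.generate

    def generate_to_file(self, fname, text, speaker=None, lang='en', cps=15, step_callback=None):
        # step 4: back in pipeline.py, hand the acoustic tokens to the vocoder
        self.vocoder.decode_to_file(fname, self.generate_atoks(text, speaker, lang=lang, cps=cps))
```

Under this reading, lang= only matters on the t2s (text-to-semantic) leg; everything downstream just consumes tokens.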
@jpc Just to give you an idea, I didn't know what the word "python" even meant until approximately 9 months ago. ;-)
Instead of having to pass the language identifiers (e.g. de or pl), perhaps auto-detect the language in a multilingual text string. A library like langdetect might be used. Example:
The challenge would be implementing language detection within a single text string, since langdetect is geared toward detecting the "predominant" language of a string of text... But assuming we could parse it intelligently (or there's a better library), it would remove the need to pass language identifiers to the methods in the WhisperSpeech library...