Open BBC-Esq opened 7 months ago
Yeah, that would be a nice idea, although as you correctly pointed out, language switching is going to be challenging.
We could try to train a model that would detect the language of each input token but I am not sure how well it would work in practice.
A bit related: there is a different "API" in the Gradio demo where you can specify the language inside the text string with html-like tags. Have you seen it?
I have another idea then... What about changing the default to "auto" so a user doesn't have to (but can) specify a language? For example, within pipeline.py it states:

```python
def generate_to_file(self, fname, text, speaker=None, lang='en', cps=15, step_callback=None):
    self.vocoder.decode_to_file(fname, self.generate_atoks(text, speaker, lang=lang, cps=cps, step_callback=None))
```
Could we set the default to lang='auto' instead? Then we'd simply modify the source code to use langdetect on the input text to get the language identifier whenever a language isn't specified, since "auto" would be the default. This would save users from having to look up the language codes themselves and specify one each time. We'd still keep the ability for a user to specify the language explicitly; auto-detect would simply be the default.
For example, this would let users rely on auto-detect for sentences in a single language, which langdetect would have no problem identifying, while still keeping the ability to specify multiple languages when the text string is multilingual.
Yeah, that sounds nice. I’d like to move away from the lang= parameter, but we could use this auto-detection when there are no tags in the text.
Sounds good. It would require modifying the source code somewhat, and I might be able to take that on, but I haven't had the time to analyze the code base further. If you're willing, could you explain briefly how the language parameter operates? I see the language script, but could you explain, for example...
1) lang= is passed to script A
2) then it's passed to script B
3) then the languages.py script is consulted...
4) and so on...
I only ask because this is a hobby of mine and I'm not a programmer by trade, and a summary of the program's flow would save me a lot of time. For example, my basic understanding so far (using generate_to_file as an example) is:
1) runs generate_atoks
2) which runs t2s.generate
3) and runs s2a.generate
4) returning back to pipeline.py, which then runs vocoder.decode_to_file using what it obtained from generate_atoks
As an amateur this took me hours to understand, so any help would be much appreciated since I'd like to contribute more efficiently!
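The four-step flow above can be restated as a runnable sketch. The stub classes are hypothetical stand-ins so the flow can execute; they are not the real WhisperSpeech t2s/s2a/vocoder models, and the method signatures are copied from the thread's summary rather than verified against the source:

```python
class _StubModel:
    """Hypothetical stand-in for the t2s and s2a models."""
    def __init__(self, output):
        self.output = output
        self.calls = []
    def generate(self, *args, **kwargs):
        self.calls.append((args, kwargs))
        return self.output

class _StubVocoder:
    """Hypothetical stand-in for the vocoder."""
    def __init__(self):
        self.calls = []
    def decode_to_file(self, fname, atoks):
        self.calls.append((fname, atoks))

class Pipeline:
    def __init__(self):
        self.t2s = _StubModel("semantic-tokens")   # text -> semantic tokens
        self.s2a = _StubModel("acoustic-tokens")   # semantic -> acoustic tokens
        self.vocoder = _StubVocoder()

    def generate_atoks(self, text, speaker=None, lang='en', cps=15, step_callback=None):
        stoks = self.t2s.generate(text, cps=cps, lang=lang)  # steps 1-2: t2s.generate
        return self.s2a.generate(stoks, speaker)             # step 3: s2a.generate

    def generate_to_file(self, fname, text, speaker=None, lang='en', cps=15, step_callback=None):
        # step 4: back in pipeline.py, hand the acoustic tokens to the vocoder
        self.vocoder.decode_to_file(fname, self.generate_atoks(text, speaker, lang=lang, cps=cps))
```

Under this reading, lang= only matters on the t2s (text-to-semantic) leg; everything downstream just consumes tokens.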
@jpc Just to give you an idea, I didn't know what the word "python" even meant until approximately 9 months ago. ;-)
Instead of having to pass the language identifiers (e.g. de or pl), perhaps auto-detect the language in a multilingual text string. A library like langdetect might be used. Example:
The challenge would be implementing language detection within a single text string, since langdetect is geared toward detecting the "predominant" language of a string of text... But assuming we could parse it intelligently (or there's a better library), it would remove the need to pass language identifiers to the methods in the WhisperSpeech library...