davidsvson opened this issue 7 years ago
I'm not planning to support languages other than English any time soon. The problem isn't so much the animation, but the voice recognition before it.
Let me know how your game goes -- I'm looking forward to seeing Rhubarb in action!
I have been looking at Papagayo, and someone made an add-on with support for 10 more languages (released under the GPL): http://www.lostmarble.com/forum/viewtopic.php?t=5056 Would that be usable? (I have no idea, either technically or license-wise.) You can download the different language breakdowns here: http://users.monash.edu.au/~myless/catnap/misc/new-languages-3.zip
Thank you for a great tool anyhow!
Papagayo and Rhubarb work a bit differently. If I remember correctly, Papagayo requires a perfect dialog transcript. It converts the dialog into phones (that's what the plugin you linked does), then leaves it to the user to align the phones with the recording. After that, I believe Papagayo does a simple mapping to convert the phones into mouth shapes.
For Rhubarb, on the other hand, converting the dialog text into phones is merely the first step. (This is where your plugin might help.)
After that, Rhubarb performs actual voice recognition (optionally guided by the dialog text) and automatic alignment. These are the problematic steps, as they require acoustic and language models describing the target language. Such models take months or even years to create. For more information, see this link on training acoustic models.
Finally, Rhubarb applies a pipeline of transformation steps to convert the timed phones into animation. These steps are rather language-independent.
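As a rough sketch of that final mapping step (the shape names and the mapping below are my own illustration, not Rhubarb's actual animation rules, which are context-dependent):

```cpp
#include <map>
#include <string>

// A hypothetical, language-independent lookup from ARPAbet phones to
// Preston Blair-style mouth shapes. Rhubarb's real rules are more
// sophisticated; this only illustrates the final mapping idea.
std::string mouthShapeFor(const std::string& phone) {
    static const std::map<std::string, std::string> shapes = {
        {"P", "A"}, {"B", "A"}, {"M", "A"},   // closed lips
        {"F", "G"}, {"V", "G"},               // upper teeth on lower lip
        {"IY", "B"}, {"EH", "C"}, {"AA", "D"} // progressively open vowels
    };
    auto it = shapes.find(phone);
    return it != shapes.end() ? it->second : "B"; // fallback shape
}
```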
Thank you for your very well explained answer!
We can add languages like Swedish if you are interested in it.
Yes I'm very interested in that!
Hi Nickolay, nice to see you here! :-)
I'll think about what it would take to make Rhubarb multi-lingual. That may take some time, though; I'm currently rather busy making last-minute improvements for Thimbleweed Park.
I spent some time reflecting on what it would take to support other languages in the same quality as English. Here's what I came up with:
Acoustic and language models
There are Sphinx models for a variety of languages, including German, French, and Italian. They may pose some problems, though:
Some of the models are under the GPL. Right now, all parts of Rhubarb (including the English model) are under more permissive, non-copyleft licenses. Shipping Rhubarb with these models would effectively put Rhubarb under the GPL. I don't want that.
I have no idea about the quality of these models. Using the US-English model, I'm getting about 75% accuracy. I would like to get at least similar accuracy for other languages.
Model versioning
Right now, the English model (about 80MB including dictionary) resides in the main Git repo. That is already bloating the repo; a fresh clone takes quite some time. Before adding more models, I'd like to find a way to version them separately. One approach might be Git LFS.
Another idea would be to create separate Git repos for the individual models. This way, each of these repos could also contain language-specific code (G2P etc.). However, that would require some sort of plugin architecture for languages, which isn't trivial in C++. (Then again, it would allow me to use GPL models.)
Text normalization
We need a way to split dialog text into normalized words. So, for instance, "I paid $400" should become "i paid four hundred dollars". Right now, I'm using Flite, but this only works for English text. For other languages, we'd need a different library. I already did some searching, but none of the ones I found looked like a good fit. Either they weren't written in portable C/C++, or the license was copyleft.
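To illustrate the kind of expansion needed, here's a toy sketch covering only whole hundreds of dollars; a real normalizer (like Flite's English text processing) also handles dates, ordinals, abbreviations, and much more:

```cpp
#include <string>
#include <vector>

// Minimal sketch: expand a dollar amount like 400 ("$400") into words.
// Only whole hundreds and single digits are covered; everything else
// falls through as digits, which a real normalizer would never do.
std::string normalizeDollars(int amount) {
    static const std::vector<std::string> ones = {
        "zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"
    };
    std::string words;
    if (amount >= 100 && amount < 1000 && amount % 100 == 0) {
        words = ones[amount / 100] + " hundred";
    } else if (amount >= 0 && amount < 10) {
        words = ones[amount];
    } else {
        words = std::to_string(amount); // unhandled case
    }
    return words + " dollars";
}
```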
Alternatively, I could write the normalization code myself. That shouldn't be too hard, but it would probably be quite time-consuming. Plus, I only know German and French well enough to do that.
Grapheme-to-phoneme (G2P)
For dialog words that aren't in the dictionary, we need a way to guess their pronunciation. Currently, I'm using a heuristic containing more than 200 transformation rules that were written specifically for American English. I was lucky that someone (Mark Rosenfelder) had already spent a lot of time creating these rules. That's not something I'd like to do for other languages.
Incidentally, this is exactly what the Papagayo add-on you mentioned does. But it's written in Python and licensed under the GPL. Plus, it only supports Arpabet phones (see the next point).
As with normalization, there are G2P libraries out there, but the ones I found didn't fit -- either for technical reasons or because of their license. Same goes for using deep learning libraries to do G2P.
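To give an idea of what rule-based G2P looks like: ordered grapheme-to-phone substitutions, longest match first. The handful of rules below are invented for illustration and vastly simpler than the real rule set:

```cpp
#include <string>
#include <utility>
#include <vector>

// Toy rule-based G2P: ordered grapheme → ARPAbet substitutions,
// applied left to right, longest match first. Real rule sets contain
// hundreds of context-sensitive rules; these are for illustration only.
std::string graphemesToPhones(const std::string& word) {
    static const std::vector<std::pair<std::string, std::string>> rules = {
        {"tion", "SH AH N"}, {"ch", "CH"}, {"a", "AE"},
        {"b", "B"}, {"c", "K"}, {"t", "T"}, {"i", "IH"},
        {"o", "AO"}, {"n", "N"}, {"s", "S"}, {"h", "HH"}
    };
    std::string result;
    size_t pos = 0;
    while (pos < word.size()) {
        bool matched = false;
        for (const auto& rule : rules) {
            if (word.compare(pos, rule.first.size(), rule.first) == 0) {
                if (!result.empty()) result += ' ';
                result += rule.second;
                pos += rule.first.size();
                matched = true;
                break;
            }
        }
        if (!matched) ++pos; // skip graphemes we have no rule for
    }
    return result;
}
```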
Phone set
Rhubarb works with phones internally. That poses a rather stupid problem: The standard notation for phones is the International Phonetic Alphabet (IPA). Unfortunately, not all C++ compilers support identifiers like aʊ or dʒ. (Specifically, GCC doesn't; Clang, Xcode, and Visual Studio do.) Until now, that hasn't been a problem, since I only needed the phones that occur in American English. And there is a transcription called Arpabet that covers exactly those phones, while using only ASCII characters. For instance, dʒ is written as JH.
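The Arpabet-to-IPA correspondence itself is just a lookup; a partial sketch (American English phones only) might look like this:

```cpp
#include <map>
#include <string>

// Partial Arpabet → IPA lookup, American English phones only.
// Unknown symbols are passed through unchanged.
std::string arpabetToIpa(const std::string& arpabet) {
    static const std::map<std::string, std::string> table = {
        {"JH", "dʒ"}, {"CH", "tʃ"}, {"AW", "aʊ"},
        {"TH", "θ"},  {"DH", "ð"},  {"NG", "ŋ"}
    };
    auto it = table.find(arpabet);
    return it != table.end() ? it->second : arpabet;
}
```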
Unfortunately, there doesn't seem to be a similar transcription system that covers all IPA phones. Which leaves me with a number of undesirable choices: for instance, using an identifier like voicedPalatoAlveolarAffricate instead of dʒ would turn my code into an unreadable mess.
Animation rules
Finally, I'd need to write animation rules for the new phones. At least this step shouldn't be much of a problem: Animation rules are fast and fun to write.
Bottom line: I might tackle these things at some point in the future. But it will be a lot of work, so I'm not making any promises. Right now, this isn't my focus.
Very interesting read, thanks for the summary
@DanielSWolf Thanks for explaining. I am planning to work with phones too, and I can clearly see how they may affect including other languages in the future. Is there any means through which I can contact you?
@saurabhshri: I just PM'ed you.
I did some more thinking on this topic and I've come up with a solution for multi-language support that should work well. However, I won't have time to implement this feature any time soon.
Here's a rough sketch: Each supported language is modeled as a plugin. A plugin is a directory (or archive?) that can be placed into the Rhubarb directory, where Rhubarb will find it. A new command-line option allows you to specify the language; the default is 'en' for English. The English language is modeled as a plugin just like all the others, but is included in the Rhubarb binary release so that it works out of the box. All other plugins have to be downloaded manually.
A plugin contains the following:
A downloaded plugin should work as-is on any platform. If the code it contains were written in C/C++, we would need a platform-specific compilation step. So we should use an embedded scripting language instead. I'm thinking of Lua: It's easy to compile and integrate, very small, and well-documented.
Each plugin is maintained in its own Git repo. A trivial build task converts it to an archive file for release.
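Independent of the scripting language chosen for the plugins themselves, the host side could expose each language behind a small C++ interface. All names here are hypothetical, not Rhubarb's actual API; a Lua-backed implementation would simply live behind this abstraction:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical host-side plugin interface. Each language plugin
// (whether scripted or native) would be wrapped in this abstraction.
class LanguagePlugin {
public:
    virtual ~LanguagePlugin() = default;
    // Split raw dialog text into normalized words.
    virtual std::vector<std::string> normalize(const std::string& text) const = 0;
    // Return the phonetic transcription of a single word.
    virtual std::string pronounce(const std::string& word) const = 0;
};

// A trivial stub standing in for a real English plugin.
class EnglishStub : public LanguagePlugin {
public:
    std::vector<std::string> normalize(const std::string& text) const override {
        std::vector<std::string> words;
        std::string current;
        for (char c : text) {
            if (c == ' ') {
                if (!current.empty()) words.push_back(current);
                current.clear();
            } else {
                current += static_cast<char>(
                    std::tolower(static_cast<unsigned char>(c)));
            }
        }
        if (!current.empty()) words.push_back(current);
        return words;
    }
    std::string pronounce(const std::string& word) const override {
        return word == "hello" ? "HH AH L OW" : "";
    }
};
```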
There are still a couple of rough edges:
This approach should cover all the problems I mentioned above:
Sounds nice! Actually, I think eSpeak is a nice project to look at. It is not state of the art in synthesis, but its language support for preprocessing and G2P is very good.
I'll have a look at it. Thank you!
Hi @DanielSWolf, I work at a French animation studio and would be very much interested in a Rhubarb module for the French language. Has there been any progress on the matter of multi-language support? I could contribute somewhat, although I have no C++ or computational linguistics skills. But if the basic plugin structure you've outlined is sufficiently advanced, I could at least contribute to a French language plugin.
@PiOverFour Thanks for offering your help! I am still planning to support additional languages in the future. In fact, I feel that this is one of the key missing features right now.
However, I'm currently re-thinking the technical implementation. Instead of relying solely on PocketSphinx, I'm thinking about adding support for other voice recognition services, such as the cloud services offered by Google, Microsoft, and IBM. They are not free, but they offer higher recognition rates than PocketSphinx for a large number of languages.
It will still take a long time until I've finished the technical basis. When that time has come, I'll certainly get back to you if I need input from a native French speaker!
Oh, that's cool to hear! Looking forward to it.
@DanielSWolf Look at X-SAMPA; it's a phonetic alphabet that fits in 7-bit ASCII.
Thanks, but that doesn't really solve my problem. My problem is not with Unicode per se, but with Unicode identifiers.
Right now, I have an enum that looks like this:
enum class Phone { AO, AA, IY, ... }
It covers the basic US-English ARPAbet phonemes. In order to support multiple languages, I will have to represent the full IPA set in a similar fashion. Ideally, I'd like to do this:
enum class Phone { ɸ, ɳ, ʔ, ... }
The C++11 standard will let me do this, since these Unicode characters are valid within identifiers. But GCC won't (see above). Using X-SAMPA isn't an option; this just isn't valid C++:
enum class Phone { p\, n`, ?, ... }
But as I wrote above, I can easily circumvent the problem altogether by representing phonemes as strings, not enum values. This way, I can use the real IPA characters with any compiler.
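A minimal sketch of that string-based approach (the phone inventory here is just a sample): string literals may contain any Unicode character on every major compiler, so code can use the real IPA symbols as values even where it can't use them as identifiers.

```cpp
#include <string>
#include <unordered_set>

// Sketch: phones as UTF-8 string values rather than enum identifiers.
// This works on GCC, Clang, and MSVC alike, since the IPA characters
// appear only inside string literals.
bool isVowel(const std::string& phone) {
    static const std::unordered_set<std::string> vowels = {
        "aʊ", "iː", "ɛ", "ɑː", "eɪ"
    };
    return vowels.count(phone) > 0;
}
```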
The latest versions of PocketSphinx are capable of outputting phonemes instead of words: https://cmusphinx.github.io/wiki/phonemerecognition/ Would that simplify the process and allow eliminating the task of word-to-phoneme conversion? (As I understood from https://github.com/DanielSWolf/rhubarb-lip-sync/issues/5#issuecomment-272287356, this is what Rhubarb currently does.)
Funny that you mention that!
Phonetic recognition is not a new feature in PocketSphinx. In fact, the very first version of Rhubarb used it. The problem was that the error rate with this model is rather high. I discovered that the error rate dropped significantly when recognizing words instead of phones.
This, of course, only applies to English dialog. So right now, I'm in the process of adding optional phonetic recognition back into Rhubarb. This should give better results for languages other than English. For details, see this thread, starting at the linked comment.
This is only a temporary solution. In the long run, I still plan to implement full (word-based) recognition for languages other than English.
@DanielSWolf Woah, that's great! Can I have access to the source of the customized version of Rhubarb mentioned here? https://forums.thimbleweedpark.com/t/thimbleweed-park-italian-fan-dub-project-official-thread-tm/2102/361 I am on Linux, and building from source is not a problem for me. Also, this way I can try to tweak the Rhubarb sources. My goal is to get Russian lip sync for my project. Thank you!
I'll push a branch as soon as I get a chance. Might be a few days, though.
Great! Looking forward to it. ^__^
I've created a new issue (#45) for phonetic recognition so that this issue can focus on true multi-language support.
We need lip sync in Russian and Chinese, we make our own TTS, so we don't need to recognize audio. Is it possible to do this with the current version of the software?
Out of the box, Rhubarb only comes with two recognition modes. If you are prepared to make source code changes, you could implement your own Recognizer that feeds from your TTS data. All you need is a list of ARPAbet phones along with their timings.
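A rough sketch of what such a TTS-fed recognizer could look like. Note that these type and class names are hypothetical; Rhubarb's actual Recognizer interface differs, and you'd need to adapt this to the real source:

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical structure: one phone with its timing in seconds.
struct TimedPhone {
    std::string phone;  // ARPAbet symbol, e.g. "AH"
    double start, end;  // seconds
};

// Instead of running voice recognition on audio, this "recognizer"
// simply returns the phone timings the TTS engine already produced.
class TtsRecognizer {
public:
    explicit TtsRecognizer(std::vector<TimedPhone> phones)
        : phones_(std::move(phones)) {}

    std::vector<TimedPhone> recognize() const { return phones_; }

private:
    std::vector<TimedPhone> phones_;
};
```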
Hi Daniel! Thank you for your effort and contribution in creating such a great program. I have no language issue, but my question is somewhat related. I was just wondering if it's possible to create lip sync from plain text, with no audio, for a text-based game.
Hi @kapamees! The short answer is: No, out of the box, Rhubarb cannot animate without a sound file. Modifying it to work on dialog alone would require moderate programming skills in C++.
If you're interested in making these modifications yourself and need some guidance, feel free to create a new issue.
@DanielSWolf Thank you, no worries! I'm no good at it ;)
Wow this looks great! Just wondering about language support. You have any plans to support other languages than English?
I'm planning to do lip sync for my game in Unity that has animations done in Spine (Esoteric Software).