davidsvson opened this issue 7 years ago
I'm not planning to support languages other than English any time soon. The problem isn't so much the animation, but the voice recognition before it.
Let me know how your game goes -- I'm looking forward to seeing Rhubarb in action!
I have been looking at Papagayo, and someone made an add-on with support for 10 more languages (released under the GPL): http://www.lostmarble.com/forum/viewtopic.php?t=5056 Would that be usable? (I have no idea, either technically or license-wise.) You can download the different language breakdowns here: http://users.monash.edu.au/~myless/catnap/misc/new-languages-3.zip
Thank you for a great tool anyhow!
Papagayo and Rhubarb work a bit differently. If I remember correctly, Papagayo requires a perfect dialog transcript. It converts the dialog into phones (that's what the plugin you linked does), then leaves it to the user to align the phones with the recording. After that, I believe Papagayo does a simple mapping to convert the phones into mouth shapes.
For Rhubarb, on the other hand, converting the dialog text into phones is merely the first step. (This is where your plugin might help.)
After that, Rhubarb performs actual voice recognition (optionally guided by the dialog text) and automatic alignment. These are the problematic steps, as they require acoustic and language models describing the target language. Such models take months or even years to create. For more information, see this link on training acoustic models.
Finally, Rhubarb applies a pipeline of transformation steps to convert the timed phones into animation. These steps are rather language-independent.
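As a rough sketch of that final mapping step (the shape names and the mapping below are my own illustration, not Rhubarb's actual animation rules, which are context-dependent):

```cpp
#include <map>
#include <string>

// A hypothetical, language-independent lookup from ARPAbet phones to
// Preston Blair-style mouth shapes. Rhubarb's real rules are more
// sophisticated; this only illustrates the final mapping idea.
std::string mouthShapeFor(const std::string& phone) {
    static const std::map<std::string, std::string> shapes = {
        {"P", "A"}, {"B", "A"}, {"M", "A"},   // closed lips
        {"F", "G"}, {"V", "G"},               // upper teeth on lower lip
        {"IY", "B"}, {"EH", "C"}, {"AA", "D"} // progressively open vowels
    };
    auto it = shapes.find(phone);
    return it != shapes.end() ? it->second : "B"; // fallback shape
}
```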
Thank you for your very well explained answer!
We can add languages like Swedish if you are interested in it.
Yes I'm very interested in that!
Hi Nickolay, nice to see you here! :-)
I'll think about what it would take to make Rhubarb multi-lingual. That may take some time, though; I'm currently rather busy making last-minute improvements for Thimbleweed Park.
I spent some time reflecting on what it would take to support other languages in the same quality as English. Here's what I came up with:
Acoustic and language models
There are Sphinx models for a variety of languages, including German, French, and Italian. They may pose some problems, though:
Some of the models are under the GPL. Right now, all parts of Rhubarb (including the English model) are under more permissive, non-copyleft licenses. Shipping Rhubarb with these models would effectively put Rhubarb under the GPL. I don't want that.
I have no idea about the quality of these models. Using the US-English model, I'm getting about 75% accuracy. I would like to get at least similar accuracy for other languages.
Model versioning
Right now, the English model (about 80MB including dictionary) resides in the main Git repo. That is already bloating the repo; a fresh clone takes quite some time. Before adding more models, I'd like to find a way to version them separately. One approach might be Git LFS.
Another idea would be to create separate Git repos for the individual models. This way, each of these repos could also contain language-specific code (G2P etc.). However, that would require some sort of plugin architecture for languages, which isn't trivial in C++. (Then again, it would allow me to use GPL models.)
Text normalization
We need a way to split dialog text into normalized words. So, for instance, "I paid $400" should become "i paid four hundred dollars". Right now, I'm using Flite, but this only works for English text. For other languages, we'd need a different library. I already did some searching, but none of the ones I found looked like a good fit. Either they weren't written in portable C/C++, or the license was copyleft.
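To illustrate the kind of expansion needed, here's a toy sketch covering only whole hundreds of dollars; a real normalizer (like Flite's English text processing) also handles dates, ordinals, abbreviations, and much more:

```cpp
#include <string>
#include <vector>

// Minimal sketch: expand a dollar amount like 400 ("$400") into words.
// Only whole hundreds and single digits are covered; everything else
// falls through as digits, which a real normalizer would never do.
std::string normalizeDollars(int amount) {
    static const std::vector<std::string> ones = {
        "zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"
    };
    std::string words;
    if (amount >= 100 && amount < 1000 && amount % 100 == 0) {
        words = ones[amount / 100] + " hundred";
    } else if (amount >= 0 && amount < 10) {
        words = ones[amount];
    } else {
        words = std::to_string(amount); // unhandled case
    }
    return words + " dollars";
}
```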
Alternatively, I could write the normalization code myself. That shouldn't be too hard, but it would probably be quite time-consuming. Plus, I only know German and French well enough to do that.
Grapheme-to-phoneme (G2P)
For dialog words that aren't in the dictionary, we need a way to guess their pronunciation. Currently, I'm using a heuristic containing more than 200 transformation rules that were written specifically for American English. I was lucky that someone (Mark Rosenfelder) had already spent a lot of time creating these rules. That's not something I'd like to do for other languages.
Incidentally, this is exactly what the Papagayo add-on you mentioned does. But it's written in Python and licensed under the GPL. Plus, it only supports Arpabet phones (see the next point).
As with normalization, there are G2P libraries out there, but the ones I found didn't fit -- either for technical reasons or because of their license. Same goes for using deep learning libraries to do G2P.
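To give an idea of what rule-based G2P looks like: ordered grapheme-to-phone substitutions, longest match first. The handful of rules below are invented for illustration and vastly simpler than the real rule set:

```cpp
#include <string>
#include <utility>
#include <vector>

// Toy rule-based G2P: ordered grapheme → ARPAbet substitutions,
// applied left to right, longest match first. Real rule sets contain
// hundreds of context-sensitive rules; these are for illustration only.
std::string graphemesToPhones(const std::string& word) {
    static const std::vector<std::pair<std::string, std::string>> rules = {
        {"tion", "SH AH N"}, {"ch", "CH"}, {"a", "AE"},
        {"b", "B"}, {"c", "K"}, {"t", "T"}, {"i", "IH"},
        {"o", "AO"}, {"n", "N"}, {"s", "S"}, {"h", "HH"}
    };
    std::string result;
    size_t pos = 0;
    while (pos < word.size()) {
        bool matched = false;
        for (const auto& rule : rules) {
            if (word.compare(pos, rule.first.size(), rule.first) == 0) {
                if (!result.empty()) result += ' ';
                result += rule.second;
                pos += rule.first.size();
                matched = true;
                break;
            }
        }
        if (!matched) ++pos; // skip graphemes we have no rule for
    }
    return result;
}
```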
Phone set
Rhubarb works with phones internally. That poses a rather stupid problem: The standard notation for phones is the International Phonetic Alphabet (IPA). Unfortunately, not all C++ compilers support identifiers like aʊ or dʒ. (Specifically, GCC doesn't; Clang, Xcode, and Visual Studio do.) Until now, that hasn't been a problem, since I only needed the phones that occur in American English. And there is a transcription called Arpabet that covers exactly those phones, while using only ASCII characters. For instance, dʒ is written as JH.
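The Arpabet-to-IPA correspondence itself is just a lookup; a partial sketch (American English phones only) might look like this:

```cpp
#include <map>
#include <string>

// Partial Arpabet → IPA lookup, American English phones only.
// Unknown symbols are passed through unchanged.
std::string arpabetToIpa(const std::string& arpabet) {
    static const std::map<std::string, std::string> table = {
        {"JH", "dʒ"}, {"CH", "tʃ"}, {"AW", "aʊ"},
        {"TH", "θ"},  {"DH", "ð"},  {"NG", "ŋ"}
    };
    auto it = table.find(arpabet);
    return it != table.end() ? it->second : arpabet;
}
```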
Unfortunately, there doesn't seem to be a similar transcription system that covers all IPA phones. Which leaves me with a number of undesirable choices: for instance, using an identifier like voicedPalatoAlveolarAffricate instead of dʒ would turn my code into an unreadable mess.
Animation rules
Finally, I'd need to write animation rules for the new phones. At least this step shouldn't be much of a problem: Animation rules are fast and fun to write.
Bottom line: I might tackle these things at some point in the future. But it will be a lot of work, so I'm not making any promises. Right now, this isn't my focus.
Very interesting read, thanks for the summary
@DanielSWolf Thanks for explaining. I am planning to work with phones too, and I can clearly see how they may affect including other languages in the future. Is there any means through which I can contact you?
@saurabhshri: I just PM'ed you.
I did some more thinking on this topic and I've come up with a solution for multi-language support that should work well. However, I won't have time to implement this feature any time soon.
Here's a rough sketch: Each supported language is modeled as a plugin. A plugin is a directory (or archive?) that can be placed into the Rhubarb directory, where Rhubarb will find it. A new command-line option allows you to specify the language; the default is 'en' for English. The English language is modeled as a plugin just like all the others, but is included in the Rhubarb binary release so that it works out of the box. All other plugins have to be downloaded manually.
A plugin contains the following:
A downloaded plugin should work as-is on any platform. If the code it contains were written in C/C++, we would need a platform-specific compilation step. So we should use an embedded scripting language instead. I'm thinking of Lua: It's easy to compile and integrate, very small, and well-documented.
Each plugin is maintained in its own Git repo. A trivial build task converts it to an archive file for release.
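Independent of the scripting language chosen for the plugins themselves, the host side could expose each language behind a small C++ interface. All names here are hypothetical, not Rhubarb's actual API; a Lua-backed implementation would simply live behind this abstraction:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical host-side plugin interface. Each language plugin
// (whether scripted or native) would be wrapped in this abstraction.
class LanguagePlugin {
public:
    virtual ~LanguagePlugin() = default;
    // Split raw dialog text into normalized words.
    virtual std::vector<std::string> normalize(const std::string& text) const = 0;
    // Return the phonetic transcription of a single word.
    virtual std::string pronounce(const std::string& word) const = 0;
};

// A trivial stub standing in for a real English plugin.
class EnglishStub : public LanguagePlugin {
public:
    std::vector<std::string> normalize(const std::string& text) const override {
        std::vector<std::string> words;
        std::string current;
        for (char c : text) {
            if (c == ' ') {
                if (!current.empty()) words.push_back(current);
                current.clear();
            } else {
                current += static_cast<char>(
                    std::tolower(static_cast<unsigned char>(c)));
            }
        }
        if (!current.empty()) words.push_back(current);
        return words;
    }
    std::string pronounce(const std::string& word) const override {
        return word == "hello" ? "HH AH L OW" : "";
    }
};
```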
There are still a couple of rough edges:
This approach should cover all the problems I mentioned above:
Sounds nice! Actually, I think eSpeak is a nice project to look at. It is not state of the art in synthesis, but its language support for preprocessing and G2P is very good.
I'll have a look at it. Thank you!
Hi @DanielSWolf, I work at a French animation studio and would be very much interested in a Rhubarb module for the French language. Has there been any progress on the matter of multi-language support? I could contribute somewhat, although I have no C++ or computational linguistics skills. But if the basic plugin structure you've outlined is sufficiently advanced, I could at least contribute to a French language plugin.
@PiOverFour Thanks for offering your help! I am still planning to support additional languages in the future. In fact, I feel that this is one of the key missing features right now.
However, I'm currently re-thinking the technical implementation. Instead of relying solely on PocketSphinx, I'm thinking about adding support for other voice recognition services, such as the cloud services offered by Google, Microsoft, and IBM. They are not free, but they offer higher recognition rates than PocketSphinx for a large number of languages.
It will still take a long time until I've finished the technical basis. When that time has come, I'll certainly get back to you if I need input from a native French speaker!
Oh, that's cool to hear! Looking forward to it.
@DanielSWolf Look at X-SAMPA; it's a phonetic alphabet that fits in 7-bit ASCII.
Thanks, but that doesn't really solve my problem. My problem is not with Unicode per se, but with Unicode identifiers.
Right now, I have an enum that looks like this:
enum class Phone { AO, AA, IY, ... }
It covers the basic US-English ARPAbet phonemes. In order to support multiple languages, I will have to represent the full IPA set in a similar fashion. Ideally, I'd like to do this:
enum class Phone { ɸ, ɳ, ʔ, ... }
The C++11 standard will let me do this, since these Unicode characters are valid within identifiers. But GCC won't (see above). Using X-SAMPA isn't an option; this just isn't valid C++:
enum class Phone { p\, n`, ?, ... }
But as I wrote above, I can easily circumvent the problem altogether by representing phonemes as strings, not enum values. This way, I can use the real IPA characters with any compiler.
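A minimal sketch of that string-based approach (the phone inventory here is just a sample): string literals may contain any Unicode character on every major compiler, so code can use the real IPA symbols as values even where it can't use them as identifiers.

```cpp
#include <string>
#include <unordered_set>

// Sketch: phones as UTF-8 string values rather than enum identifiers.
// This works on GCC, Clang, and MSVC alike, since the IPA characters
// appear only inside string literals.
bool isVowel(const std::string& phone) {
    static const std::unordered_set<std::string> vowels = {
        "aʊ", "iː", "ɛ", "ɑː", "eɪ"
    };
    return vowels.count(phone) > 0;
}
```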
The latest versions of PocketSphinx are capable of outputting phonemes instead of words: https://cmusphinx.github.io/wiki/phonemerecognition/ Would that simplify the process and allow eliminating the task of word-to-phoneme conversion? (As I understood from https://github.com/DanielSWolf/rhubarb-lip-sync/issues/5#issuecomment-272287356, this is what Rhubarb currently does.)
Funny that you mention that!
Phonetic recognition is not a new feature in PocketSphinx. In fact, the very first version of Rhubarb used it. The problem was that the error rate with this model is rather high. I discovered that the error rate dropped significantly when recognizing words instead of phones.
This, of course, only applies to English dialog. So right now, I'm in the process of adding optional phonetic recognition back into Rhubarb. This should give better results for languages other than English. For details, see this thread, starting at the linked comment.
This is only a temporary solution. In the long run, I still plan to implement full (word-based) recognition for languages other than English.
@DanielSWolf Woah, that's great! Can I have access to the source of the customized version of Rhubarb mentioned here? https://forums.thimbleweedpark.com/t/thimbleweed-park-italian-fan-dub-project-official-thread-tm/2102/361 I am on Linux, and building from source is not a problem for me. Also, this way I can try to tweak the Rhubarb sources. My goal is to get Russian lip sync for my project. Thank you!
I'll push a branch as soon as I get a chance. Might be a few days, though.
Great! Looking forward to it. ^__^
I've created a new issue (#45) for phonetic recognition so that this issue can focus on true multi-language support.
We need lip sync in Russian and Chinese, we make our own TTS, so we don't need to recognize audio. Is it possible to do this with the current version of the software?
Out of the box, Rhubarb only comes with two recognition modes. If you are prepared to make source code changes, you could implement your own Recognizer that feeds from your TTS data. All you need is a list of ARPAbet phones along with their timings.
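A rough sketch of what such a TTS-fed recognizer could look like. Note that these type and class names are hypothetical; Rhubarb's actual Recognizer interface differs, and you'd need to adapt this to the real source:

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical structure: one phone with its timing in seconds.
struct TimedPhone {
    std::string phone;  // ARPAbet symbol, e.g. "AH"
    double start, end;  // seconds
};

// Instead of running voice recognition on audio, this "recognizer"
// simply returns the phone timings the TTS engine already produced.
class TtsRecognizer {
public:
    explicit TtsRecognizer(std::vector<TimedPhone> phones)
        : phones_(std::move(phones)) {}

    std::vector<TimedPhone> recognize() const { return phones_; }

private:
    std::vector<TimedPhone> phones_;
};
```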
Hi Daniel! Thank you for your effort and contribution in creating such a great program. I have no language issue, but my question is somewhat related. I was just wondering if it's possible to create lip sync from plain text, with no audio, for a text-based game.
Hi @kapamees! The short answer is: No, out of the box, Rhubarb cannot animate without a sound file. Modifying it to work on dialog alone would require moderate programming skills in C++.
If you're interested in making these modifications yourself and need some guidance, feel free to create a new issue.
@DanielSWolf Thank you, no worries! I'm no good at it ;)
Wow this looks great! Just wondering about language support. You have any plans to support other languages than English?
I'm planning to do lip sync for my game in Unity that has animations done in Spine (Esoteric Software).