DanielSWolf / rhubarb-lip-sync

Rhubarb Lip Sync is a command-line tool that automatically creates 2D mouth animation from voice recordings. You can use it for characters in computer games, in animated cartoons, or in any other project that requires animating mouths based on existing recordings.

Plans for Rhubarb Lip Sync 2 #95

Open DanielSWolf opened 3 years ago

DanielSWolf commented 3 years ago

This issue is a collection of ideas and decisions regarding Rhubarb Lip Sync 2.

Full rewrite

Rhubarb 2 will be a full rewrite rather than a series of iterative improvements over version 1.x. This is necessary because it will use a completely different technology stack (see Programming languages and Build tool).

I'm currently working on a proof of concept to make sure that the basic ideas work out. Once that's done, I'll start working towards an MVP version of Rhubarb 2. This version will not contain all features discussed here, nor even all features currently found in version 1.x. My idea is to have versions 1.x and 2.x coexist for some time, during which I'll add new features to the 2.x versions, while only fixing major bugs in the 1.x versions. Once we've reached feature parity, I'll deprecate the 1.x versions.

Multiple languages

Rhubarb 1.x only supports English dialog. Support for additional languages has often been requested, but due to a number of technical limitations, adding it to Rhubarb 1.x would require a rewrite of most of its code.

The architecture of Rhubarb 2 will be language-agnostic from the start. This means that adding more languages should be possible at any time with minimal code changes.
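As a sketch of what such a language-agnostic design could look like (hypothetical names, not the actual Rhubarb 2 architecture): everything specific to one language sits behind a single interface, while alignment and animation stay shared.

```rust
/// A phoneme with a language-independent identifier.
#[derive(Debug, Clone, PartialEq)]
pub struct Phoneme(pub String);

/// Everything specific to one language lives behind this trait;
/// adding a language means adding one implementation.
pub trait Language {
    /// BCP-47 tag such as "en" or "de".
    fn tag(&self) -> &str;
    /// Convert dialog text into the phoneme sequence to align.
    fn to_phonemes(&self, text: &str) -> Vec<Phoneme>;
}

/// A toy English implementation: one "phoneme" per letter,
/// standing in for a real grapheme-to-phoneme step.
pub struct English;

impl Language for English {
    fn tag(&self) -> &str { "en" }
    fn to_phonemes(&self, text: &str) -> Vec<Phoneme> {
        text.chars()
            .filter(|c| c.is_ascii_alphabetic())
            .map(|c| Phoneme(c.to_ascii_lowercase().to_string()))
            .collect()
    }
}

fn main() {
    let lang = English;
    println!("{} -> {:?}", lang.tag(), lang.to_phonemes("Hi!"));
}
```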

Graphical user interface

In addition to the CLI, Rhubarb 2 will have a GUI with the following features:

This should satisfy the following use cases:

Exact dialog through forced alignment

Rhubarb 1.x allows the user to specify the dialog text of a recording. However, this text is merely used to guide the speech recognition step. Due to limitations in the speech recognition engine, Rhubarb often recognizes incorrect words even if the correct words were specified.

Rhubarb 2 will allow the user to specify exact dialog that is aligned with the recording without an additional recognition step. This should have the following advantages:

Mouth shapes

Rhubarb 1.x supports 6 basic mouth shapes and up to 3 extended mouth shapes, all of which are pre-defined. Rhubarb 2 will still rely on pre-defined mouth shapes and will use the same 6 basic mouth shapes. However, I'm planning to increase the number of supported extended mouth shapes. This will allow for smoother lip sync animation if desired.
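For illustration, the 1.x inventory (basic shapes A–F, extended shapes G, H, X) can be sketched as an enum with fallbacks to basic shapes when the extended ones are disabled. The specific fallback substitutions shown here are illustrative, not taken from the 1.x source.

```rust
// Rhubarb 1.x's documented shape inventory: six basic shapes (A-F)
// that are always used, plus up to three optional extended shapes.
#[derive(Debug, Clone, Copy, PartialEq)]
enum MouthShape {
    A, B, C, D, E, F, // basic
    G, H, X,          // extended (optional)
}

impl MouthShape {
    /// Substitute a basic shape when an extended shape is disabled.
    /// The choices below are illustrative.
    fn basic_fallback(self) -> MouthShape {
        match self {
            MouthShape::G => MouthShape::B, // F/V -> slightly open
            MouthShape::H => MouthShape::C, // L -> open
            MouthShape::X => MouthShape::A, // rest -> closed
            other => other,
        }
    }
}

fn main() {
    println!("{:?}", MouthShape::X.basic_fallback());
}
```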

Currently, mouth shapes are named using single letters. This is based on the tradition of hand-written exposure sheets, but may be unnecessarily cryptic in its digital form. I'm thinking about adopting a more intuitive naming scheme, similar to the visemes used by Amazon Polly. Desirable features for this naming scheme:
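Purely for illustration, a more descriptive scheme might map the current letters to readable identifiers. These names are hypothetical, not a decided Rhubarb 2 scheme.

```rust
// Hypothetical descriptive names for the six basic shapes plus rest,
// loosely inspired by Polly-style viseme naming.
fn descriptive_name(letter: char) -> Option<&'static str> {
    Some(match letter {
        'A' => "closed",        // P, B, M
        'B' => "slightly-open",
        'C' => "open",
        'D' => "wide-open",
        'E' => "rounded",
        'F' => "puckered",      // U, W
        'X' => "rest",
        _ => return None,
    })
}

fn main() {
    println!("{:?}", descriptive_name('A'));
}
```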

Eventual support for keyframe (3D) animation

Rhubarb 1.x only supports limited animation (also known as replacement animation), which holds each mouth shape until it is replaced by the next one. This approach is a good fit for most 2D animation, but is ill-suited for 3D animation or mesh-based 2D animation.
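The hold-until-replaced semantics can be sketched in a few lines (hypothetical types, not Rhubarb's actual API): the shape visible at any time is simply the last cue whose start time has passed.

```rust
// Minimal sketch of replacement animation: each cue holds its mouth
// shape until the next cue starts. Times are in seconds; shapes use
// Rhubarb's letter names.
struct MouthCue {
    start: f64,
    shape: char,
}

/// Returns the shape visible at time `t`, or None before the first cue.
/// Assumes `cues` is sorted by `start`.
fn shape_at(cues: &[MouthCue], t: f64) -> Option<char> {
    cues.iter()
        .rev()
        .find(|cue| cue.start <= t)
        .map(|cue| cue.shape)
}

fn main() {
    let cues = vec![
        MouthCue { start: 0.0, shape: 'X' },
        MouthCue { start: 0.2, shape: 'B' },
        MouthCue { start: 0.5, shape: 'F' },
    ];
    // 'B' is held from 0.2 until it is replaced at 0.5.
    println!("{:?}", shape_at(&cues, 0.3));
}
```

A keyframe-based export would instead emit interpolation targets per shape channel, rather than a single held value.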

Rhubarb 2 will be designed in a way that allows keyframe-based export to be added at a later time. However, this feature has low priority and won't be included in the first versions.

CLI changes

Here is an incomplete list of probable changes to Rhubarb's CLI:

Programming languages

Rhubarb 1.x is written in C++. While this language is both powerful and efficient, it has a number of severe shortcomings. Most notably, it requires a lot of boilerplate code and it's very easy to make unnoticed mistakes such as overriding the wrong special member functions or using the wrong mechanism for passing arguments.

After a lot of research, I've decided to use Kotlin as the main programming language, with some C++ code for performance-critical operations and third-party libraries. Below is a feature matrix I created for the three hottest contenders, Kotlin, Go, and Rust. Empty cells indicate that I didn't investigate an aspect for the given language.

tl;dr: Kotlin has all the features I was looking for. Rust was a very strong contender, but I wanted a modern, React-style UI framework for the GUI, which excluded Rust. Go looked promising at the start, but revealed numerous weaknesses on closer inspection.

Edit:

After additional research, I've decided to go with Rust as a programming language. The main argument against Rust was the lack of good GUI frameworks, which now exist. On the whole, Rust feels much more natural for the kind of program I'm writing:

| Feature | Kotlin | Go | Rust |
|---|---|---|---|
| **Community** | | | |
| Language is actively developed | | | ✔¹ |
| Large, growing number of packages | ✔² | ? | |
| **Language** | | | |
| Intuitive syntax | | | |
| Type safety | | | |
| Null safety | | | |
| Concise syntax | (✔) | | |
| Concise lambdas | | | |
| Smart casts | | | |
| Support for immutability | | | |
| Generics | | | |
| High-performance | | | |
| Easy and efficient interop with C/C++ | ✔ | (✔) | ✔³ |
| **Libraries** | | | |
| Support for multi-threading | | | |
| Low-level number arrays of different types | | | |
| Chaining map/reduce | | | |
| Modern GUI library | ✔⁴ | | ✔⁵ |
| **Build system** | | | |
| Self-bootstrapping | | | |
| Integrates native code | (✔) | | |
| Integrates external Git repos | | | |
| Integrates NPM | | | |
| **Ecosystem** | | | |
| Robust package management | (✔)⁶ | | |
| Full-featured IDE | | | ? |
| Targets Windows, macOS, and Linux | | | |
| No VM or VM can be bundled | | | |

¹ In 2020, Mozilla laid off most of the Rust team (see Wikipedia). Since then, the Rust Foundation has been founded, and all major IT companies have joined.

² Including Java packages.

³ Using CXX

⁴ Using JetBrains' brand-new Compose for Desktop

⁵ The overview site Are we GUI yet is sadly outdated. There are, in fact, several viable options; I'm currently leaning towards egui.

⁶ Doesn't seem to be as robust as NPM.

Build tool

I've chosen Gradle as the build tool for Rhubarb. It fulfils the following requirements:

Speech processing

bilck commented 3 years ago

@DanielSWolf Any plans for Unity 3D support? I could help with that.

DanielSWolf commented 3 years ago

Once Rhubarb supports keyframe animation, I'd love to add plugins for various 3D packages, similar to the way it currently supports 2D tools. When that time comes, I'll be happy about any support. Given how little free time I have, however, it will probably take me several years to get there.

bilck commented 3 years ago

For Unity 3D, having some abstract Timeline support with the right phoneme to play (and weights) would be more than enough for most devs to implement according to their own needs.

We, for example, use Unity Spine SDK combined with skin composition for visemes and expressions.

madhephaestus commented 1 year ago

As a Java developer, I am very excited for a JVM-compatible version.

The Vosk stack might be a good substitute for Sphinx. I am using it in Java right now and am getting very good results. There is even a PR adding phoneme labels and timestamps to the data stream. https://github.com/alphacep/vosk-api/pull/528

I have recently written a Java wrapper around the published executables. I am building real-time robotic interaction software, using Rhubarb for real-time TTS -> audio -> Rhubarb -> synced animation + audio.

I came here to ask for live updates of any visemes detected, as they are found. If I had live updates, I could much more tightly synchronize the initiation of speech and the execution of that speech.

Moving forward, would it be possible to consider whether live updates are a feature worth adding?

madhephaestus commented 1 year ago

In case other Java developers get this far and are disappointed that Rhubarb 2 will apparently be in Rust: I was able to duplicate the functionality of the basic Rhubarb viseme generation using Vosk. I have a small stand-alone example for you! https://github.com/madhephaestus/TextToSpeechASDRTest.git

I was able to use the partial results with the word timings to calculate the timing of the phonemes (after looking up the phonemes in a phoneme dictionary). I then down-mapped the phonemes to visemes and stored the visemes in a list with timestamps. The timestamped visemes process in a static 200 ms, and then the audio can begin playing with the mouth movements synchronized precisely with the phoneme start times, precomputed ahead of time. Compare this to Rhubarb, which takes as long to run as the audio file is long.

This is a complete implementation for my uses, so if anyone else needs lip-syncing in Java, have a look at that example.
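A rough sketch of the pipeline described above, with hypothetical names: take word timings (as a recognizer like Vosk reports them), look each word up in a phoneme dictionary, spread the phonemes evenly across the word's duration, and down-map each phoneme to a viseme.

```rust
use std::collections::HashMap;

// Illustrative types; not taken from any real API.
struct TimedViseme {
    start: f64,
    viseme: &'static str,
}

/// `words` holds (word, start, end) tuples from the recognizer;
/// `dict` maps each word to its phoneme sequence;
/// `to_viseme` down-maps a phoneme to a viseme.
fn visemes_for(
    words: &[(&str, f64, f64)],
    dict: &HashMap<&str, Vec<&'static str>>,
    to_viseme: impl Fn(&str) -> &'static str,
) -> Vec<TimedViseme> {
    let mut out = Vec::new();
    for &(word, start, end) in words {
        if let Some(phonemes) = dict.get(word) {
            // Distribute the phonemes evenly over the word's duration.
            let step = (end - start) / phonemes.len() as f64;
            for (i, p) in phonemes.iter().enumerate() {
                out.push(TimedViseme {
                    start: start + i as f64 * step,
                    viseme: to_viseme(p),
                });
            }
        }
    }
    out
}
```

The even per-word split is an approximation; per-phoneme timestamps (as in the Vosk PR linked above) would make the timing exact.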

DanielSWolf commented 1 year ago

Sorry for the late reply. Rhubarb simply isn't designed for real-time applications; see e.g. #22. I'm glad you found a working solution though!

jason-shen commented 4 months ago

Just came across this; it's really cool, nice work. I think if version 2 is going to be in Rust, making this into a real-time lib shouldn't be very hard: instead of having it write to a file, if it just returns the JSON, you can go a very long way with this. Just wondering, is there an ETA for v2?
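For illustration, returning Rhubarb-1.x-style JSON (a mouthCues array of start/end/value entries) from memory instead of writing a file might look like this sketch. The types are hypothetical, and the serialization is hand-rolled to stay dependency-free; a real implementation would use a library such as serde.

```rust
// Illustrative in-memory equivalent of Rhubarb 1.x's JSON export.
struct MouthCue {
    start: f64,
    end: f64,
    value: char,
}

/// Builds a Rhubarb-style "mouthCues" JSON document as a string.
fn to_json(cues: &[MouthCue]) -> String {
    let entries: Vec<String> = cues
        .iter()
        .map(|c| format!(
            r#"{{ "start": {:.2}, "end": {:.2}, "value": "{}" }}"#,
            c.start, c.end, c.value
        ))
        .collect();
    format!(r#"{{ "mouthCues": [ {} ] }}"#, entries.join(", "))
}

fn main() {
    let cues = vec![MouthCue { start: 0.0, end: 0.25, value: 'A' }];
    println!("{}", to_json(&cues));
}
```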

kunibald413 commented 1 month ago

Also looking forward to v2, this is a very useful tool already. If it can do streaming/chunking of some sort, that would be totally amazing! The detailed documentation is very much appreciated.