DanielSWolf / rhubarb-lip-sync

Rhubarb Lip Sync is a command-line tool that automatically creates 2D mouth animation from voice recordings. You can use it for characters in computer games, in animated cartoons, or in any other project that requires animating mouths based on existing recordings.

Plans for Rhubarb Lip Sync 2 #95

Open DanielSWolf opened 3 years ago

DanielSWolf commented 3 years ago

This issue is a collection of ideas and decisions regarding Rhubarb Lip Sync 2.

Full rewrite

Rhubarb 2 will be a full rewrite rather than a series of iterative improvements over version 1.x. This is necessary because it will use a completely different technology stack (see Programming languages and Build tool).

I'm currently working on a proof of concept to make sure that the basic ideas work out. Once that's done, I'll start working towards an MVP version of Rhubarb 2. This version will not contain all features discussed here, nor even all features currently found in version 1.x. My idea is to have versions 1.x and 2.x coexist for some time, during which I'll add new features to the 2.x versions, while only fixing major bugs in the 1.x versions. Once we've reached feature parity, I'll deprecate the 1.x versions.

Multiple languages

Rhubarb 1.x only supports English dialog. Support for additional languages has often been requested, but due to a number of technical limitations, adding it to Rhubarb 1.x would require a rewrite of most of its code.

The architecture of Rhubarb 2 will be language-agnostic from the start. This means that adding more languages should be possible at any time with minimal code changes.
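As a sketch of what such a language-agnostic design could look like (hypothetical names, not the actual Rhubarb 2 architecture): everything specific to one language sits behind a single interface, while alignment and animation stay shared.

```rust
/// A phoneme with a language-independent identifier.
#[derive(Debug, Clone, PartialEq)]
pub struct Phoneme(pub String);

/// Everything specific to one language lives behind this trait;
/// adding a language means adding one implementation.
pub trait Language {
    /// BCP-47 tag such as "en" or "de".
    fn tag(&self) -> &str;
    /// Convert dialog text into the phoneme sequence to align.
    fn to_phonemes(&self, text: &str) -> Vec<Phoneme>;
}

/// A toy English implementation: one "phoneme" per letter,
/// standing in for a real grapheme-to-phoneme step.
pub struct English;

impl Language for English {
    fn tag(&self) -> &str { "en" }
    fn to_phonemes(&self, text: &str) -> Vec<Phoneme> {
        text.chars()
            .filter(|c| c.is_ascii_alphabetic())
            .map(|c| Phoneme(c.to_ascii_lowercase().to_string()))
            .collect()
    }
}

fn main() {
    let lang = English;
    println!("{} -> {:?}", lang.tag(), lang.to_phonemes("Hi!"));
}
```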

Graphical user interface

In addition to the CLI, Rhubarb 2 will have a GUI with the following features:

This should satisfy the following use cases:

Exact dialog through forced alignment

Rhubarb 1.x allows the user to specify the dialog text of a recording. However, this text is merely used to guide the speech recognition step. Due to limitations in the speech recognition engine, Rhubarb often recognizes incorrect words even if the correct words were specified.

Rhubarb 2 will allow the user to specify exact dialog that is aligned with the recording without an additional recognition step. This should have the following advantages:

Mouth shapes

Rhubarb 1.x supports 6 basic mouth shapes and up to 3 extended mouth shapes, all of which are pre-defined. Rhubarb 2 will still rely on pre-defined mouth shapes and will use the same 6 basic mouth shapes. However, I'm planning to increase the number of supported extended mouth shapes. This will allow for smoother lip sync animation if desired.
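For illustration, the 1.x inventory (basic shapes A–F, extended shapes G, H, X) can be sketched as an enum with fallbacks to basic shapes when the extended ones are disabled. The specific fallback substitutions shown here are illustrative, not taken from the 1.x source.

```rust
// Rhubarb 1.x's documented shape inventory: six basic shapes (A-F)
// that are always used, plus up to three optional extended shapes.
#[derive(Debug, Clone, Copy, PartialEq)]
enum MouthShape {
    A, B, C, D, E, F, // basic
    G, H, X,          // extended (optional)
}

impl MouthShape {
    /// Substitute a basic shape when an extended shape is disabled.
    /// The choices below are illustrative.
    fn basic_fallback(self) -> MouthShape {
        match self {
            MouthShape::G => MouthShape::B, // F/V -> slightly open
            MouthShape::H => MouthShape::C, // L -> open
            MouthShape::X => MouthShape::A, // rest -> closed
            other => other,
        }
    }
}

fn main() {
    println!("{:?}", MouthShape::X.basic_fallback());
}
```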

Currently, mouth shapes are named using single letters. This is based on the tradition of hand-written exposure sheets, but may be unnecessarily cryptic in its digital form. I'm thinking about adopting a more intuitive naming scheme, similar to the visemes used by Amazon Polly. Desirable features for this naming scheme:
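Purely for illustration, a more descriptive scheme might map the current letters to readable identifiers. These names are hypothetical, not a decided Rhubarb 2 scheme.

```rust
// Hypothetical descriptive names for the six basic shapes plus rest,
// loosely inspired by Polly-style viseme naming.
fn descriptive_name(letter: char) -> Option<&'static str> {
    Some(match letter {
        'A' => "closed",        // P, B, M
        'B' => "slightly-open",
        'C' => "open",
        'D' => "wide-open",
        'E' => "rounded",
        'F' => "puckered",      // U, W
        'X' => "rest",
        _ => return None,
    })
}

fn main() {
    println!("{:?}", descriptive_name('A'));
}
```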

Eventual support for keyframe (3D) animation

Rhubarb 1.x only supports limited animation (also known as replacement animation), which holds each mouth shape until it is replaced by the next one. This approach is a good fit for most 2D animation, but is ill-suited for 3D animation or mesh-based 2D animation.
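The hold-until-replaced semantics can be sketched in a few lines (hypothetical types, not Rhubarb's actual API): the shape visible at any time is simply the last cue whose start time has passed.

```rust
// Minimal sketch of replacement animation: each cue holds its mouth
// shape until the next cue starts. Times are in seconds; shapes use
// Rhubarb's letter names.
struct MouthCue {
    start: f64,
    shape: char,
}

/// Returns the shape visible at time `t`, or None before the first cue.
/// Assumes `cues` is sorted by `start`.
fn shape_at(cues: &[MouthCue], t: f64) -> Option<char> {
    cues.iter()
        .rev()
        .find(|cue| cue.start <= t)
        .map(|cue| cue.shape)
}

fn main() {
    let cues = vec![
        MouthCue { start: 0.0, shape: 'X' },
        MouthCue { start: 0.2, shape: 'B' },
        MouthCue { start: 0.5, shape: 'F' },
    ];
    // 'B' is held from 0.2 until it is replaced at 0.5.
    println!("{:?}", shape_at(&cues, 0.3));
}
```

A keyframe-based export would instead emit interpolation targets per shape channel, rather than a single held value.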

Rhubarb 2 will be designed in a way that allows keyframe-based export to be added at a later time. However, this feature has low priority and won't be included in the first versions.

CLI changes

Here is an incomplete list of probable changes to Rhubarb's CLI:

Programming languages

Rhubarb 1.x is written in C++. While this language is both powerful and efficient, it has a number of severe shortcomings. Most notably, it requires a lot of boilerplate code and it's very easy to make unnoticed mistakes such as overriding the wrong special member functions or using the wrong mechanism for passing arguments.

After a lot of research, I've decided to use Kotlin as the main programming language, with some C++ code for performance-critical operations and third-party libraries. Below is a feature matrix I created for the three hottest contenders, Kotlin, Go, and Rust. Empty cells indicate that I didn't investigate an aspect for the given language.

tl;dr: Kotlin has all the features I was looking for. Rust was a very strong contender, but I wanted a modern, React-style UI framework for the GUI, which excluded Rust. Go looked promising at the start, but revealed numerous weaknesses on closer inspection.

Edit:

After additional research, I've decided to go with Rust as a programming language. The main argument against Rust was the lack of good GUI frameworks, which now exist. On the whole, Rust feels much more natural for the kind of program I'm writing:

| Feature | Kotlin | Go | Rust |
|---|---|---|---|
| **Community** | | | |
| Language is actively developed | | | ✔¹ |
| Large, growing number of packages | ✔² | ? | |
| **Language** | | | |
| Intuitive syntax | | | |
| Type safety | | | |
| Null safety | | | |
| Concise syntax | (✔) | | |
| Concise lambdas | | | |
| Smart casts | | | |
| Support for immutability | | | |
| Generics | | | |
| High-performance | | | |
| Easy and efficient interop with C/C++ | ✔ | (✔) | ✔³ |
| **Libraries** | | | |
| Support for multi-threading | | | |
| Low-level number arrays of different types | | | |
| Chaining map/reduce | | | |
| Modern GUI library | ✔⁴ | | ✔⁵ |
| **Build system** | | | |
| Self-bootstrapping | | | |
| Integrates native code | (✔) | | |
| Integrates external Git repos | | | |
| Integrates NPM | | | |
| **Ecosystem** | | | |
| Robust package management | (✔)⁶ | | |
| Full-featured IDE | | | ? |
| Targets Windows, macOS, and Linux | | | |
| No VM or VM can be bundled | | | |

¹ In 2020, Mozilla laid off most of the Rust team (see Wikipedia). Since then, the Rust Foundation has been founded, and all major IT companies have joined.

² Including Java packages.

³ Using CXX

⁴ Using JetBrains' brand-new Compose for Desktop

⁵ The overview site Are we GUI yet is sadly outdated. There are, in fact, several viable options; I'm currently leaning towards egui.

⁶ Doesn't seem to be as robust as NPM.

Build tool

I've chosen Gradle as the build tool for Rhubarb. It fulfils the following requirements:

Speech processing

bilck commented 3 years ago

@DanielSWolf Any plans for Unity 3D support? I could help with that.

DanielSWolf commented 3 years ago

Once Rhubarb supports keyframe animation, I'd love to add plugins for various 3D packages, similar to the way it currently supports 2D tools. When that time comes, I'll be happy about any support. Given how little free time I have, however, it will probably take me several years to get there.

bilck commented 3 years ago

For Unity 3D, having some abstract Timeline support with the right phoneme to play (and weights) would be more than enough for most devs to implement according to their own needs.

We, for example, use Unity Spine SDK combined with skin composition for visemes and expressions.

madhephaestus commented 1 year ago

As a Java developer, I am very excited for a JVM-compatible version.

The Vosk stack might be a good substitute for Sphinx. I am using it in Java right now and am getting very good results. There is even a PR adding phoneme labels and timestamps to the data stream. https://github.com/alphacep/vosk-api/pull/528

I have recently written a Java wrapper around the published executables. I am building real-time robotic interaction software, using Rhubarb for real-time TTS -> audio -> Rhubarb -> synced animation + audio.

I came here to ask for live updates of any visemes detected, as they are found. If I had live updates, I could much more tightly synchronize the initiation of speech and the execution of that speech.

Moving forward, would it be possible to consider whether live updates are a feature worth adding?

madhephaestus commented 1 year ago

In case other Java developers get this far and are disappointed that Rhubarb 2 will apparently be in Rust: I was able to duplicate the functionality of the basic Rhubarb viseme generation using Vosk. I have a small stand-alone example for you! https://github.com/madhephaestus/TextToSpeechASDRTest.git

I was able to use the partial results with the word timings to calculate the timing of the phonemes (after looking up the phonemes in a phoneme dictionary). I then down-mapped the phonemes to visemes and stored the visemes in a list with timestamps. The timestamped visemes process in a static 200 ms, and then the audio can begin playing with the mouth movements synchronized precisely with the phoneme start times, precomputed ahead of time. Compare this to Rhubarb, which takes as long to run as the audio file is long.

This is a complete implementation for my uses, so if anyone else needs lip-syncing in Java, have a look at that example.
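A rough sketch of the pipeline described above, with hypothetical names: take word timings (as a recognizer like Vosk reports them), look each word up in a phoneme dictionary, spread the phonemes evenly across the word's duration, and down-map each phoneme to a viseme.

```rust
use std::collections::HashMap;

// Illustrative types; not taken from any real API.
struct TimedViseme {
    start: f64,
    viseme: &'static str,
}

/// `words` holds (word, start, end) tuples from the recognizer;
/// `dict` maps each word to its phoneme sequence;
/// `to_viseme` down-maps a phoneme to a viseme.
fn visemes_for(
    words: &[(&str, f64, f64)],
    dict: &HashMap<&str, Vec<&'static str>>,
    to_viseme: impl Fn(&str) -> &'static str,
) -> Vec<TimedViseme> {
    let mut out = Vec::new();
    for &(word, start, end) in words {
        if let Some(phonemes) = dict.get(word) {
            // Distribute the phonemes evenly over the word's duration.
            let step = (end - start) / phonemes.len() as f64;
            for (i, p) in phonemes.iter().enumerate() {
                out.push(TimedViseme {
                    start: start + i as f64 * step,
                    viseme: to_viseme(p),
                });
            }
        }
    }
    out
}
```

The even per-word split is an approximation; per-phoneme timestamps (as in the Vosk PR linked above) would make the timing exact.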

DanielSWolf commented 1 year ago

Sorry for the late reply. Rhubarb simply isn't designed for real-time applications; see e.g. #22. I'm glad you found a working solution though!

jason-shen commented 4 months ago

Just came across this; it's really cool, nice work. I think if version 2 is going to be in Rust, making this into a real-time lib shouldn't be very hard: instead of having it write to a file, if it just returns the JSON, you can go a very long way with this. Just wondering, is there an ETA for v2?
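For illustration, returning Rhubarb-1.x-style JSON (a mouthCues array of start/end/value entries) from memory instead of writing a file might look like this sketch. The types are hypothetical, and the serialization is hand-rolled to stay dependency-free; a real implementation would use a library such as serde.

```rust
// Illustrative in-memory equivalent of Rhubarb 1.x's JSON export.
struct MouthCue {
    start: f64,
    end: f64,
    value: char,
}

/// Builds a Rhubarb-style "mouthCues" JSON document as a string.
fn to_json(cues: &[MouthCue]) -> String {
    let entries: Vec<String> = cues
        .iter()
        .map(|c| format!(
            r#"{{ "start": {:.2}, "end": {:.2}, "value": "{}" }}"#,
            c.start, c.end, c.value
        ))
        .collect();
    format!(r#"{{ "mouthCues": [ {} ] }}"#, entries.join(", "))
}

fn main() {
    let cues = vec![MouthCue { start: 0.0, end: 0.25, value: 'A' }];
    println!("{}", to_json(&cues));
}
```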

kunibald413 commented 1 month ago

Also looking forward to v2, this is a very useful tool already. If it can do streaming/chunking of some sort, that would be totally amazing! The detailed documentation is very much appreciated.