VOICEVOX / voicevox

An editor for VOICEVOX, free-to-use, medium-quality text-to-speech software
https://voicevox.hiroshiba.jp/

First English Voice Library #542

Open Patchethium opened 2 years ago

Patchethium commented 2 years ago


@Hiroshiba

According to #498, it might be time to take English into consideration.

I suggest we start with the LJSpeech corpus; since it's in the public domain, we can literally do anything with it. As for the frontend, ESPnet recommends g2p, so you may want to take a look at it.

I can help with the UI and Engine parts if you need it, but since the recipe needed to train the core library is private, your effort is also necessary if we want to produce a functional out-of-the-box English TTS.

TODOs:

This debut will hopefully raise some discussion in the English-speaking community and make it easier for new libraries to come along by means such as crowdfunding or donations.

There's a long way to go; feel free to modify the list or bring up any suggestions.

Pros

Anything you can expect from an English library

Cons

We need to write the code :(

Hiroshiba commented 2 years ago

Thank you for the issue! I've got my hands full with the Japanese TTS, but if anyone wants to take it on, I'd like to consider supporting them. So I won't close this issue; I'll keep it open until the time, or the right person, comes.

new accent panel for English

I think this is where the most UI design creativity will be needed. I don't know much about English, but unlike Japanese, a single English word can apparently carry stress in more than one place, so we'll need a UI that feels right. Also, intonation is expressed with intensity rather than pitch, so we may need to think about how to handle that as well.

train a model with LJSpeech

There should already be pretrained models available for LJSpeech, so it may be possible to build something without training one ourselves.

Patchethium commented 2 years ago

There should already be pretrained models available for LJSpeech, so it may be possible to build something without training one ourselves.

I don't think so... Not all TTS systems focus on letting users adjust the parameters themselves. Take Tacotron, which is completely E2E: I could grab a Tacotron LJSpeech pretrained model and glue it to this GUI, but users would never get to adjust the parameters themselves, which makes it meaningless. Actually, VOICEVOX is the only open-source project I know of with enough room here to make a difference. So if we are going to make it, we should make a VOICEVOX model (or whatever you call the models in voicevox_core).

Also, intonation is expressed with intensity rather than pitch, so we may need to think about how to handle that as well.

Although stress replaces the accent used for Japanese, the rest can just stay the same. Since the stress information is already provided by g2p, I suppose it won't take much work to port the training code to English.

The interface design is pretty easy, since it's just numbers representing different stress states; I may start working on it after I finish i18n.

I've got my hands full with the Japanese TTS, but if anyone wants to take it on, I'd like to consider supporting them.

The problem here is that the English-speaking developers we need to produce an English library can only be gathered by producing one 😅

Anyway, it'll be slow, but I CAN finish this stuff by myself; the only thing I can't do is produce a VOICEVOX model with LJSpeech, which is exactly where I'm asking for your help. Please consider looking into the subject when you have the time; meanwhile I'll be working on the other items on the TODO list.

Hiroshiba commented 2 years ago

I don't think so... Not all of the TTS systems are focusing on allowing users to adjust the parameters themselves.

You're right!! To let users make fine-grained adjustments the way VOICEVOX does, the common E2E approaches won't work, so a dedicated model has to be trained.

I appreciate your enthusiasm. I will keep this issue open, and I hope a machine learning engineer motivated to work on English TTS will show up.

Patchethium commented 2 years ago

So are you implying that any ML engineer can already reproduce the training? I thought the training recipe was private on your end...

Hiroshiba commented 2 years ago

My code is specialized for Japanese, so I don't think it would hold up at all for English. So whoever takes this on will probably have to build it from scratch...

Patchethium commented 2 years ago

I doubt it would take that much work to adapt to new languages... Deep-learning TTS systems seem to differ mostly in their frontends when it comes to language differences. In this case, replacing the accent information with stress information should be enough. Maybe you could take a slightly deeper look at the topic? Or, if I have to build it from scratch, could you share some information about how the training works?

Hiroshiba commented 2 years ago

No, no, no!

Trying to achieve high-quality TTS with machine learning is a real challenge. To begin with, a machine learning project has roughly five stages: data creation, code implementation, training, tuning, and turning it into an application. If we were to adapt the VOICEVOX training model to English, the challenges would be data creation and tuning.

The first step is to think carefully about what kind of data structure makes sense as input; a mistake here can make even otherwise achievable things impossible. For tuning, you have to reason about what is going wrong and go back and review the code and data so far. Only after repeating this cycle many times is it finally done.

It took me half a year to build the Japanese TTS. The initial cost is lower because the source code and other materials are already available, but I think that shortens things by two weeks at most. 👍

Patchethium commented 2 years ago

I see, thank you for the detailed explanation.

In that case, it would be nice if a volunteer English TTS developer came along, but I'll try doing it myself first.

And I'd like to keep this issue open for that developer's sake. What do you think?

Hiroshiba commented 2 years ago

Understood! I'll keep this issue open.

ghost commented 2 years ago

Hello! I'm no TTS developer, but I have an idea that might open VoiceVox to any language in the world.

I am an active user of DeepFaceLab (DFL), a machine learning program that replaces one person's face with another. This is also known as "deepfaking". The program is structured in such a way that anyone can create and train a model without prior programming knowledge. All the user has to do is give the AI the source and target material. Once these two pieces of data are provided, the user can prepare the data for training by extracting, enhancing, sorting, masking, and using other data preparation methods. At the start of training, the user can configure a variety of settings, from the type of architecture to the control of the Generative Adversarial Network (GAN).

I believe VoiceVox can incorporate some of these techniques used in DFL. Instead of having developers program and train their own models, they would design a separate program that would allow users to easily create models while also giving them more fine-tuned controls. The user can choose any sound source data and any language data they want. Models created in this separate program can be easily plugged into VoiceVox for use by other users.

The VoiceVox community, and other TTS communities like it, will then be split into two categories. You'd have people who create models and people who use the models. The users of VoiceVox don't need to learn how to create models if they don't want to. They can simply download and use community made models. This is also how the 3D animation/modeling community works. Some people create 3D model assets and others use these 3D model assets to create animations.

In conclusion, I believe VoiceVox would benefit greatly by allowing people to diversify into specific professions if they so please.

Sorry for the long explanation. I just thought this would be a good idea to implement. Thank you for reading! 🙂


Patchethium commented 2 years ago

Thanks for sharing your experience with DFL; such an active community sounds pretty promising.

I've been developing my own model since I last commented on this issue. I'm also planning to publish my training recipe; hopefully such a community will then grow around it.

Hiroshiba commented 1 year ago

I would like to ask someone knowledgeable in English TTS!

VOICEVOX has been created with two key focuses: (1) the ability to correct synthesized speech for unknown words so that they are pronounced correctly, and (2) the ability to control the synthesized speech so that it sounds natural. Character TTS like VOICEVOX is essential for synthesizing speech that can be used in video game commentary, as games often contain many unknown words.

For Japanese, (1) corresponds to "modifying the reading and accent," and (2) corresponds to "intonation (pitch) adjustment." For English, I believe (1) involves "modifying the pronunciation symbols, including stress positions," and (2) involves "adjusting the emphasis (power)." To achieve these, I think it is necessary to (1) "obtain pronunciation symbols from English text" and (2) "obtain the duration of each syllable (or mora)."

(1) seems to be possible with tools like eSpeak, but I would like to know if that is correct. I don't know how to do (2) (for Japanese, it is possible with the phoneme alignment library Julius). If you know about this, please let me know!
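
For (1), here is a minimal sketch of what this could look like, assuming espeak-ng is installed and on PATH and using only its standard flags (the text_to_ipa wrapper name is just illustrative):

```python
# Sketch of (1): obtaining pronunciation symbols (IPA, with stress marks)
# from English text by calling the espeak-ng command-line tool.
import subprocess

def text_to_ipa(text: str) -> str:
    # -q: don't play audio, --ipa: print IPA phonemes, -v en-us: US English voice
    result = subprocess.run(
        ["espeak-ng", "-q", "--ipa", "-v", "en-us", text],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

print(text_to_ipa("I read the book"))  # primary/secondary stress appear as ˈ and ˌ
```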

Additionally, I would like to know how people use existing English TTS software. How do English speakers deal with TTS mispronunciations? For example, what do you do when you want to differentiate between "I read (/ri:d/) the book" and "I read (/rɛd/) the book"? Also, how do you handle synthesizing speech for unknown words, such as proper names?


Patchethium commented 1 year ago

Good to know you're picking up English. Unfortunately I may not be able to provide much help, since I'm still working on my own TTS.

Anyway, for (1), the phonemizer, I recommend g2p_en; for (2), the forced aligner, I recommend mfa (the Montreal Forced Aligner).

g2p_en uses the CMUdict phoneme set, in which stress is a per-phoneme indicator. For example, in IH0 and AH2, IH and AH are the phonemes and 0 and 2 are the stress levels. As you can see, with CMUdict it's very easy to encode stress as a one-hot vector and concatenate it to the phoneme vector. However, you may also need to attach some word-boundary information to make words pronounced clearly.

g2p_en is written in numpy, so it shouldn't be a problem to embed it in the VOICEVOX engine.
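
To make that encoding concrete, here is a minimal sketch (not VOICEVOX engine code; split_stress is a made-up helper) of turning g2p_en output into phoneme/stress pairs plus a one-hot stress vector:

```python
# Sketch: split g2p_en / CMUdict tokens such as "IH0" or "AH2" into a phoneme
# and a stress class, then one-hot encode the stress.
# Assumes `pip install g2p_en numpy`.
import re
import numpy as np
from g2p_en import G2p

def split_stress(tokens):
    """Map ARPAbet tokens to (phoneme, stress) pairs; stress -1 means none (consonants)."""
    pairs = []
    for tok in tokens:
        m = re.fullmatch(r"([A-Z]+)([012])?", tok)
        if m:
            pairs.append((m.group(1), int(m.group(2)) if m.group(2) else -1))
    return pairs

g2p = G2p()
tokens = [t for t in g2p("I read the book") if t.strip()]  # drop word-boundary spaces
pairs = split_stress(tokens)                               # e.g. [('AY', 1), ('R', -1), ...]
stress_onehot = np.eye(4)[[s + 1 for _, s in pairs]]       # 4 classes: none, 0, 1, 2
print(pairs, stress_onehot.shape)
```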

I recommend reading the docs before using mfa. You can use mfa's ARPA model, which is basically just CMUdict. Luckily they provide this guide; follow it and you'll get TextGrid files containing the duration information. I myself was using an unsupervised approach to alignment, introduced in arXiv:2108.10447.
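
As a rough idea of what you get back from mfa, something like this reads the per-phone durations out of a TextGrid, assuming the `textgrid` Python package and that mfa wrote a tier named "phones" (the file path below is a placeholder):

```python
# Sketch: read per-phone durations from an MFA-produced TextGrid.
# Assumes `pip install textgrid`; the path below is hypothetical.
import textgrid

tg = textgrid.TextGrid.fromFile("aligned/LJ001-0001.TextGrid")

phones_tier = next(tier for tier in tg.tiers if tier.name == "phones")
durations = [
    (interval.mark, interval.maxTime - interval.minTime)
    for interval in phones_tier
    if interval.mark  # skip silence intervals, which have empty labels
]
print(durations[:5])  # e.g. [('DH', 0.06), ('AH0', 0.05), ...]
```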

English TTS apps also suffer from the disambiguation problem in g2p (i.e. the /ri:d/ vs. /rɛd/ issue) and take different approaches. Free offerings like Google's and the free tier of Microsoft Azure TTS simply ignore the problem. coqui-studio and Sonantic let users modify the phonemes, and even pitch, energy, and duration. ElevenLabs has the best approach IMO: they somehow combine NLP with TTS and eliminate the separate g2p stage.

Hiroshiba commented 1 year ago

@Patchethium

Thank you so much for your detailed explanation! I completely forgot about mfa. I didn't know the documentation was so comprehensive; it's really amazing.

I've tried out several TTS services, and they've been very helpful! While coqui-studio doesn't allow for phoneme modification, it does enable adjustments to pitch and energy for individual phonemes, which seems to make it possible to change stress placement. The UI is similar to CeVIO's. ElevenLabs is an end-to-end system, and while it doesn't allow user adjustments, the quality feels quite high.

I forgot to ask one crucial question. English-speaking regions seem to have highly developed end-to-end speech synthesis, but unlike Japanese Hiragana, English text doesn't have a one-to-one correspondence with pronunciation. Therefore, I think it wouldn't be easy to fix mispronunciations. Do people who actually use end-to-end TTS not find this inconvenient?


Patchethium commented 1 year ago

Therefore, I think it wouldn't be easy to fix mispronunciations

No, not really. Kids learn words at school with the help of pronunciation symbols like /ri:d/. We don't use them in daily life the way hiragana is used in Japanese, but we know how to use them when necessary.

You're right that almost no English TTS app includes a phoneme-editing feature, but I don't think they leave the problem there because they have no other way. One reason is that some of them, like Sonantic, need to sell this feature to enterprise customers; another is that it's just too trivial for them to bother with. What they focus on is speech synthesis itself; end-user accessibility has to step aside.

For example, I do feel frustrated when ElevenLabs reads a phone number as a decimal, but that can be fixed very easily by typing a space between each digit. Fixing pronunciation is not a big issue; they could add it anytime with a few lines of code. Their cross-lingual zero-shot voice cloning and extremely natural voices, on the other hand, are certainly the hard problems.

PS: Just a thought: I've started thinking about pitch/duration tuning the same way. When the voice is not natural enough, we tune the parameters. But what if we could achieve better naturalness in the first place? I'd say tuning/fixing pronunciation is a sweet girl: people ask about her when she's not on stage, but should we make her our main vocalist? I don't think so.

Hiroshiba commented 1 year ago

Thank you for the detailed explanation! It has definitely helped me get a clearer image of English TTS!

My thoughts might be a bit outdated, but I do have some unique ideas.

What I want to create is a "character voice synthesis" system (similar to VOICEROID), with the main use case being video game commentary. In a domain with many unknown words like this, I believe it's difficult to synthesize natural-sounding speech without user adjustments. For example, in a Pokémon game commentary, there should be a lot of unknown words. I believe that an application that can correct the pronunciation of unknown words is essential for it to be on the stage.

Additionally, I want to make it usable on regular people's local computers, which means that large models can't be used. Achieving both a small model size and naturalness at the same time is challenging, so I'm considering sacrificing some default voice quality for a smaller model size. This might not be much of an issue if there's a large amount of data available.

Well, having different opinions isn't necessarily a bad thing. We can simply adjust the UI according to the voice synthesis model...!

Patchethium commented 1 year ago

I'm not saying tuning isn't sweet, just explaining why the big companies have no interest in improving user accessibility. I'd definitely be happy to see an English TTS that has pitch tuning, lets me edit phonemes, and runs locally so I don't have to put up with coqui-studio's 502 Bad Gateway.

BTW, speaking of Pokémon, I copied a few Pokémon names into ElevenLabs and it says them all properly. I think that's because g2p for English is itself an easy task; I once trained a single LSTM seq2seq model on CMUdict, and it already got 90% accuracy on the test set.
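
For anyone curious, a rough PyTorch sketch of that kind of single-LSTM seq2seq G2P (letters in, ARPAbet-with-stress out); the vocabulary sizes and hyperparameters below are placeholders, not the actual ones:

```python
# Rough sketch of an LSTM seq2seq G2P model trained on CMUdict entries
# (spelling in, ARPAbet phonemes with stress out). Vocab sizes and
# hyperparameters are placeholders; real code also needs tokenization,
# padding, and a proper decoding loop (greedy or beam search).
import torch
import torch.nn as nn

class Seq2SeqG2P(nn.Module):
    def __init__(self, n_letters=30, n_phonemes=75, hidden=256):
        super().__init__()
        self.src_emb = nn.Embedding(n_letters, hidden)
        self.tgt_emb = nn.Embedding(n_phonemes, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_phonemes)

    def forward(self, letters, phonemes_in):
        # letters: (B, L_src) letter ids; phonemes_in: (B, L_tgt) shifted phoneme ids
        _, state = self.encoder(self.src_emb(letters))   # final (h, c) summarizes the spelling
        dec_out, _ = self.decoder(self.tgt_emb(phonemes_in), state)
        return self.out(dec_out)                         # (B, L_tgt, n_phonemes) logits

# One teacher-forced training step on a dummy batch of 8 "words".
model = Seq2SeqG2P()
letters = torch.randint(0, 30, (8, 12))
phonemes = torch.randint(0, 75, (8, 10))
logits = model(letters, phonemes[:, :-1])
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 75), phonemes[:, 1:].reshape(-1))
loss.backward()
```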

Hiroshiba commented 1 year ago

I see... Are those recent Pokémon names that aren't included in the dictionary? If so, maybe mispronunciations are quite rare in the first place.

Patchethium commented 1 year ago

ElevenLabs indeed has a good g2p model; it can even handle nonsense words that I asked ChatGPT to make up, like Flimble and Quink, very decently. G2p for English is not a problem, as I mentioned.

Hiroshiba commented 1 year ago

That's impressive. Do you happen to know whether g2p_en achieves similar accuracy? If it performs at the same level as ElevenLabs, it seems like we could lower the priority of the reading-correction feature a bit.

Patchethium commented 1 year ago

I don't think we need to spend more time on this topic. As I've said before, reading correction is always better than nothing, and I 100% support fitting it into the GUI. Just set off and go already; contributors will add these things if they think they're necessary.