Patchethium opened this issue 2 years ago
Thank you for the issue! I've got my hands full with the Japanese TTS, but if anyone wants to take this on, I'd like to look into supporting them. So I won't close this issue; I'll keep it open until the time, or the person, comes.
new accent panel for English
I think this panel will demand the most UI inventiveness. I don't know much English, but unlike Japanese, an English word can apparently carry stress in more than one position, so we need a UI that accounts for that. Also, English expresses prominence with stress (loudness) rather than pitch, so we may need to think about how to present that as well.
train a model with LJSpeech
A pretrained model for LJSpeech has probably already been published somewhere, so it may be possible to put a prototype together without training one ourselves.
> A pretrained model for LJSpeech has probably already been published somewhere, so it may be possible to put a prototype together without training one ourselves.
I don't think so... Not all TTS systems focus on letting users adjust the parameters themselves. Take Tacotron as an example: it's completely E2E, so I could grab a Tacotron LJSpeech pretrained model and glue it to this GUI, but users would never get to adjust the parameters themselves, which makes it meaningless. Actually, VOICEVOX is the only open-source repository I know of with enough room for such adjustments to make a difference. So if we are to make it, we should make a VOICEVOX model (or whatever you call the models in voicevox_core).
> Also, English expresses prominence with stress (loudness) rather than pitch, so we may need to think about how to present that as well.
Although stress replaces the Japanese accent, the rest can stay the same. Since the stress information is already provided by g2p, I suppose it won't take much work to port the training code to English.
The interface design is pretty easy, since it's just numbers representing different stress states; I may start working on it after I finish i18n.
> I've got my hands full with the Japanese TTS, but if anyone wants to take this on, I'd like to look into supporting them.
The problem here is that the English-speaking developers we'd need to produce an English library can only be gathered by producing one 😅
Anyway, it'll be slow, but I CAN finish this stuff by myself; the only thing I can't do is produce a VOICEVOX model with LJSpeech, which is exactly where I'm asking for your help. Please consider taking a look at the subject when you have the time; meanwhile, I'll be working on other items on the TODO list.
> I don't think so... Not all TTS systems focus on letting users adjust the parameters themselves.
You're right!! To let users make fine-grained adjustments the way VOICEVOX does, the usual E2E approach won't work, so the model needs to be trained specifically for it.
I hear your enthusiasm. I'll keep this issue open, and I hope a machine learning engineer motivated to build English TTS will turn up.
So are you implying that any ML engineer could already reproduce the training? I thought the training recipe was private on your side...
My code is specialized for Japanese, so I don't think it would hold up at all for English. Whoever takes this on will probably have to start from scratch...
I doubt it would take much work to adapt to new languages... When it comes to language differences, deep-learning TTS systems seem to vary mostly in their frontends. In this case, replacing accent with stress information should be enough. Maybe you could take a slightly deeper look at the topic? Or, if I have to build it from scratch, could you share some information about how the training works?
No, no, no!
Trying to achieve high-quality TTS with machine learning is genuinely hard. Machine learning projects go through roughly five stages: data creation, code implementation, training, tuning, and productization. If we adapted the VOICEVOX training model to English, the hard parts would be data creation and tuning.
The first step is to think carefully about what data structure makes sense as the input; get this wrong and even feasible things become infeasible. In tuning, you have to reason about what is going wrong and revisit the code and data you have so far. Only after many rounds of this cycle is the work done.
It took me half a year to build the Japanese TTS. Having the source code already available lowers the initial cost, but I think it saves two weeks at most. 👍
I see, thank you for the detailed explanation.
In that case, it would be great if a volunteer English TTS developer showed up, but I'll start by trying it myself.
And I'd like to keep this issue open for that developer; how does that sound?

Understood! I'll keep this issue open.
Hello! I'm no TTS developer, but I have an idea that might open VoiceVox to any language in the world.
I am an active user of DeepFaceLab (DFL), a machine learning program that replaces one person's face with another. This is also known as "deepfaking". The program is structured in such a way that anyone can create and train a model without prior programming knowledge. All the user has to do is give the AI the source and target material. Once these two pieces of data are provided, the user can prepare the data for training by extracting, enhancing, sorting, masking, and the use of other data preparation methods. At the start of training, the user can configure a variety of settings, from the type of architecture to the control of the Generative Adversarial Network (GAN).
I believe VoiceVox can incorporate some of these techniques used in DFL. Instead of having developers program and train their own models, they would design a separate program that would allow users to easily create models while also giving them more fine-tuned controls. The user can choose any sound source data and any language data they want. Models created in this separate program can be easily plugged into VoiceVox for use by other users.
The VoiceVox community, and other TTS communities like it, will then be split into two categories. You'd have people who create models and people who use the models. The users of VoiceVox don't need to learn how to create models if they don't want to. They can simply download and use community made models. This is also how the 3D animation/modeling community works. Some people create 3D model assets and others use these 3D model assets to create animations.
In conclusion, I believe VoiceVox would benefit greatly by allowing people to diversify into specific professions if they so please.
Sorry for the long explanation. I just thought this would be a good idea to implement. Thank you for reading! 🙂
Thanks for sharing your experience with DFL; such an active community sounds pretty promising.
I've been developing my own model since I last commented on this issue. I'm also planning to publish my training recipe; hopefully a community like that will grow around it.
I would like to ask someone knowledgeable in English TTS!
VOICEVOX has been created with two key focuses: (1) the ability to correct synthesized speech for unknown words so that they are pronounced correctly, and (2) the ability to control the synthesized speech so that it sounds natural. Character TTS like VOICEVOX is essential for synthesizing speech that can be used in video game commentary, as games often contain many unknown words.
For Japanese, (1) corresponds to "modifying the reading and accent," and (2) corresponds to "intonation (pitch) adjustment." For English, I believe (1) involves "modifying the pronunciation symbols, including stress positions," and (2) involves "adjusting the emphasis (power)." To achieve these, I think it is necessary to (1) "obtain pronunciation symbols from English text" and (2) "obtain the duration of each syllable (or mora)."
(1) seems to be possible with tools like eSpeak, but I would like to know if that is correct. I don't know how to do (2) (for Japanese, it is possible with the phoneme alignment library Julius). If you know about this, please let me know!
Additionally, I would like to know how people use existing English TTS software. How do English speakers deal with TTS mispronunciations? For example, what do you do when you want to differentiate between "I read (/ri:d/) the book" and "I read (/rɛd/) the book"? Also, how do you handle synthesizing speech for unknown words, such as proper names?
Good to see you picking up English TTS. Unfortunately I may not be able to help much, since I'm still working on my own TTS.
Anyway, for (1) the phonemizer, I recommend g2p_en; for (2) the forced aligner, I recommend MFA.
`g2p_en` uses the `cmudict` phoneme set, in which stress is a phoneme-wise indicator. For example, in `IH0` and `AH2`, `IH` and `AH` are the phonemes while `0` and `2` are the stress levels. As you can see, with `cmudict` it's very easy to encode stress as a one-hot vector and concatenate it to the phoneme vector. However, you may also need to attach some word-boundary information to get words pronounced clearly.

`g2p_en` is written in numpy, so it shouldn't be a problem to embed it in voicevox_engine.
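To make the encoding concrete, here is a minimal sketch (not VOICEVOX code) of splitting the stress digit off cmudict-style ARPAbet symbols and turning it into a one-hot vector, assuming input shaped like g2p_en's output (e.g. `["R", "EH1", "D"]` for "red"):

```python
# Split cmudict-style stress digits off ARPAbet symbols.
def split_stress(symbols):
    pairs = []
    for s in symbols:
        if s and s[-1].isdigit():              # vowel with stress digit, e.g. "EH1"
            pairs.append((s[:-1], int(s[-1])))
        else:                                  # consonant or boundary: no stress
            pairs.append((s, None))
    return pairs

def stress_one_hot(stress, levels=3):
    # cmudict stress digits: 0 = unstressed, 1 = primary, 2 = secondary.
    # None (consonants) maps to an all-zero vector.
    vec = [0] * levels
    if stress is not None:
        vec[stress] = 1
    return vec
```

The resulting one-hot can then be concatenated to the phoneme embedding, as described above.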
I recommend reading the docs before using `mfa`. You can use MFA's `ARPA` model, which is basically just `cmudict`. Luckily they provide this guide; follow it and you'll get TextGrid files containing the duration information. I myself was using an unsupervised alignment method, introduced in arXiv:2108.10447.
English TTS apps also suffer from the disambiguation problem in g2p (i.e. the /ri:d/ vs /rɛd/ issue) and take different approaches. Free apps like Google's and the free tier of Microsoft Azure TTS ignore the problem. coqui-studio and Sonantic allow users to modify phonemes, even pitch, energy, and duration. ElevenLabs has the best approach IMO: it somehow combines NLP with TTS and eliminates the g2p stage.
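As a toy illustration of the disambiguation problem (a made-up lookup table, not any of these apps' actual methods), a heteronym like "read" can be resolved with a part-of-speech tag, the kind of context signal an NLP-aware frontend can feed into g2p:

```python
# Hypothetical heteronym table: (word, Penn Treebank POS tag) -> ARPAbet.
HETERONYMS = {
    ("read", "VBD"): ["R", "EH1", "D"],  # past tense -> /rɛd/
    ("read", "VB"):  ["R", "IY1", "D"],  # base form  -> /ri:d/
}

def pronounce(word, pos_tag, fallback=None):
    # Fall back (e.g. to a dictionary lookup) for non-heteronyms.
    return HETERONYMS.get((word.lower(), pos_tag), fallback)
```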
@Patchethium
Thank you so much for your detailed explanation! I completely forgot about mfa. I didn't know the documentation was so comprehensive; it's really amazing.
I've tried out several TTS services, and they've been very helpful!
While coqui-studio doesn't allow for phoneme modification, it does enable adjustments to pitch and energy for individual phonemes, which seems to make it possible to change stress placement. The UI is similar to CeVIO's.
ElevenLabs is an end-to-end system, and while it doesn't allow user adjustments, the quality feels quite high.
I forgot to ask one crucial question. English-speaking regions seem to have highly developed end-to-end speech synthesis, but unlike Japanese Hiragana, English text doesn't have a one-to-one correspondence with pronunciation. Therefore, I think it wouldn't be easy to fix mispronunciations. Do people who actually use end-to-end TTS not find this inconvenient?
> Therefore, I think it wouldn't be easy to fix mispronunciations
No, not really; kids learn words at school with the help of pronunciation symbols, like /ri:d/. We don't use them in daily life the way hiragana is used in Japanese, but we know how to use them when necessary.
You're right that almost every English TTS app lacks a phoneme-editing feature, but I don't think they leave the problem there because they have no other way. One reason is that some of them need to sell this feature to enterprises, like Sonantic; another is that it's simply too trivial. Their focus is speech synthesis; end-user accessibility has to step aside.
For example, I do feel frustrated when ElevenLabs reads a phone number as a decimal number, but it can be fixed very easily by typing a space between the characters. Fixing pronunciation is not a big issue; they could add it anytime with a few lines of code. Their cross-lingual zero-shot voice cloning and extremely natural voice, on the other hand, are the genuinely hard problems.
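The workaround described above can even be automated on the user's side. A sketch, where the four-digit threshold is an arbitrary assumption:

```python
import re

def space_out_numbers(text):
    # Insert spaces between the digits of any run of 4+ digits so a TTS
    # reads them digit by digit instead of as one large number.
    return re.sub(r"\d{4,}", lambda m: " ".join(m.group()), text)
```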
PS: Just a thought. I've also started to think about pitch/duration tuning the same way: when the voice isn't natural enough, we tune the parameters. But what if we could achieve better naturalness in the first place? I'd say tuning/fixing pronunciation is a sweet girl: people ask about her when she's not on stage, but should we use her as our main vocalist? I don't think so.
Thank you for the detailed explanation! It has definitely helped me get a clearer image of English TTS!
My thoughts might be a bit outdated, but I do have some unique ideas.
What I want to create is a "character voice synthesis" system (similar to VOICEROID), with the main use case being video game commentary. In a domain with many unknown words like this, I believe it's difficult to synthesize natural-sounding speech without user adjustments. For example, in a Pokémon game commentary, there should be a lot of unknown words. I believe that an application that can correct the pronunciation of unknown words is essential for it to be on the stage.
Additionally, I want to make it usable on regular people's local computers, which means that large models can't be used. Achieving both a small model size and naturalness at the same time is challenging, so I'm considering sacrificing some default voice quality for a smaller model size. This might not be much of an issue if there's a large amount of data available.
Well, having different opinions isn't necessarily a bad thing. We can simply adjust the UI according to the voice synthesis model...!
I'm not saying tuning isn't sweet, just explaining why the big enterprises have no interest in improving user accessibility. I'd definitely be happy to see an English TTS with pitch tuning that lets me edit phonemes and runs locally, so I don't have to tolerate coqui-studio's 502 Bad Gateway.
BTW, speaking of Pokémon, I copied a few Pokémon names into ElevenLabs and it pronounced them all properly. I think it's because g2p in English is itself an easy task; I once trained a single LSTM seq2seq model on cmudict and it already got 90% accuracy on the test set.
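For context, the accuracy figure above is plain word-level exact match: a prediction counts only if every phoneme, stress digit included, matches the dictionary entry. A minimal sketch of that metric, using made-up toy entries rather than real model output:

```python
# Word-level exact-match accuracy for a g2p model on cmudict-style entries.
def g2p_accuracy(references, predictions):
    # references / predictions: dicts mapping word -> list of ARPAbet symbols.
    hits = sum(predictions.get(w) == ref for w, ref in references.items())
    return hits / len(references)

# Toy data: one correct prediction, one with a stress-digit error on "cat".
refs  = {"read": ["R", "IY1", "D"], "cat": ["K", "AE1", "T"]}
preds = {"read": ["R", "IY1", "D"], "cat": ["K", "AE0", "T"]}
```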
I see... Are those recent Pokémon names that aren't included in the dictionary? If not, it might be that there's very little chance of getting them wrong in the first place.
ElevenLabs indeed has a good g2p model; it even handles nonsense words I asked ChatGPT to make up, like Flimble and Quink, very decently. G2p in English is not a problem, as I mentioned.
That's impressive. Do you happen to know whether g2p_en achieves similar accuracy? If it performs at the same level as ElevenLabs, it seems we could lower the priority of the reading-correction feature a bit.
I don't think we need to spend more time on this topic. As I've said before, reading correction is always better than nothing, and I 100% support fitting one into the GUI. Just set off already; contributors will add these features if they think them necessary.
@Hiroshiba
According to #498, it might be time to take English into consideration.
I suggest we start with the LJSpeech corpus; since it's in the public domain, we can literally do anything with it. As for the frontend, ESPnet recommends g2p; you may want to take a look at it.
I can help with the UI and engine parts if you need, but since the recipe needed to train the core library is private, your effort is also necessary if we want to produce a functional out-of-the-box English TTS.
TODOs:
This debut will hopefully spark some discussion in the English-speaking community and make it easier for new libraries to follow, through means like crowdfunding or donations.
It's a long way to go, feel free to modify the list or bring up any suggestions.
Pros
Anything you can expect from an English library
Cons
We need to write the code :(