DigitalPhonetics / IMS-Toucan

Controllable and fast Text-to-Speech for over 7000 languages!
Apache License 2.0

Creating new text to IPA encoder. Does the existing model setup have placeholders for IPA tone markers? #10

Closed. michael-conrad closed this issue 2 years ago.

michael-conrad commented 2 years ago

Looking to take advantage of the wonderful work y'all have done.

Regarding creating a new text to IPA encoder: does the existing model embedding have placeholders for the full IPA character set, including the IPA standard tone markers?

Flux9665 commented 2 years ago

Unfortunately not, since tone markers affect their surroundings, but are not separate units with their own interval on the time axis. I wasn't sure what the best way of incorporating them would be. Supporting tonal languages is planned for a future version, but I'm not sure when that would be. Probably around October. You're welcome to contribute a solution though, if you find the time :)

Here's how I would do it:

michael-conrad commented 2 years ago

Wouldn't pitch and stress marks fall into the same category?

Flux9665 commented 2 years ago

Correct, explicit pitch and stress marks would work the same way. Anything supra-segmental that doesn't have its own section on the time axis but rather modifies its surroundings is a bit difficult to represent in TTS. For autoregressive models, they mess with the monotonic alignment. And for models with explicit duration prediction, they mess with getting the gold durations. Those problems can be solved, but I found that it works very well without them and the model learns e.g. lexical stress implicitly. Also, I found that espeak is pretty good with phonemes, but pretty bad with lexical stress in many languages. So anything supra-segmental is currently just stripped from the inputs. The only exceptions are !, ? and ., because they usually coincide with a pause, which does appear as its own segment on the time axis.
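
That stripping step could look roughly like this; the marker set below is just for illustration, not the actual list used in the toolkit:

# Sketch: strip supra-segmental markers from an IPA string, keeping the
# pause-like punctuation (!, ? and .) that does occupy time in the signal.
# The marker set is illustrative only.
SUPRASEGMENTALS = {"ˈ", "ˌ", "ː", "˥", "˦", "˧", "˨", "˩"}

def strip_suprasegmentals(phonemes: str) -> str:
    return "".join(ch for ch in phonemes if ch not in SUPRASEGMENTALS)

print(strip_suprasegmentals("ˈhɛˌloː!"))  # -> "hɛlo!"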

If the tone was derived from rules, I guess the model could also learn it implicitly, but since it isn't I don't think it would work without representing it some way in the input. The articulatory features actually make that kind of easier I would say, since the tone can just be added as another feature type for each unit.
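
For example, if each phone is already represented as a fixed-size feature vector, the tone could just be appended as one more column. A minimal sketch, with the shapes and tone inventory assumed for illustration:

import torch

NUM_TONES = 5  # hypothetical inventory: 0 = no tone, 1-5 = level tones

def add_tone_feature(phone_vectors: torch.Tensor, tones: list) -> torch.Tensor:
    # phone_vectors: (seq_len, feat_dim), one row per phone
    # tones: one tone id per phone, normalized into [0, 1]
    tone_column = torch.tensor(tones, dtype=phone_vectors.dtype).unsqueeze(1) / NUM_TONES
    return torch.cat([phone_vectors, tone_column], dim=1)  # (seq_len, feat_dim + 1)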

michael-conrad commented 2 years ago

I generally consider espeak bad. Tried to add Cherokee to it. Meh.

What if the tones were treated as being zero-length in duration?

Or simply encode each vowel + tone combination as a different symbol?

The instructions you gave me are beyond my ability to follow btw.

Flux9665 commented 2 years ago

One of the design goals of the toolkit was modularity, so everything is basically a wireframe with my personal favorite functioning components, but you can exchange those fairly easily. If you want to use a different text frontend, you can just exchange the text frontend as a whole, as long as the new text frontend has a function to turn a string of text into a string of phonemes and a function to turn a string of phonemes into a sequence of features for each phone. So exchanging espeak for something different should be doable with minimal effort.
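
Concretely, a replacement frontend only needs to provide something shaped like this (names and signatures here are illustrative, not the toolkit's exact API):

import torch

class MyTextFrontend:
    def text_to_phonemes(self, text: str) -> str:
        # e.g. a Cherokee-specific grapheme-to-IPA converter instead of espeak
        raise NotImplementedError

    def phonemes_to_features(self, phonemes: str) -> torch.Tensor:
        # one feature vector per phone, shape (num_phones, feat_dim)
        raise NotImplementedError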

Treating tones as zero-length in duration is a good idea; I planned on doing that with word boundaries, since they are not present in speech, but they are kind of important for text. Having sequences without zero-length markers for the alignment, but keeping track of their positions to add them back in with manual durations of 0 for the TTS, should work, since the tone information gets mixed into the phones in the encoder and then removed during the upsampling from phones to spectrogram frames. It would be less elegant than including them directly in the feature vector, but might be simpler. I'll think about which of the two options I'll do when I have the time to implement it.
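
A rough sketch of that bookkeeping, with the zero-length symbol and data shapes assumed for the example:

MARKERS = {"#"}  # hypothetical zero-length symbol, e.g. a word boundary or tone

def split_out_markers(phones: list):
    kept, marker_positions = [], []
    for p in phones:
        if p in MARKERS:
            marker_positions.append(len(kept))  # index in the reduced sequence
        else:
            kept.append(p)
    return kept, marker_positions

def reinsert_zero_durations(durations: list, marker_positions: list) -> list:
    out = list(durations)
    for pos in sorted(marker_positions, reverse=True):
        out.insert(pos, 0)  # the marker gets no spectrogram frames
    return out

phones, positions = split_out_markers(["h", "#", "a"])  # aligner sees ["h", "a"]
print(reinsert_zero_durations([5, 7], positions))        # -> [5, 0, 7]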

Encoding vowel + tone combinations as unique symbols would be functionally equivalent to adding the tone to the feature vector, just a different way to get there. In the end, there would be multiple feature vectors that are identical except for the tone dimension.

michael-conrad commented 2 years ago

It has occurred to me there is also the cadence / vowel length marker to take into account. Some languages use this as a grammatical marker / word differentiation feature.

In Cherokee, as an example, one would say "ji:gowhtiha" for "I see him, her", but would shorten the leading "i" and say "jigowhtiha" for "I see it."

Only the vowel duration is affected, not any other feature.

michael-conrad commented 2 years ago

I can't help but think that using the byte values of each char as an int value to tensorize would work better. This would solve the determinism issue and also increase flexibility by automatically handling additional IPA letters, punctuation, etc.

import torch

def string_to_tensors(text: str) -> list:
    # One LongTensor per character: its UTF-8 bytes as a little-endian int.
    vectors: list = []
    for ch in text:
        vectors.append(torch.LongTensor([int.from_bytes(ch.encode("UTF-8"), byteorder="little")]))
    return vectors

Flux9665 commented 2 years ago

Good point with the vowel length marker, I'll handle that the same way I'll handle tone, as an additional dimension in the feature vector. I think I've also decided to encode word boundaries into the feature vector. This encoding of properties into a feature vector is kind of a back-to-basics approach, since that has worked super well in the target cost of unit selection synthesis in the past.

Using byte values of unicode characters is a totally viable way to make a text frontend. I just think that it will complicate handling zero-length characters in the aligner. Only phones and pauses should go into the aligner. The idea of the articulatory features however is that the features are super informative such that you can share knowledge between phones. So if you have seen a certain phone much less than others, then information about voicedness or place of articulation can give you some hints on how to produce this sound that you have learned from other phones.

Having a closed set of phones is a bit problematic, so I'll try to include every IPA character in the articulatory features in the future (but maybe not all markers). My goal is kind of to reduce the problem of text-to-speech in multiple languages to a problem of text-to-IPA in those languages and then have a unified procedure for all languages that comes after the text-to-IPA step. And to achieve this, I think having a super informative vector representation of IPA symbols is the best approach, but certainly not the only one. Feel free to experiment with other input representations :)
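
To make the knowledge-sharing idea concrete, here is a toy version; the feature values are simplified and not the actual definitions in the toolkit:

# Toy articulatory descriptions: a rare phone overlaps with common ones,
# so knowledge about voicing or place of articulation transfers to it.
FEATURES = {
    "p": {"voiced": 0, "place": "bilabial", "manner": "plosive"},
    "b": {"voiced": 1, "place": "bilabial", "manner": "plosive"},
    "ɓ": {"voiced": 1, "place": "bilabial", "manner": "implosive"},  # rare
}
# Even if "ɓ" hardly appears in training, it shares two of three features
# with the well-trained "b", which gives the model hints for producing it.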

michael-conrad commented 2 years ago

I've decided that I need to retrain the aligner from scratch, especially as the vowel length markers indicate duration and some of the combined tone sequences imply the same.

I'm currently at step 320,620 with a total loss of 1.017.

michael-conrad commented 2 years ago

Now at step 323,080.

Flux9665 commented 2 years ago

Since lengthening and tone are not units on their own, I think it's better not to show them to the aligner as separate units and instead remove them from the phone sequence that goes into the aligner. It predicts based on the audio, not on the phones. So a longer unit or a differently pitched version of the same unit will still look like the same unit from the audio perspective and thus give the correct amount of frames. And hopefully it would turn out that units with the lengthening marker get more frames than units without it, purely based on the audio.

But maybe it still works, I'm not sure. If you just train the aligner on a single language and not a lot of different speakers, you only need a few thousand steps, so over 300,000 is probably pretty overkill. Even though the loss keeps going down, I think training for longer won't affect the accuracy of the alignment much.

michael-conrad commented 2 years ago

They may not be units on their own, but my attempted fine-tuning with the additional IPA symbols in the FastSpeech routine resulted in garbage output.

I've finished training the aligner on my data: chr, de, en, fr, nl, ru.

I think the fine-tuning plots look good. I think I've finally got a working version of the Meta checkpoint going (all changes committed to my repo fork). The discovery of the hard-coded language lists copied from the original was surprising. Can't the language list be pulled from the datasets?

I can attach the fine-tuning alignment plots if interested.

Flux9665 commented 2 years ago

> The discovery of the hard-coded language lists copied from the original was surprising. Can't the language list be pulled from the datasets?

Yes, that's totally unclean and a hack. I assumed nobody but me would ever run this code, since the Meta checkpoint is already available for download, so I thought everyone could just use that. I will create a version that should hopefully cover every phoneme in the IPA standard and maybe clean up those scripts by then.

> I can attach the fine-tuning alignment plots if interested.

No thanks, but if I train the aligner on sequences that include tone and run into problems, maybe I'll get back to you on that.

Flux9665 commented 2 years ago

Today's release includes support for all phones in the IPA standard and support for tone, lengthening and stress.

Tone and lengthening are added to the previous feature vector and stress is added to the following feature vector. So they are not treated as separate units, but instead modify their surrounding units. I tried this with Chinese and Vietnamese and it seems to work fairly well. With two tonal languages in the pretraining, I'm hoping the ability to finetune the meta checkpoint to Cherokee is improved.
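
Conceptually, the merging works like this; the symbol sets and feature updates are simplified for the sketch:

TONES = {"˥", "˦", "˧", "˨", "˩"}  # attach to the previous feature vector
STRESS = {"ˈ", "ˌ"}                # attach to the following feature vector

def merge_markers(phonemes: str) -> list:
    vectors = []            # one feature dict per actual phone
    pending_stress = None
    for ch in phonemes:
        if ch == "ː" and vectors:         # lengthening modifies the previous phone
            vectors[-1]["lengthened"] = True
        elif ch in TONES and vectors:     # so does tone
            vectors[-1]["tone"] = ch
        elif ch in STRESS:                # stress modifies the next phone
            pending_stress = ch
        else:
            feats = {"phone": ch}
            if pending_stress:
                feats["stress"], pending_stress = pending_stress, None
            vectors.append(feats)
    return vectors

print(merge_markers("ˈmaː˥ma"))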

I'm closing this issue because the requested feature is now present, but I'd be very interested in seeing how well this works for Cherokee!