MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License

is there text to phoneme conversion without the alignment? #486

Open binbinxue opened 2 years ago

binbinxue commented 2 years ago

Is there a direct command line call to convert text to phonemes? I don't want alignments to audio, just the phonemes. The use case: after training a TTS model, at inference time we need phonemes as input, and there is no paired audio. I know a third-party library called g2p_en that does this, but only for US English; it uses NLTK part-of-speech tags to predict a word's phonemes depending on the context. But it's nothing compared to MFA, which has many more features and much broader language support. It would be nice to have this functionality. If it already exists, please point me in the right direction.

mmcauliffe commented 2 years ago

So that I understand the use case a bit better:

  1. Are there audio files involved at all?
  2. If it's just using text transcriptions, then the general mfa g2p functionality should cover it
    • You should be able to make a text file with a line per utterance, and then run g2p with that as the input "word list" (see the sketch after this list). Any spaces should just get ignored by default when running g2p.
  3. But otherwise you're looking for an output format that, instead of a TextGrid, is just a .lab file with phones? That should be pretty doable, but you can also output via JSON and then do some postprocessing to get the phone labels.
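
To make the word-list suggestion in (2) concrete, here is a minimal sketch, assuming a downloaded English (US) MFA G2P model; the file names are placeholders and the argument order of mfa g2p has changed between MFA versions, so verify with mfa g2p --help:

```bash
# utterances.txt holds one utterance per line, e.g. "the cat read a book"
# Download a pretrained G2P model first (model name assumed here):
mfa model download g2p english_us_mfa

# Run G2P over the utterance file as if it were a word list; spaces are ignored by default.
# Note: the positional argument order differs across MFA versions; check `mfa g2p --help`.
mfa g2p utterances.txt english_us_mfa utterance_pronunciations.txt
```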
binbinxue commented 2 years ago

Thanks so much for your reply. No audio files are involved; I was thinking of just text-to-phoneme conversion. I was looking at mfa g2p, but it only outputs the most likely pronunciation of each word (if that's the correct understanding). What I had in mind is more like a beam search over a text sentence that produces the most likely phoneme sequence based on the context.

I think this would be extremely useful for speech synthesis. There are third-party Python tools, but they are either plain word lookups, which don't handle homographs or misspelled words, or they use neural networks, which can be heavy to run and, to be honest, don't produce results as good as MFA's. Let me know if you'd consider adding this functionality. Thanks!

Hertin commented 1 year ago

Any updates on the text-to-phoneme conversion? The suggestion "You should be able to make a text file with a line per utterance, and then run g2p with that as the input 'word list'. Any spaces should just get ignored by default when running g2p" seems to give a phoneme dictionary instead of a sequence of transcribed phonemes. Thanks!

hypnaceae commented 1 year ago

You can use gruut: https://rhasspy.github.io/gruut/
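
For example, something along these lines may work (a sketch only; it assumes gruut is installed with English support via pip install gruut and that its command-line module takes a --language flag as in its README):

```bash
# Phonemize a sentence with gruut (a separate tool, not MFA);
# the output is JSON per sentence with per-word phonemes.
python3 -m gruut --language en-us 'The cat read a book.'
```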

saber258 commented 1 year ago

Couldn't agree more. I'm looking for a text-to-phoneme tool for Japanese because inference can't use audio. I tried mfa g2p at both the word-list and corpus level, but the output always lists the same word with many candidate pronunciations. Even with a single-sentence input, it may produce several possibilities for a single word.

mmcauliffe commented 1 year ago

2.2.4 will add support for this, but I would still generally caution against relying on it: homographs are going to be a huge issue, since it will only pick the best pronunciation for a word without the surrounding context, and it will skip over any graphemes it hasn't seen in training data. Both of those are pretty significant for any kanji g2p, since there are going to be missing pronunciations and often two or more ways of reading a kanji sequence, particularly for names, but for kana it should be OK. I did also add support for stdin/stdout piping for mfa g2p in 2.2.4, which I would recommend using with some sanity checks.
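
A sketch of what the piping workflow could look like in 2.2.4+; treating "-" as stdin/stdout and the english_us_mfa model name are assumptions here, so check mfa g2p --help for your version:

```bash
# Pipe a sentence through G2P and print the generated pronunciations to stdout.
echo "the cat read a book" | mfa g2p - english_us_mfa -
```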

binbinxue commented 1 year ago

Why not use the surrounding context? Something like a hidden Markov model that inherently captures the context?

mmcauliffe commented 1 year ago

There's no surrounding context because the G2P models are built on pronunciation dictionaries, so the only context is word-internal; there's no notion of part-of-speech tags. I've been thinking about expanding lexicons to include more information like parts of speech and linking surface forms, but I don't think that'll help here. I also haven't looked into doing part-of-speech tagging for all the languages, since it's generally pretty messy and falls apart on spontaneous speech.

Looking over the gruut docs, it might be possible to replace its G2P calls with MFA's G2P, but it's doing functionally the same thing with the same style of model, so you might want to file a ticket with them to see if it's possible to slot MFA models in somehow. The more I think about it, the more I'm convinced that this isn't really a use case that MFA will support very well, given its focus on recognition and on accepting variation for alignment.

As an example of how G2P works out of the box: for something like "the cat read a book", the English US MFA model gives ð cʰ æ t ɹ iː d ə b ʊ k, which has the wrong "read", a reduced, vowel-less pronunciation for "the" (because most instances of that three-letter string in English don't have a vowel), etc. So you'd really have to incorporate lexicon support like gruut/g2p_en does, along with POS tagging, etc., but again, MFA lexicons are optimized for recognition of spontaneous speech, which is likely not what you want for a TTS system. So I do think the best path is to improve those packages that are specifically for TTS, since MFA is not intended for TTS use.

saber258 commented 1 year ago

> 2.2.4 will add support for this, but I would still generally caution against relying on it: homographs are going to be a huge issue, since it will only pick the best pronunciation for a word without the surrounding context, and it will skip over any graphemes it hasn't seen in training data. Both of those are pretty significant for any kanji g2p, since there are going to be missing pronunciations and often two or more ways of reading a kanji sequence, particularly for names, but for kana it should be OK. I did also add support for stdin/stdout piping for mfa g2p in 2.2.4, which I would recommend using with some sanity checks.

Thank you for your reply. As you said, homographs are the key problem, especially with kanji, so maybe I have to look for other text-to-phoneme tools for Japanese.

Exactly, there are some g2p tools for Japanese such as CharsiuG2P and Phonemizer. However, I found that although MFA's G2P uses an IPA-based phoneme set, there are small differences between the phoneme sets of CharsiuG2P and MFA (both based on IPA). I think maybe different versions of the IPA chart were used.

In this situation, if I want to use another tool for text-to-phoneme conversion and MFA for aligning, do I need to re-train an MFA acoustic model on a large corpus using a new dictionary with the new phoneme set, or can I just re-train on my small training data?

I saw Use Case 4 in the Getting Started section of the MFA home page: training a new acoustic model on a corpus. The command there is mfa train ~.

The output of this command can be both a trained model and TextGrid alignments, meaning it trains on the new dataset and produces TextGrid alignments for that dataset. Will this process affect the quality of the TextGrid alignments? Do I need to pretrain an MFA acoustic model with the new dictionary on another large corpus and then use it on my dataset?

mmcauliffe commented 1 year ago

> > 2.2.4 will add support for this, but I would still generally caution against relying on it: homographs are going to be a huge issue, since it will only pick the best pronunciation for a word without the surrounding context, and it will skip over any graphemes it hasn't seen in training data. Both of those are pretty significant for any kanji g2p, since there are going to be missing pronunciations and often two or more ways of reading a kanji sequence, particularly for names, but for kana it should be OK. I did also add support for stdin/stdout piping for mfa g2p in 2.2.4, which I would recommend using with some sanity checks.
>
> Thank you for your reply. As you said, homographs are the key problem, especially with kanji, so maybe I have to look for other text-to-phoneme tools for Japanese.
>
> Exactly, there are some g2p tools for Japanese such as CharsiuG2P and Phonemizer. However, I found that although MFA's G2P uses an IPA-based phoneme set, there are small differences between the phoneme sets of CharsiuG2P and MFA (both based on IPA). I think maybe different versions of the IPA chart were used.

Right, from my understanding they're using raw WikiPron entries, whereas I've done some standardization across languages (hence why I call the phone set MFA rather than IPA). You can see the transformations I've done here: https://mfa-models.readthedocs.io/en/latest/mfa_phone_set.html

> In this situation, if I want to use another tool for text-to-phoneme conversion and MFA for aligning, do I need to re-train an MFA acoustic model on a large corpus using a new dictionary with the new phoneme set, or can I just re-train on my small training data?
>
> I saw Use Case 4 in the Getting Started section of the MFA home page: training a new acoustic model on a corpus. The command there is mfa train ~.
>
> The output of this command can be both a trained model and TextGrid alignments, meaning it trains on the new dataset and produces TextGrid alignments for that dataset. Will this process affect the quality of the TextGrid alignments? Do I need to pretrain an MFA acoustic model with the new dictionary on another large corpus and then use it on my dataset?

You should be fine training and exporting alignments on just your dataset, provided that it's a decently large corpus. I'm not sure how much speech is typically used for the TTS use case, but I've historically recommended that >20 hours of speech will often give similar or better alignments than using pretrained models (i.e., training on the 20-hour Buckeye corpus gives better alignment metrics than a pretrained model trained on 1000 hours of LibriSpeech), but also see some experiments here: https://memcauliffe.com/how-much-data-do-you-need-for-a-good-mfa-alignment.html
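
As a rough sketch of that workflow (the paths, dictionary, and model names below are placeholders; this assumes MFA 2.x command syntax, so double-check with mfa train --help and mfa align --help):

```bash
# Train a new acoustic model on your own corpus with your own dictionary.
mfa train ~/tts_corpus ~/kana_dictionary.dict ~/kana_acoustic_model.zip

# Export TextGrid alignments for the same corpus using the freshly trained model.
mfa align ~/tts_corpus ~/kana_dictionary.dict ~/kana_acoustic_model.zip ~/aligned_textgrids
```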

saber258 commented 1 year ago

Thank you for your help!

However, I can't guarantee that my corpus is a decently large one. What's worse, I tested CharsiuG2P and it seems that it also doesn't handle the homograph problem well.

I think a direct text-to-phoneme approach may not be suitable for Japanese because of the homographs; text-to-kana-to-phoneme may be a better way to solve this problem. So I investigated some tools and fortunately found that pyopenjtalk can convert Japanese text to kana and then to phonemes, and in my tests it handles homographs well. But I don't want to use pyopenjtalk's phone set.

Therefore, I intend to use pyopenjtalk only for text-to-kana and feed its output, which consists of kana, into the MFA aligner. I think I don't need to change anything about the phone set in MFA; the only thing I need to do is expand the vocabulary dictionary with MFA's G2P model, since there are not that many corresponding kana words in the MFA dictionary.

It is possible for me to directly convert all the kanji in the MFA dictionary to kana before aligning. Then, after I run mfa validate ~ and get the OOV list, I can run mfa g2p ~ to generate pronunciations for the OOV kana words in your standardization, and finally run mfa align ~.

Is my method and process right? Or have I misunderstood something?

mmcauliffe commented 1 year ago

Yep, that sounds right. That's basically what I've done for generating the pronunciation dictionary in the first place, getting hiragana representations for kanji first and running G2P on that because it's more transparent.
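
For reference, the workflow discussed above might look roughly like the following; the dictionary, model names, and the OOV file location are placeholders/assumptions, so check the validator output and mfa g2p --help for your version:

```bash
# 1. Convert kanji transcripts to kana beforehand (e.g. with pyopenjtalk), outside of MFA.

# 2. Validate the kana corpus to collect out-of-vocabulary words.
mfa validate ~/kana_corpus ~/japanese_kana.dict

# 3. Generate pronunciations for the OOV list with a Japanese G2P model and append them
#    to the dictionary (the OOV file path and argument order here are assumptions).
mfa g2p ~/Documents/MFA/kana_corpus/oovs_found.txt japanese_mfa oov_pronunciations.txt
cat oov_pronunciations.txt >> ~/japanese_kana.dict

# 4. Align with the expanded dictionary and a Japanese acoustic model.
mfa align ~/kana_corpus ~/japanese_kana.dict japanese_mfa ~/aligned_textgrids
```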