facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

[Wav2Vec2] Zero-shot Transfer model on Common Voice - Espeak phonetizer #4045

Open patrickvonplaten opened 2 years ago

patrickvonplaten commented 2 years ago

Question

Hey Qiantong, @alexeib, @michaelauli,

Thanks a lot for open-sourcing the model weights of your recent paper Simple and Effective Zero-shot Cross-lingual Phoneme Recognition! I can run the models well and the output seems coherent. E.g., on a LibriSpeech speech input with the transcription A MAN SAID TO THE UNIVERSE SIR I EXIST:

The following models decode this input audio to the following phoneme sequence:

Note that the model's logits do not predict a word-splitting character (the |) since it's also not in the dictionary. Now if I use the espeak backend of the phonemizer library as follows:

from phonemizer import phonemize
from phonemizer.separator import Separator

phonemize("a man said to the universe sir i exist", lang="en-us", backend="espeak", strip=True, separator=Separator(phone="", word="")

I get the output: ɐmænsɛdtəðəjuːnɪvɜːssɜːɹaɪɛɡzɪst which is more or less the same as the prediction of the model - see:

+ɐmænsɛdtəðəjuːnɪvɚssɚaɪɛɡzɪst  # prediction
-ɐmænsɛdtəðəjuːnɪvɜːssɜːɹaɪɛɡzɪst # phonetized

=> So I'm assuming that the phonemizer command:

phonemize("a man said to the universe sir i exist", lang="en-us", backend="espeak", strip=True, separator=Separator(phone="", word="")

is correct here. However, I couldn't find any file that confirms this. Could you take a look to see whether the phonemizer command is correct? This would make training such a model possible :-) Also, if I now want to decode a French input sample, I would simply replace language="en-us" with language="fr-fr", no? (I've tried it and it also gave me very good results.)
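E.g. (a minimal sketch; the French sentence below is just an arbitrary example, not from the paper):

from phonemizer import phonemize
from phonemizer.separator import Separator

# Same call as above, only the language changes.
phonemize("le chat est sur la table", language="fr-fr",
          backend="espeak", strip=True, separator=Separator(phone="", word=""))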

Also, I had two more questions:

  1. Given that the output string is just a sequence of phonemes, is the phoneme error rate (PER) the same as the character error rate (CER)? IMO it should be the same, no?
  2. Is there any way to map such an output sequence of phonemes back to a "human-readable" output string? E.g., for a given language, can I define a dictionary that maps each phoneme to a character/subword?

Thanks a lot!

xuqiantong commented 2 years ago

Hi Patrick. Yes, your command is correct, and changing the language option is enough to phonemize different languages. We usually use separator=Separator(phone=' ', syllable='', word='') to separate phonemes, so that no manual parsing is needed afterwards.
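For example, something along these lines (just a sketch, not the exact command from our training recipe):

from phonemizer import phonemize
from phonemizer.separator import Separator

# Phones are separated by single spaces; syllable and word boundaries are dropped,
# so the output is roughly "ɐ m æ n s ɛ d t ə ð ə ..." for the sentence above.
phonemize("a man said to the universe sir i exist", language="en-us",
          backend="espeak", strip=True,
          separator=Separator(phone=" ", syllable="", word=""))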

  1. PER != CER, as each phoneme may contain more than one character. You need to parse them correctly in the training data, and the output from the model should be separated correctly as well (see the sketch after this list).
  2. Yes, you can have a phoneme -> word lexicon, and then use that in the beam-search decoder directly.
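To make point 1 concrete, here is a small illustrative sketch (the edit_distance helper is just for illustration, and the example reuses the ɜː/ɚ difference from the "universe" fragment above):

def edit_distance(ref, hyp):
    """Plain Levenshtein distance over arbitrary token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

ref = "j uː n ɪ v ɜː s".split()  # reference phonemes for "universe"
hyp = "j uː n ɪ v ɚ s".split()   # model output from the example above

per = edit_distance(ref, hyp) / len(ref)  # 1/7: one phoneme substituted
cer = edit_distance("".join(ref), "".join(hyp)) / len("".join(ref))  # 2/9: "ɜː" -> "ɚ" costs two character edits
print(per, cer)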
patrickvonplaten commented 2 years ago

Thanks a mille @xuqiantong - that's super useful! One last question regarding what was said above:

Note that the model's logits do not predict a word-splitting character (the |) since it's also not in the dictionary.

Why was the "|" not added as a token to the dictionary and the token embeddings? Wouldn't it make sense to train the model with "|" so that it's easier to later map a phoneme sequence to a word sequence (since "|" would let one know which phonemes belong to the same word)?

Do you think leaving out "|" in the dictionary improves phoneme error rate?

Thanks a lot!

xuqiantong commented 2 years ago

Hi @patrickvonplaten, this is a good question: inserting "|" between words would definitely help in decoding to words. We didn't do this in our work simply because we only focused on phoneme recognition.
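If you do want to map back to words, here is a rough sketch of building a word -> phonemes lexicon with phonemizer (the word list is arbitrary, and the tab-separated layout is only an assumption modeled on flashlight/wav2letter-style lexicons; check the format your decoder actually expects):

from phonemizer import phonemize
from phonemizer.separator import Separator

# Hypothetical word list; in practice, use the vocabulary of your target corpus.
words = ["man", "said", "universe", "exist"]

# One entry per line: the word, a tab, then its space-separated phonemes.
with open("lexicon.txt", "w", encoding="utf-8") as f:
    for word in words:
        phones = phonemize(word, language="en-us", backend="espeak", strip=True,
                           separator=Separator(phone=" ", syllable="", word=""))
        f.write(f"{word}\t{phones}\n")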

Feel free to try all the other possibilities :)