MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License

Making a new acoustic model for another language #285

Open lsh950919 opened 3 years ago

lsh950919 commented 3 years ago

Hello,

First of all, thanks a lot for a great tool. I am enjoying the results it produces.

I am trying to train a new acoustic model for Korean, using g2pk as the grapheme-to-phoneme model, and I have a question about the dataset used for training.

I have read the page you linked describing the data that went into the pretrained models, but the audio dataset I have contains no speaker information for security reasons, so I cannot tell which speaker produced each audio file.

Is the speaker information necessary for each audio file?

The part that worried me is the statement in your paper that "the triphone models are used to generate alignments, which are then used for learning acoustic feature transforms on a per-speaker basis", which seems to suggest that audio files must be grouped by speaker so that speaker-specific feature transforms can be learned to make the model more robust.

Would the speaker information have a large impact on the accuracy of the model?

mmcauliffe commented 3 years ago

Speaker information does help alignment, but even without it you should be able to get reasonable alignments. If you upgrade to the 2.0.0a16 release I just made (via `pip install -U montreal_forced_aligner`), you'll be able to run `mfa align ... --disable_sat`. The `--disable_sat` flag performs only the first-pass alignment, skipping the per-speaker feature transform estimation and the second-pass alignment.
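
In practice the invocation would look something like this (the corpus, dictionary, model, and output paths are placeholders for your own):

```bash
# Upgrade to the 2.0.0a16 prerelease
pip install -U montreal_forced_aligner

# Align with speaker-adapted training disabled (first pass only);
# all four paths below are placeholders
mfa align ~/corpus/korean korean.dict korean_acoustic.zip ~/aligned --disable_sat
```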

lsh950919 commented 3 years ago

@mmcauliffe Thanks a lot. I will try it out.

I would like to ask another question regarding using pretrained models.

I have an acoustic model trained on a 12-hour, single-speaker dataset inside ~/Documents/MFA, and I wanted to use this model to force-align another audio file that is about 30 minutes long.

However, even after generating the dictionary for the 30-minute audio with the same g2pk module used for the 12-hour dataset, I am getting `montreal_forced_aligner.exceptions.PronunciationAcousticMismatchError`.

I zipped the acoustic_model directory produced by training on the 12-hour dataset and passed its path as the `acoustic_model_path` argument, but the error keeps coming up.

Is it not possible to use an acoustic model created from another dataset, even if both dictionaries were generated by the same g2pk module?

If so, is there a way to make training faster for a large audio file? I tried training on the 30-minute audio file itself, but monophone training alone was taking ~9 hours, and it then shut down with a `montreal_forced_aligner.exceptions.KaldiProcessingError`.

Here is the error message for the mismatch:

[Screenshot 2021-05-20 185550: the PronunciationAcousticMismatchError traceback]

and this is an image of the meta.yaml from the acoustic model trained on the 12-hour dataset:

[Screenshot 2021-05-20 185742: meta.yaml contents]

where escape sequences like \u1100 correspond to Korean (Hangul jamo) characters.
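
In case it helps with diagnosis: the mismatch error means the dictionary uses phones the acoustic model was not trained on. Here is a rough way to compare the two phone sets (a sketch; the file names and archive layout are placeholders from my setup):

```bash
# Print the phone inventory recorded in the model's meta.yaml
# (the member may be nested, e.g. acoustic_model/meta.yaml,
# depending on how the archive was zipped)
unzip -p korean_acoustic.zip meta.yaml

# List the unique phone symbols used in the dictionary: field 1 is
# the word, the remaining fields are its pronunciation
awk '{for (i = 2; i <= NF; i++) print $i}' korean.dict | sort -u
```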

lsh950919 commented 3 years ago

Never mind the question above: I updated the package as you suggested and created the model zip file using the `-o` argument of `mfa train`.
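
For reference, the command looked roughly like this (paths are placeholders, and the positional arguments may differ slightly between versions):

```bash
# Train on the 12-hour corpus and export the acoustic model zip via -o
mfa train ~/corpus/12hr korean.dict ~/aligned_12hr -o korean_acoustic.zip
```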

The error disappeared when I ran `mfa align` with the zip file, but the 30-minute audio file fails to align, with "Could not decode (beam too narrow)" reported in unaligned.txt.

I found in the official documentation that audio files are recommended to be segmented into chunks shorter than 30 seconds, so I assume this is the problem.

My goal in using the forced aligner was to split the audio file according to the alignment, but that no longer seems viable, so I will have to look for another way to split up the audio file and then get the alignments.

If anyone could help me find another solution, that would be great.

Thanks for the help!

mmcauliffe commented 3 years ago

You can set the beam higher for alignment and see if that helps (`mfa align ... --beam 1000` or something). The default alignment beam is 100, with a retry beam of 400, but those defaults are geared towards shorter files. Thirty minutes of audio is definitely tricky to align, and in the best case it will just take a really long time to generate alignments. I would recommend some sort of segmentation process (see here: https://montreal-forced-aligner.readthedocs.io/en/latest/create_segments.html), but it does require more manual intervention, and splitting the transcript to match those segments is tricky as well.
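
Concretely, something along these lines (paths are placeholders, and the `create_segments` arguments may differ by version, so check the docs linked above):

```bash
# Retry with a much wider beam (defaults: beam 100, retry beam 400);
# expect this to be slow on a 30-minute file
mfa align ~/corpus/long_audio korean.dict korean_acoustic.zip ~/aligned --beam 1000

# Or run VAD-based segmentation first, then align the resulting segments
mfa create_segments ~/corpus/long_audio ~/segmented
```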