Align transcript and speech (US + UK)

AvivSham commented 1 year ago

Hi all, thank you for this amazing repo, really nice work! We wish to align transcripts and speech (English, UK + US). What is the correct way to do it? If possible, we would prefer to use the ARPA phone set.

Thank you in advance! @yochaiye

yochaiye commented 1 year ago

Just to add: we tried to use Use Case 1 with the English MFA dictionary v2_0_0, which covers both UK and US English, but this dictionary does not cover many of the words in our dataset.

mmcauliffe commented 1 year ago

Ok, I've added a new Use Case 2 here: https://montreal-forced-aligner.readthedocs.io/en/latest/first_steps/index.html#use-cases, with some extra functionality for expanding pronunciation dictionaries in 2.2.3, so if you update to that and run through the steps there, you should be able to expand out any of the pretrained dictionaries.

I will say that I do not recommend ARPA for UK English, given that the ARPA model has only been trained on 1K hours of US English and the ARPA phone set only makes sense for US English, so I would not be surprised to see it struggle with r-lessness. The English MFA model has more UK English and world Englishes training data (though it is still slanted towards US dialects, which I'm hoping to address a bit in a new release soon-ish). The English MFA dictionary contains all pronunciations for all dialects, so if you want to constrain the pronunciation space to just UK English for your UK speakers and just US English for your US speakers, you can specify per-speaker dictionaries for use with the English MFA model: https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/dictionary.html#per-speaker-dictionaries
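As a rough illustration of the per-speaker setup, here is a minimal Python sketch that writes a speaker-to-dictionary mapping as YAML. It assumes the format on the linked page is a flat mapping with a `default` key plus one entry per speaker name; please check the docs before relying on it. The speaker names, dictionary paths, and the helper `build_speaker_dictionary_yaml` are all hypothetical placeholders, not part of MFA.

```python
from pathlib import Path


def build_speaker_dictionary_yaml(default_dict, speaker_dicts):
    """Return YAML text mapping speakers to dictionary paths.

    Assumed layout (verify against the MFA docs): a `default` entry for
    speakers that are not listed, then one `speaker: path` line each.
    """
    lines = [f"default: {default_dict}"]
    for speaker, dict_path in sorted(speaker_dicts.items()):
        lines.append(f"{speaker}: {dict_path}")
    return "\n".join(lines) + "\n"


# Hypothetical paths and speaker names, for illustration only.
yaml_text = build_speaker_dictionary_yaml(
    "/path/to/us_english.dict",
    {
        "speaker_uk_01": "/path/to/uk_english.dict",
        "speaker_uk_02": "/path/to/uk_english.dict",
    },
)

# Uncomment to write the file that would be passed in place of a single dictionary:
# Path("speaker_dictionaries.yaml").write_text(yaml_text, encoding="utf-8")
print(yaml_text)
```

This only helps when you know which speakers are UK and which are US; unlisted speakers fall back to the `default` entry.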

Hope that helps!

AvivSham commented 1 year ago

Thank you @mmcauliffe for your fast response.

For the step that creates the OOVs file by running:

mfa g2p ~/mfa_data/my_corpus english_us_arpa ~/mfa_data/g2pped_oovs.txt --dictionary_path english_us_arpa

What is the expected structure for the corpus file? One unified .txt file with one line per sample? A single line with all the text? (If it's none of the above, please help us understand the correct structure.)

Thanks.

mmcauliffe commented 1 year ago

The general format for corpora in MFA is described at https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/corpus_structure.html, but for the g2p command it'll use any text files (.txt, .lab, and .TextGrid) you have in the corpus directory for constructing the word list to run G2P on.
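To make the word-list idea concrete, here is a small Python sketch that gathers words from the .txt and .lab files under a corpus directory and reports the ones missing from a known word set. It is only an approximation for previewing likely OOVs: the tokenization is deliberately naive and is not MFA's own, .TextGrid files are skipped because they need a real parser, and the helper names and demo files are hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Plain-text transcript extensions handled by this sketch.
TEXT_EXTENSIONS = {".txt", ".lab"}


def collect_words(corpus_dir):
    """Collect lowercase word types from every transcript under corpus_dir."""
    words = set()
    for path in Path(corpus_dir).rglob("*"):
        if path.is_file() and path.suffix.lower() in TEXT_EXTENSIONS:
            text = path.read_text(encoding="utf-8")
            # Naive tokenization: runs of letters and apostrophes only.
            words.update(re.findall(r"[a-z']+", text.lower()))
    return words


def find_oovs(corpus_words, dictionary_words):
    """Return the sorted words that the dictionary does not cover."""
    return sorted(corpus_words - set(dictionary_words))


# Demo on a throwaway corpus: one transcript file per utterance.
with tempfile.TemporaryDirectory() as tmp:
    speaker_dir = Path(tmp) / "speaker_a"
    speaker_dir.mkdir()
    (speaker_dir / "utt1.lab").write_text("The colour of the lorry", encoding="utf-8")
    (speaker_dir / "utt2.txt").write_text("the color of the truck", encoding="utf-8")
    demo_oovs = find_oovs(collect_words(tmp), {"the", "of", "color", "truck"})

print(demo_oovs)  # ['colour', 'lorry']
```

The demo lays out transcripts as separate files in a speaker folder rather than one unified file, which is how I read the corpus structure page; treat that as an assumption and confirm it there.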

AvivSham commented 1 year ago

Thank you for your response @mmcauliffe. Our dataset contains a mix of US and UK English, without metadata on which sample is US and which is UK. I guess there is no G2P model that handles such a case, so I wonder which use case we should follow. Would that still be Use Case 2, or is Use Case 5 what we are looking for?
