MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License

Option to use aligned data for more accurate model training (human-reinforced learning) #685

Open MLo7Ghinsan opened 11 months ago

MLo7Ghinsan commented 11 months ago

Is your feature request related to a problem? Please describe. The current acoustic training works from scratch, and the resulting model feels like it force-guesses the phone boundary placements without any guidance. This can easily lead to unstable alignments, especially with smaller datasets. Using human-labelled data as evaluation examples or as training feedback could significantly improve the stability and quality of the output alignments.

Describe the solution you'd like It would be great to have an acoustic training option/flag that enables taking aligned data as input instead of plain transcriptions. Since the current training only takes transcriptions into account, it is harder for the model to learn how to align, and the resulting model can perform poorly on input it has never seen before. An option to accept a human-labelled dataset, either for validation or as the input dataset, could improve the training process and, most likely, the model's performance. Models trained on pre-aligned data could then return much more accurate timing placements not only for speech input but also for potential singing input. We've found that aligning phonemes for singing is a more complex task for MFA, and we hope this kind of training feature would make it possible (while significantly cutting the need for hundreds of hours of data).

Describe alternatives you've considered Doesn't apply

Additional context The training data in this suggestion should be a phoneme-level alignment in each file, because if a file also contains a word-level alignment (a .TextGrid file can hold multiple alignment tiers), MFA training will treat the extra tier as another speaker. The word-to-phonemes dictionary can then be used to align the audio when the model is used. For .lab files we usually align audio and phonemes in HTK label format (see the reading sketch after the notes below).

Here's a visual example of the aligned file: pjs_singing__visual_example

And here are the example files for both speech and singing if you want to check them out (wav, lab, and textgrid files included): pjs_speech_and_singing_example.zip

Note 1: the SP phoneme marks silence, and the AP phoneme marks breath.
Note 2: in the TextGrid files, SP phonemes are replaced with empty intervals to indicate silence.
Corpus presented: PJS corpus
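
For concreteness, here is a minimal Python sketch of how a phoneme-level .lab file like the attached examples could be read. It assumes standard HTK label conventions (one `<start> <end> <phone>` triple per line, with times in 100 ns units) plus the SP-to-silence convention from Note 2; the filename and helper names are hypothetical, not part of MFA.

```python
# Minimal sketch of reading a phoneme-level HTK label (.lab) file like the
# ones attached above. Assumes standard HTK conventions: times in 100 ns ticks.
# The SP -> silence handling follows Note 2; "SP" comes from the attached
# examples, not from MFA itself.

HTK_TIME_UNIT = 1e-7  # HTK label times are in 100-nanosecond units


def read_htk_lab(path):
    """Return a list of (start_sec, end_sec, phone) intervals."""
    intervals = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 3:
                continue  # skip blank or malformed lines
            start, end, phone = int(parts[0]), int(parts[1]), parts[2]
            intervals.append((start * HTK_TIME_UNIT, end * HTK_TIME_UNIT, phone))
    return intervals


def to_textgrid_labels(intervals):
    """Map SP to an empty label (silence), mirroring the TextGrid convention."""
    return [(s, e, "" if p == "SP" else p) for s, e, p in intervals]


if __name__ == "__main__":
    phones = read_htk_lab("pjs001_song.lab")  # hypothetical filename
    for start, end, phone in to_textgrid_labels(phones):
        print(f"{start:.3f}\t{end:.3f}\t{phone or '<sil>'}")
```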

mmcauliffe commented 11 months ago

Yeah, so I think it'd be possible to do something like this now with MFA 3.0 and kalpy. I'll have to think about exactly how to do it, but I should be able to load reference alignments the same way as for alignment evaluation (https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/workflows/alignment.html#cmdoption-mfa-align-reference_directory), and then I think it'd be a matter of running through the alignments each pass and constraining the transition IDs to the phones of the reference alignment. I think it's possible, but I'll have to play around with it.
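
To make the proposed constraint concrete, here is a rough Python sketch of one way reference alignments could restrict which phones are considered at each frame during a training pass. This is illustrative only, not MFA or kalpy API; the 10 ms frame shift is the Kaldi/MFA default, and the boundary-tolerance parameter is a hypothetical knob.

```python
# Rough sketch of the constraint idea described above: turn a reference
# phone alignment into a per-frame set of allowed phones that an aligner
# could use to prune alignment paths. Illustrative Python, not MFA/kalpy code.

FRAME_SHIFT = 0.01  # seconds per frame (Kaldi/MFA default)


def allowed_phones_per_frame(ref_intervals, num_frames, tolerance_frames=3):
    """ref_intervals: list of (start_sec, end_sec, phone) tuples.

    Returns a list where entry t is the set of phones permitted at frame t.
    Frames within `tolerance_frames` of a reference boundary allow both
    neighbouring phones, so training is not over-constrained at the edges.
    """
    allowed = [set() for _ in range(num_frames)]
    for start, end, phone in ref_intervals:
        lo = max(0, int(start / FRAME_SHIFT) - tolerance_frames)
        hi = min(num_frames, int(end / FRAME_SHIFT) + tolerance_frames)
        for t in range(lo, hi):
            allowed[t].add(phone)
    return allowed


# During each training pass, an alignment path could then be pruned so that
# a candidate phone p at frame t is kept only if p is in allowed[t] (or if
# allowed[t] is empty, i.e. the reference has no coverage for that frame).
```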

MLo7Ghinsan commented 11 months ago

Thank you for your consideration. We will await any updates or news regarding this topic.