MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License

Use aligner as a library without creating corpus #140

Closed manrajgrover closed 2 years ago

manrajgrover commented 4 years ago

Description

Currently, from what I can see, the aligner is mainly usable as a command line tool: we pass it a dataset of wav and txt files and it aligns them. I would like to use it as a service, passing one audio file at a time with its transcription and getting the alignment back, similar to the Gentle tool. Is there a specific reason why the aligner expects a group of audio files rather than a single one? Is there a documented way to consume it as a library?

daniels20000 commented 4 years ago

I am also looking for some form of Python library for forced alignment. It would be a big improvement in flexibility.

SaadBazaz commented 2 years ago

@daniels20000 Check out aeneas. I'm currently using it for English. It's pretty fast, and it can be used directly in Python as a library.

mmcauliffe commented 2 years ago

I'm going to close this. Recent versions of MFA with the conda installation allow for proper library use; see https://montreal-forced-aligner.readthedocs.io/en/latest/reference/index.html for the API docs.
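For the one-file-at-a-time use case raised at the top of the thread, here is a minimal sketch that wraps the command line tool rather than the Python API. It assumes `mfa` is on the PATH and that `mfa align` takes a corpus directory, a dictionary, an acoustic model, and an output directory, in that order. The helper names (`build_single_file_job`, `align_single_file`) and the expected TextGrid location are illustrative assumptions, not part of MFA.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_single_file_job(wav_path, transcript, dictionary, acoustic_model, work_dir):
    """Lay out a one-utterance corpus and return the `mfa align` command for it.

    Hypothetical helper: `dictionary` and `acoustic_model` are whatever names
    or paths your MFA installation accepts.
    """
    wav_path = Path(wav_path)
    corpus_dir = Path(work_dir) / "corpus"
    output_dir = Path(work_dir) / "aligned"
    corpus_dir.mkdir(parents=True, exist_ok=True)

    # The corpus pairs each audio file with a transcript of the same stem.
    shutil.copy(wav_path, corpus_dir / wav_path.name)
    (corpus_dir / f"{wav_path.stem}.txt").write_text(transcript, encoding="utf-8")

    command = [
        "mfa", "align",
        str(corpus_dir), str(dictionary), str(acoustic_model), str(output_dir),
    ]
    # Assumption: the TextGrid is written under the output directory with the same stem.
    expected_textgrid = output_dir / f"{wav_path.stem}.TextGrid"
    return command, expected_textgrid


def align_single_file(wav_path, transcript, dictionary, acoustic_model):
    """Run MFA on a single utterance and return the TextGrid contents as text."""
    with tempfile.TemporaryDirectory() as work_dir:
        command, textgrid = build_single_file_job(
            wav_path, transcript, dictionary, acoustic_model, work_dir
        )
        subprocess.run(command, check=True)  # raises CalledProcessError if alignment fails
        return textgrid.read_text(encoding="utf-8")
```

Each call starts a fresh MFA process, so this trades throughput for simplicity. For a long-running service, the API docs linked above are the better starting point.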

With regard to aeneas, I might be missing something, but it's not actually a forced aligner, right? It looks like it uses a TTS engine to generate audio clips and aligns those to the original audio, so it's not going to be great for word- or phone-level alignment. It just seems like it's solving a different problem than MFA does.

SaadBazaz commented 2 years ago

According to their README, they claim to be doing forced alignment. I ran some tests and it works pretty well, especially for basic stuff (clean audio, simple English words, etc.). While MFA uses HMMs, they use DTW. Just two different approaches to the same problem.

For word-level alignments, one can simply use fragment size = 1 word.
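To illustrate the DTW idea mentioned above, here is a toy sketch of dynamic time warping over two 1-D sequences. It is not aeneas's implementation: the function name `dtw_path`, the scalar "features", and the absolute-difference cost are assumptions chosen to keep the example short. A real aligner would compare frames of acoustic features.

```python
def dtw_path(reference, query):
    """Return (total_cost, path) aligning two 1-D sequences with classic DTW.

    `path` is a list of (reference_index, query_index) pairs from start to end.
    """
    n, m = len(reference), len(query)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of reference[:i] with query[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(reference[i - 1] - query[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j - 1],  # advance both sequences
                cost[i - 1][j],      # advance the reference only
                cost[i][j - 1],      # advance the query only
            )

    # Backtrack from the end, always following the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# The query lingers on its first value, so DTW maps two query frames to reference frame 0.
total, path = dtw_path([0, 1, 2, 3], [0, 0, 1, 2, 3])
```

In the setup described in this thread, the reference would be the synthesized audio, whose fragment boundaries are known, and the query the real recording. The warping path then carries those boundaries over to the recording.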

Thanks for sharing the docs, though. I might consider MFA now thanks to this 👍

mmcauliffe commented 2 years ago

Hmm, yeah, I might have a narrower definition of forced alignment: specifically, the step in training ASR models where phones are aligned to audio (and word alignments are then derived from the phone alignments). By that definition I would consider aeneas as targeting more the segmentation problem, i.e., given a long audio file, finding roughly which time periods correspond to which sections of the transcript.
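To make the "word alignments are derived from the phone alignments" step concrete, here is a minimal sketch. It assumes phone intervals arrive as `(label, start, end)` tuples in time order and that each word's pronunciation is known. The function name, data layout, and timings are illustrative only and are not MFA's internal representation.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[str, float, float]  # (label, start_seconds, end_seconds)


def words_from_phones(
    phones: Sequence[Interval],
    pronunciations: Sequence[Tuple[str, Sequence[str]]],
) -> List[Interval]:
    """Collapse aligned phone intervals into word intervals.

    `pronunciations` lists each word with its phone labels, in spoken order.
    A word spans from the start of its first phone to the end of its last.
    """
    words: List[Interval] = []
    cursor = 0
    for word, word_phones in pronunciations:
        span = phones[cursor:cursor + len(word_phones)]
        labels = [label for label, _, _ in span]
        if labels != list(word_phones):
            raise ValueError(f"phone sequence does not match pronunciation of {word!r}")
        words.append((word, span[0][1], span[-1][2]))
        cursor += len(word_phones)
    if cursor != len(phones):
        raise ValueError("unconsumed phone intervals remain")
    return words


# Made-up timings for "the cat".
phones = [("DH", 0.00, 0.05), ("AH", 0.05, 0.12),
          ("K", 0.12, 0.20), ("AE", 0.20, 0.33), ("T", 0.33, 0.40)]
words = words_from_phones(phones, [("the", ["DH", "AH"]), ("cat", ["K", "AE", "T"])])
```

Word boundaries here are exactly as precise as the phone boundaries beneath them, which is why phone-level alignment matters for this definition of forced alignment.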

I'll take a closer look at aeneas and see if it's worth including in future forced alignment benchmarking, but my guess is that it will struggle on more informal, spontaneous speech.

SaadBazaz commented 2 years ago

Heck, worth a try. There's a nice list here if it helps.

I used Gentle too, but I found aeneas to be blazingly fast. Probably because of their approach and C-level implementation.