MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License

[BUG] oovs files not generated correctly (and then deleted) #819

Closed Kedersha closed 1 week ago

Kedersha commented 3 weeks ago

Debugging checklist

[X] Have you read the troubleshooting page (https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/troubleshooting.html) and searched the documentation to ensure that your issue is not addressed there?
[X] Have you updated to latest MFA version (check https://montreal-forced-aligner.readthedocs.io/en/latest/changelog/changelog_3.0.html)? What is the output of mfa version? running version 3.1.1
[X] Have you tried rerunning the command with the --clean flag?

Describe the issue

OOV files are being deleted after validation is complete; also, they themselves seem incomplete. I have 2 mini corpora of mp3s in Romanian and Greek (from https://www.omniglot.com/language/phrases/romanian.php and https://www.omniglot.com/language/phrases/greek.php) consisting of 15 sound files and their transcriptions. The problem exists for both of them, but I'll just relate it for Romanian.

Running mfa validate "PATH\MFA\Miniromanian" romanian_cv romanian_cv, I get messages saying

INFO     Out of vocabulary words
 WARNING  15 OOV word types
 WARNING  37total OOV tokens
 WARNING  For a full list of the word types, please see: PATH\MFA\Miniromanian\oovs_found.txt.
          For a by-utterance breakdown of missing words, see:
          PATH\MFA\Miniromanian2\utterance_oovs.txt

These files aren't written as described. "oovs_found.txt" doesn't exist at all. There are three temp files that exist only as long as it takes for the model to complete its training, named "oov_counts_romanian_cv.txt", "oovs_found_romanian_cv.txt" and "utterance_oovs.txt". These files are deleted when training is complete.

I ran it again and copied these files to another directory to save them from deletion. When I opened them, they had only 5 lines each, rather than the 15 or 37 I would have expected from the message.
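The copy-and-check step described above could be scripted along these lines. This is only a minimal sketch: the function name `snapshot_oov_files` and both directory arguments are hypothetical, the file name patterns are taken from the names reported in this issue, and nothing here is part of MFA itself. It copies any matching OOV report files out of a source directory and returns how many lines each one contains, so the counts can be compared with the totals in the warning messages.

```python
import shutil
from pathlib import Path

# Patterns based on the file names mentioned in this report (assumed, not an MFA API).
OOV_PATTERNS = ("oov_counts_*.txt", "oovs_found*.txt", "utterance_oovs.txt")


def snapshot_oov_files(src_dir, dest_dir):
    """Copy OOV report files from src_dir to dest_dir; return {file name: line count}."""
    src, dest = Path(src_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    counts = {}
    for pattern in OOV_PATTERNS:
        for path in sorted(src.glob(pattern)):
            # copy2 preserves timestamps, which helps when comparing runs
            shutil.copy2(path, dest / path.name)
            with open(path, encoding="utf-8") as handle:
                counts[path.name] = sum(1 for _ in handle)
    return counts
```

Since the files only exist while the command is running, the function would have to be called (for example, from a second terminal) before the cleanup step removes them.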

For Reproducing your issue

Please fill out the following:

  1. Corpus structure
    • What language is the corpus in? Romanian (applied to Greek too)
    • How many files/speakers? 15 files, 1 speaker
    • Are you using lab files or TextGrid files for input? .txt files ... is that okay?
  2. Dictionary
    • Are you using a dictionary from MFA? If so, which one? romanian_cv
    • If it's a custom dictionary, what is the phoneset? N/A
  3. Acoustic model
    • If you're using an acoustic model, is it one downloaded through MFA? If so, which one? romanian_cv
    • If it's a model you've trained, what data was it trained on? N/A

Log file

Please attach the log file for the run that encountered an error (by default these will be stored in ~/Documents/MFA).

Miniromanian2.log

amo104 commented 2 weeks ago

I'm seeing a similar issue while trying to inspect some of the final temp files from validation. I'm running with both the --debug and --no_final_cleanup flags, but, like you, I still get this as the last line of the log:

DEBUG - Cleaning up temporary files, use the --debug flag to keep temporary files.