Describe the issue
When running mfa validate with --ignore_acoustics, with or without the --clean flag, the process completes without errors. However, at the end, the directory where the oovs_found.txt file is saved is deleted.
For Reproducing your issue
Please fill out the following:
Corpus structure
What language is the corpus in? >>>>>> Italian
How many files/speakers? >>>>>> 66 files, 1 speaker
Are you using lab files or TextGrid files for input? >>>>>> TextGrid files
Dictionary
Are you using a dictionary from MFA? If so, which one? >>>>>> italian_cv
If it's a custom dictionary, what is the phoneset?
Acoustic model
If you're using an acoustic model, is it one downloaded through MFA? If so, which one?
If it's a model you've trained, what data was it trained on?
Log file
$ mfa validate --ignore_acoustics .../988/ italian_cv
INFO Setting up corpus information...
INFO Loading corpus from source files...
66% ?????????????????????????????????????????????????????????????????????? 66/100 [ 0:00:01 < -:--:-- , ? it/s ]
INFO Found 1 speaker across 66 files, average number of utterances per
speaker: 66.0
INFO Initializing multiprocessing jobs...
WARNING Number of jobs was specified as 3, but due to only having 1 speakers,
MFA will only use 1 jobs. Use the --single_speaker flag if you would
like to split utterances across jobs regardless of their speaker.
INFO Normalizing text...
100% ???????????????????????????????????????????????????????????????????????? 66/66 [ 0:00:01 < 0:00:00 , ? it/s ]
INFO Skipping acoustic feature generation
INFO Corpus
INFO 66 sound files
INFO 66 text files
INFO 1 speakers
INFO 66 utterances
INFO 749.552 seconds total duration
INFO Sound file read errors
INFO There were no issues reading sound files.
INFO Feature generation
INFO Acoustic feature generation was skipped.
INFO Files without transcriptions
INFO There were no sound files missing transcriptions.
INFO Transcriptions without sound files
INFO There were no transcription files missing sound files.
INFO Dictionary
INFO Out of vocabulary words
WARNING 24 OOV word types
WARNING 517total OOV tokens
WARNING For a full list of the word types, please see:
/.../MFA/988/oovs_found.txt. For a by-utterance
breakdown of missing words, see:
/.../MFA/988/utterance_oovs.txt
INFO Skipping test alignments.
INFO Done! Everything took 14.120 seconds
Desktop (please complete the following information):
OS: Linux
Version : Ubuntu 20.04.5
Additional context
During validation, I observed that the OOV files are created in the MFA directory, but the entire 988 directory is deleted once the command finishes.
I’m currently stuck because I have a large number of audio files with transcriptions, but only a fraction of them are exported to TextGrid after alignment. I suspect this may be due to OOV words. I’d like to add these OOV words to a custom dictionary, but without the report in oovs_found.txt, I’m unable to proceed. Can anyone confirm this issue or suggest a solution?
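As a stopgap while the report directory is being removed, the OOV word types can be collected without relying on oovs_found.txt. The following is a minimal sketch, not MFA's own method: it assumes the dictionary is a text file whose first whitespace-delimited field on each line is the word, and that the transcript text has already been pulled out of the TextGrid tiers into plain strings. The function names and the token pattern are hypothetical, and MFA's text normalization is more involved, so the counts may not match the 24 types and 517 tokens reported above.

```python
from collections import Counter
import re

# Assumption: a token is a run of letters, optionally joined by apostrophes.
# MFA's own text normalization is more involved, so counts may differ.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def load_dictionary_words(lines):
    """Collect the first whitespace-delimited field of each non-empty line."""
    words = set()
    for line in lines:
        fields = line.split()
        if fields:
            words.add(fields[0].lower())
    return words


def find_oovs(dictionary_words, transcripts):
    """Count tokens in the transcripts that are absent from the dictionary."""
    counts = Counter()
    for text in transcripts:
        for token in TOKEN_RE.findall(text.lower()):
            if token not in dictionary_words:
                counts[token] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical in-memory data; real use would read the dictionary file
    # and the transcript text extracted from the TextGrid tiers.
    dictionary = load_dictionary_words(["casa\tk a z a", "il\ti l"])
    oovs = find_oovs(dictionary, ["Il gatto e la casa", "il gatto"])
    for word, count in oovs.most_common():
        print(word, count)
```

The resulting word list could then be given pronunciations and appended to a custom dictionary, which is what the missing report was needed for.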
Debugging checklist
[x] Have you read the troubleshooting page (https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/troubleshooting.html) and searched the documentation to ensure that your issue is not addressed there?
[x] Have you updated to latest MFA version (check https://montreal-forced-aligner.readthedocs.io/en/latest/changelog/changelog_3.0.html)? What is the output of mfa version?
[x] Have you tried rerunning the command with the --clean flag?