MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License

[BUG] export to textgrid fails because of apostrophe in transcript #629

Closed jeffmielke closed 1 year ago

jeffmielke commented 1 year ago

Debugging checklist

[x] Have you updated to latest MFA version?
[x] Have you tried rerunning the command with the --clean flag?

Describe the issue

When a transcript includes a word with an apostrophe such as TRAVELIN', validation and alignment seem to go just fine, but there is an error exporting to the TextGrid:

WARNING There were 1 errors encountered in generating TextGrids. Check raleigh_23_05_ok_output/output_errors.txt
for more details

output_errors.txt says this:

The following exceptions were encountered during the output of the alignments to TextGrids:

AlignmentExportError:

Error was encountered in exporting raleigh_23_05_ok_output/ral2060d.TextGrid:

Traceback (most recent call last):
  File "/home/jimielke/.conda/envs/aligner/lib/python3.10/site-packages/montreal_forced_aligner/alignment/multiprocessing.py", line 2498, in run
    export_textgrid(
  File "/home/jimielke/.conda/envs/aligner/lib/python3.10/site-packages/montreal_forced_aligner/textgrid.py", line 377, in export_textgrid
    tier.insertEntry(a.to_tg_interval(duration))
  File "/home/jimielke/.conda/envs/aligner/lib/python3.10/site-packages/montreal_forced_aligner/data.py", line 1757, in to_tg_interval
    assert begin < end
AssertionError

The error occurs if there is a similar dictionary entry without the apostrophe (such as TRAVELIN), even if TRAVELIN' (with the apostrophe) is in the dictionary. The problem goes away when I remove the apostrophe-less dictionary entry. It was easy to fix once I figured out what the problem was, but the error message didn't provide many clues that helped me find it.
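For anyone hitting the same thing, here is a minimal sketch of how one might look for such pairs in a custom dictionary before aligning. It assumes the dictionary is a plain-text file whose first whitespace-delimited field on each line is the word; the function names and the sample words are hypothetical and not part of MFA.

```python
from collections import defaultdict

# Characters treated as apostrophes for this check (straight and curly).
APOSTROPHES = "'\u2019"
_STRIP_TABLE = {ord(char): None for char in APOSTROPHES}


def find_apostrophe_collisions(words):
    """Group words that become identical once apostrophes are removed."""
    groups = defaultdict(set)
    for word in words:
        groups[word.translate(_STRIP_TABLE)].add(word)
    # Keep only groups where two or more distinct spellings collapse together.
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


def read_dictionary_words(path):
    """Return the first whitespace-delimited field of each non-empty line."""
    words = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if fields:
                words.append(fields[0])
    return words


# Hypothetical sample standing in for read_dictionary_words("my_dict.txt").
sample_words = ["TRAVELIN", "TRAVELIN'", "NGO'", "DOG"]
collisions = find_apostrophe_collisions(sample_words)
print(collisions)
```

Any group this reports is only a candidate for the conflict described above; it does not confirm that a given pair will actually trigger the export error.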

For Reproducing your issue Please fill out the following:

  1. Corpus structure
    • What language is the corpus in? English
    • How many files/speakers? 247
    • Are you using lab files or TextGrid files for input? TextGrid
  2. Dictionary
    • Are you using a dictionary from MFA? If so, which one? custom
    • If it's a custom dictionary, what is the phoneset? Arpabet
  3. Acoustic model
    • If you're using an acoustic model, is it one downloaded through MFA? If so, which one? english_us_arpa
    • If it's a model you've trained, what data was it trained on?

Log file

The log file is from a run with a subset of the files (the ones with problems).

Desktop (please complete the following information):

Additional context

pg_log_global.txt

mfaytak commented 1 year ago

Just to confirm this, I am getting the same error for my data set (macOS 13.4, conda install, very recently updated to the latest version). The orthography contains a lot of final glottal stops written as <'>, and a lot of words differing from those only in the absence of the glottal stop. The one output TextGrid that is produced happens to contain only one word with a final glottal stop (ngo'), with no corresponding glottal-free word anywhere in the transcript. All other files seem to encounter the issue described here. Unfortunately, I cannot remove the lexical items with final <'> since this is a phoneme; I will probably have to edit all my transcripts to change the phone set entirely.

Worth noting that in the one output file I got, any word following the apostrophe-containing word is not treated as a separate word: transcribed <ngo' abɨ> comes out as the "word" <ngo'abɨ>, for example.

mmcauliffe commented 1 year ago

Ah sorry, haven't had a chance to look into this, but you should be able to specify --no_textgrid_cleanup to disable the behavior, or also specify a config file with

clitic_markers:

to prevent them being analyzed as clitics.
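As a rough sketch of the two options above, the snippet below writes a config file containing only the `clitic_markers:` line and assembles the corresponding `mfa align` command lines without running them. The paths are hypothetical placeholders, and `--config_path` is an assumption about how the config file is passed; check `mfa align --help` for your installed version before relying on it.

```python
from pathlib import Path
import shlex
import tempfile


def write_clitic_config(path):
    """Write a config file with an empty clitic_markers entry, as quoted above."""
    path = Path(path)
    path.write_text("clitic_markers:\n", encoding="utf-8")
    return path


def build_align_commands(corpus, dictionary, acoustic_model, output, config_path):
    """Build (but do not run) the two alternative command lines."""
    base = ["mfa", "align", str(corpus), str(dictionary), acoustic_model, str(output)]
    return {
        # Option 1: the flag named in this thread.
        "no_cleanup": base + ["--no_textgrid_cleanup"],
        # Option 2: pass the config file; the flag name is an assumption.
        "config": base + ["--config_path", str(config_path)],
    }


with tempfile.TemporaryDirectory() as tmp:
    config = write_clitic_config(Path(tmp) / "no_clitics.yaml")
    config_text = config.read_text(encoding="utf-8")
    # corpus_dir, custom_dict.txt, and output_dir are hypothetical placeholders.
    commands = build_align_commands(
        "corpus_dir", "custom_dict.txt", "english_us_arpa", "output_dir", config
    )
    for name, argv in commands.items():
        print(name, shlex.join(argv))
```

The commands are only printed so they can be inspected first; the temporary directory is deleted when the block exits, so write the config to a permanent location before actually running the second command.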

jeffmielke commented 1 year ago

Thanks, Michael. I have a vague memory of --no_textgrid_cleanup. Sorry if this is my second time asking this question. It does seem like the default behavior may be problematic and unexpected for a lot of users. Thanks, Matt, for figuring out about the words being combined. I have noticed words like that in the textgrids but I hadn't figured out what was causing them.

mfaytak commented 1 year ago

Following up to confirm that --no_textgrid_cleanup does have the desired effect - all words separated and all files output.