MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License

[BUG] TextGrid Output gives incorrect `intervals: size` when `includeBlankSpaces == True` #756

Closed kw-colo closed 4 months ago

kw-colo commented 4 months ago

Debugging checklist

[X] Have you updated to latest MFA version?
[X] Have you tried rerunning the command with the --clean flag?

Describe the issue

I'm not an expert on anything here, but I ran into an issue while trying to parse the TextGrid output with the TextGrid Python library. That library uses the `intervals: size` key in the TextGrid format to determine how many entries an intervals list contains, and it has no contingency for when that size is wrong. In the MFA output, the size is wrong. This is from 27-123349-0041.TextGrid from the LibriSpeech example, where the header declares 40 intervals but the tier actually contains 47:

item []: 
    item [1]:
        class = "IntervalTier" 
        name = "words" 
        xmin = 0 
        xmax = 16.015 
        intervals: size = 40 
        intervals [1]:
            xmin = 0.0 
            xmax = 0.17 
            text = "" 
...
        intervals [40]:
            xmin = 13.62 
            xmax = 13.71 
            text = "the" 
        intervals [41]:
            xmin = 13.71 
            xmax = 14.24 
            text = "blunders" 
        intervals [42]:
            xmin = 14.24 
            xmax = 14.44 
            text = "of" 
        intervals [43]:
            xmin = 14.44 
            xmax = 14.83 
            text = "colonel" 
        intervals [44]:
            xmin = 14.83 
            xmax = 14.86 
            text = "" 
        intervals [45]:
            xmin = 14.86 
            xmax = 15.33 
            text = "bevil" 
        intervals [46]:
            xmin = 15.33 
            xmax = 15.9 
            text = "skelton" 
        intervals [47]:
            xmin = 15.9 
            xmax = 16.015 
            text = "" 
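
The mismatch in the excerpt above can be detected without any TextGrid library. The following is a minimal sketch, assuming the long TextGrid text format shown here; the function name `find_size_mismatches` and both regular expressions are illustrative and are not part of MFA or of any TextGrid library. It only looks at interval tiers.

```python
import re

# Patterns for the long TextGrid text format shown in the excerpt above.
SIZE_RE = re.compile(r"^\s*intervals: size = (\d+)\s*$")
ENTRY_RE = re.compile(r"^\s*intervals \[(\d+)\]:\s*$")


def find_size_mismatches(textgrid_text):
    """Return (declared, actual) pairs for each tier whose size header is wrong."""
    mismatches = []
    declared = None  # value of the most recent "intervals: size" line
    actual = 0       # number of "intervals [n]:" entries seen since that line
    for line in textgrid_text.splitlines():
        size_match = SIZE_RE.match(line)
        if size_match:
            # A new tier starts: settle the previous one first.
            if declared is not None and declared != actual:
                mismatches.append((declared, actual))
            declared = int(size_match.group(1))
            actual = 0
        elif ENTRY_RE.match(line):
            actual += 1
    # Settle the last tier in the file.
    if declared is not None and declared != actual:
        mismatches.append((declared, actual))
    return mismatches
```

Run on the full file behind the excerpt, a check like this should report the "words" tier, since its header says 40 while the entries run to 47.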

As near as I can tell, this behavior stems from interval_index in textgrid.py being able to grow past len(tier._entries) when includeBlankSpaces is enabled. Relevant code (comments mine):

                    fd.write(tab * 2 + f"intervals: size = {len(tier._entries)} \n") # <-- This is where the intervals: size key is set
                    interval_index = 1
                    if includeBlankSpaces and tier._entries:
                        if tier._entries[0][0] > 0.001:
                            fd.write(
                                f"{tab * 2}intervals [{interval_index}]:\n"
                                f"{tab * 3}xmin = 0.0 \n"
                                f"{tab * 3}xmax = {tier._entries[0][0]} \n"
                                f'{tab * 3}text = "" \n'
                            )
                            interval_index += 1 # <-- interval_index may be incremented here

                    for i, entry in enumerate(tier._entries): # <-- in a for loop, which should iterate only until i = len(tier._entries) - 1
                        start, end, label = entry
                        if (
                            includeBlankSpaces
                            and i > 0
                            and start - tier._entries[i - 1][1] > 0.001
                        ):
                            fd.write(
                                f"{tab * 2}intervals [{interval_index}]:\n"
                                f"{tab * 3}xmin = {tier._entries[i-1][1]} \n"
                                f"{tab * 3}xmax = {start} \n"
                                f'{tab * 3}text = "" \n'
                            )
                            interval_index += 1 # <-- interval_index gets incremented if there's a blank space
                        fd.write(
                            f"{tab * 2}intervals [{interval_index}]:\n"
                            f"{tab * 3}xmin = {start} \n"
                            f"{tab * 3}xmax = {end} \n"
                            f'{tab * 3}text = "{tgio_utils.escapeQuotes(label)}" \n'
                        )
                        interval_index += 1 # <-- interval_index gets incremented again
                    if includeBlankSpaces and tier._entries:
                        if self.maxTimestamp - tier._entries[-1][1] > 0.001:
                            fd.write(
                                f"{tab * 2}intervals [{interval_index}]:\n"
                                f"{tab * 3}xmin = {tier._entries[-1][1]} \n"
                                f"{tab * 3}xmax = {self.maxTimestamp} \n"
                                f'{tab * 3}text = "" \n'
                            )
                            interval_index += 1 # <-- and possibly incremented again near the end
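
One way to make the header agree with the body is to count the blank intervals before writing the `intervals: size` line. The following is a minimal sketch of that idea, not the fix that shipped: the function name `count_written_intervals` and its parameters are hypothetical, entries are assumed to be `(start, end, label)` tuples as in the loop above, and the 0.001 threshold is copied from the quoted code.

```python
def count_written_intervals(entries, max_timestamp,
                            include_blank_spaces=True, threshold=0.001):
    """Count the intervals the writer above would emit for one tier.

    entries: list of (start, end, label) tuples, sorted by start time.
    max_timestamp: end time of the whole TextGrid.
    """
    count = len(entries)
    if not include_blank_spaces or not entries:
        return count
    # Blank interval before the first entry.
    if entries[0][0] > threshold:
        count += 1
    # Blank intervals in gaps between consecutive entries.
    for previous, current in zip(entries, entries[1:]):
        if current[0] - previous[1] > threshold:
            count += 1
    # Blank interval after the last entry.
    if max_timestamp - entries[-1][1] > threshold:
        count += 1
    return count
```

With a helper like this, the header could be written from `count_written_intervals(...)` instead of `len(tier._entries)`, so the declared size would track whatever the loop later writes.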

For Reproducing your issue

Please fill out the following:

Run the LibriSpeech example and check the output.

This is using the exact instructions on the website, so I won't fill out the following

  1. Corpus structure
    • What language is the corpus in?
    • How many files/speakers?
    • Are you using lab files or TextGrid files for input?
  2. Dictionary
    • Are you using a dictionary from MFA? If so, which one?
    • If it's a custom dictionary, what is the phoneset?
  3. Acoustic model
    • If you're using an acoustic model, is it one downloaded through MFA? If so, which one?
    • If it's a model you've trained, what data was it trained on?

Log file

Please attach the log file for the run that encountered an error (by default these will be stored in ~/Documents/MFA).

N/A

Desktop (please complete the following information):

Additional context

I'll try to work around this today, but I can possibly put in a PR later if I can figure out how to test it and the fix turns out to be fairly simple.
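
A possible workaround on the reading side is to rewrite each `intervals: size` line with the number of entries that actually follow it before handing the text to a parser. This is only a sketch under the assumption that the file uses the long text format quoted in this issue; `repair_interval_sizes` is a hypothetical helper, not an MFA or TextGrid library function.

```python
import re

# Group 1 keeps the indentation and key, group 2 keeps trailing whitespace.
SIZE_RE = re.compile(r"^(\s*intervals: size = )\d+(\s*)$")
ENTRY_RE = re.compile(r"^\s*intervals \[\d+\]:\s*$")


def repair_interval_sizes(textgrid_text):
    """Return the TextGrid text with every interval size header corrected."""
    lines = textgrid_text.split("\n")
    header_index = None  # position of the current tier's size line
    count = 0            # entries seen since that size line

    def patch(index, value):
        match = SIZE_RE.match(lines[index])
        lines[index] = f"{match.group(1)}{value}{match.group(2)}"

    for i, line in enumerate(lines):
        if SIZE_RE.match(line):
            if header_index is not None:
                patch(header_index, count)
            header_index, count = i, 0
        elif ENTRY_RE.match(line):
            count += 1
    if header_index is not None:
        patch(header_index, count)
    return "\n".join(lines)
```

The repaired text can then be written to a temporary file, or passed directly to a parser that accepts strings.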

mmcauliffe commented 4 months ago

Oh good call, thanks, I'll get a fix out for that shortly!

mmcauliffe commented 4 months ago

Resolved in 3.0.0rc2