MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License

Documentation for segmenting audio with provided transcript #716

Closed: gadese closed this issue 9 months ago

gadese commented 9 months ago

Is your feature request related to a problem? Please describe.
I'd like to try the Segment Transcript utility. I am dealing with very long audio files (between 25 and 90 minutes) with multiple speakers, and I already have the associated text in the form of a script (.txt file).

For example:

SPEAKER 1
This is a sample sentence.
SPEAKER 2
This is a sentence from Speaker 2.
Speaker 1
Another sentence from speaker 1.
...

I've read the docs and tutorials, but this segment utility seems under-documented, specifically regarding the format of the input corpus. It's unclear to me how I should format my transcript and lay out the directory structure.

For pre-segmented audio/text it's pretty clear that I should follow the prescribed corpus format. But in my case, I am hoping to use the segment utility to obtain that format in the first place.

Should I simply dump my whole text in a single .lab file and treat everything as a single speaker?

Describe the solution you'd like
A concrete example using mfa segment to split audio/text.

mmcauliffe commented 9 months ago

Right, it would just be a single .lab file per audio file, with all the utterances collapsed across speakers. It also doesn't use any speaker adaptation, so the alignment it does internally is just a first pass, compared to the two passes of mfa align. Unfortunately, you'd have to re-add the speaker information afterward based on the contents.
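
Something like this would work for the flattening step, assuming your script sticks to the SPEAKER N convention from your example (the file names script.txt and full_file.lab are just placeholders):

import re

SCRIPT_PATH = "script.txt"   # placeholder: your speaker-annotated script
LAB_PATH = "full_file.lab"   # flat transcript that mfa segment will read

# A line like "SPEAKER 1" or "Speaker 2" marks a speaker change;
# every other non-empty line is that speaker's text.
speaker_re = re.compile(r"^speaker\s+\d+\s*$", re.IGNORECASE)

utterances = []  # ordered (speaker, text) pairs, kept for re-labeling later
current = None
with open(SCRIPT_PATH, encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if speaker_re.match(line):
            current = line.title()
        elif line and current is not None:
            utterances.append((current, line))

# Collapse everything into a single .lab file, dropping speaker labels.
with open(LAB_PATH, "w", encoding="utf-8") as f:
    f.write(" ".join(text for _, text in utterances))

Keeping the ordered utterances list around also gives you a starting point for re-adding speakers afterward: walk the segmented output and the script lines in parallel and match on the text.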

gadese commented 9 months ago

Hi @mmcauliffe , thanks for the quick response.

I might be missing something, but either my transcript isn't being segmented correctly or I've misunderstood what MFA's segment does.

I've created a corpus with this structure:

project-dir
  |--full_file.mp3
  |--full_file.lab

where full_file.lab contains:

This is a sample sentence. This is a sentence from Speaker 2. Another sentence from speaker 1.

Then I call mfa segment:

mfa segment /PATH/TO/PROJECT/project-dir english_us_arpa english_us_arpa /PATH/TO/OUTPUT/ --speechbrain --no_cuda -v

However, the generated output file does not contain any of my text. I'm getting a TextGrid like the one below, where every non-empty interval is just text = "speech". Is this expected behavior?

File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0 
xmax = 5993.112125 
tiers? <exists> 
size = 1 
item []: 
    item [1]:
        class = "IntervalTier" 
        name = "full_file" 
        xmin = 0 
        xmax = 5993.112125 
        intervals: size = 1765 
        intervals [1]:
            xmin = 0 
            xmax = 146.63999389648438 
            text = "" 
        intervals [2]:
            xmin = 146.63999389648438 
            xmax = 147.70000244140624 
            text = "speech" 
        intervals [3]:
            xmin = 147.70000244140624 
            xmax = 149.74 
            text = "" 
        intervals [4]:
            xmin = 149.74 
            xmax = 150.06999755859374 
            text = "speech" 
...

As a side note, I also had issues with the output directory: the command ignored the specified output path and wrote the output to CWD/english_us_arpa/full_file.TextGrid instead.

mmcauliffe commented 9 months ago

Can you double-check your version of MFA and update if necessary? The original functionality of "segment" was to just do VAD and detect "speech", but that's not super useful, so I migrated that functionality to mfa segment_vad and had the transcription-based segmentation take over mfa segment in 3.0.0a4: https://montreal-forced-aligner.readthedocs.io/en/latest/changelog/changelog_3.0.html#a4.
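
If you installed through conda-forge (the recommended route), something like this should do it:

mfa version
conda update -c conda-forge montreal-forced-aligner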

gadese commented 9 months ago

That did the trick; I was on the previous stable release (2.2.17). Coincidentally, it also fixed my issue with the output directory.

Thank you, closing the issue for now.