Closed gadese closed 9 months ago
Right, it would just be a single .lab file per audio file, with all the utterances collapsed across speakers. It also doesn't use any speaker adaptation, so the alignment it does internally is just the first pass, compared to the two passes of mfa align. Unfortunately, you'd have to re-add the speaker information afterward based on the contents.
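One way to approach that post-processing step is to match each segmented utterance back to a speaker-annotated copy of the original script. This is only a sketch under that assumption — the function name and data shapes are hypothetical, and it relies on exact (normalized) text matches:

```python
# Hypothetical post-processing sketch: re-attach speaker labels to
# utterances that came back from segmentation without speaker info.
# Assumes the original script maps each sentence to a known speaker.

def reattach_speakers(utterances, script):
    """utterances: list of segmented utterance strings (speaker unknown).
    script: list of (speaker, sentence) pairs from the original transcript.
    Returns a list of (speaker, utterance), matched by normalized text."""
    lookup = {sentence.lower().strip(): speaker for speaker, sentence in script}
    labeled = []
    for utt in utterances:
        # Fall back to "unknown" when an utterance has no exact match
        # (e.g. the segmenter merged or split sentences).
        labeled.append((lookup.get(utt.lower().strip(), "unknown"), utt))
    return labeled

script = [
    ("speaker1", "This is a sample sentence."),
    ("speaker2", "This is a sentence from Speaker 2."),
    ("speaker1", "Another sentence from speaker 1."),
]
utterances = ["This is a sentence from Speaker 2.", "Another sentence from speaker 1."]
print(reattach_speakers(utterances, script))
```

Exact matching will miss utterances the segmenter merged or split, so a real version would likely need fuzzy matching; this just illustrates the idea.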
Hi @mmcauliffe, thanks for the quick response.
I might be missing something, but either my transcript isn't segmenting correctly, or I've misunderstood what MFA's segment does.
I've created a corpus with this structure:
project-dir
|--full_file.mp3
|--full_file.lab
where full_file.lab contains:
This is a sample sentence. This is a sentence from Speaker 2. Another sentence from speaker 1.
Then I call mfa segment:
mfa segment /PATH/TO/PROJECT/project-dir english_us_arpa english_us_arpa /PATH/TO/OUTPUT/ --speechbrain --no_cuda -v
However, the generated output file does not contain any of my text; I'm getting a TextGrid like this, where every non-empty interval is just text = "speech". Is this expected behavior?
File type = "ooTextFile"
Object class = "TextGrid"
xmin = 0
xmax = 5993.112125
tiers? <exists>
size = 1
item []:
item [1]:
class = "IntervalTier"
name = "full_file"
xmin = 0
xmax = 5993.112125
intervals: size = 1765
intervals [1]:
xmin = 0
xmax = 146.63999389648438
text = ""
intervals [2]:
xmin = 146.63999389648438
xmax = 147.70000244140624
text = "speech"
intervals [3]:
xmin = 147.70000244140624
xmax = 149.74
text = ""
intervals [4]:
xmin = 149.74
xmax = 150.06999755859374
text = "speech"
...
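For reference, the non-empty intervals in a long-format TextGrid like the one above can be pulled out with a small parser. This is a stdlib-only sketch keyed to the xmin/xmax/text lines shown (real projects typically use a TextGrid library such as praatio instead):

```python
import re

# Sketch: extract (xmin, xmax, text) triples from a long-format
# Praat TextGrid, keeping only non-empty intervals such as "speech".
def speech_intervals(textgrid_text):
    # Matches consecutive xmin / xmax / text lines within an interval;
    # tier headers don't match because no text line follows their xmax.
    pattern = re.compile(
        r'xmin = ([\d.]+)\s*\n\s*xmax = ([\d.]+)\s*\n\s*text = "([^"]*)"'
    )
    return [(float(a), float(b), t)
            for a, b, t in pattern.findall(textgrid_text)
            if t.strip()]

sample = '''
        intervals [1]:
            xmin = 0
            xmax = 146.63999389648438
            text = ""
        intervals [2]:
            xmin = 146.63999389648438
            xmax = 147.70000244140624
            text = "speech"
'''
print(speech_intervals(sample))
```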
As a side note, I also had an issue with the output directory: the command simply ignored the specified output path and wrote the output to CWD/english_us_arpa/full_file.TextGrid.
Can you double-check your version of MFA and update if necessary? The original functionality of segment was just to do VAD and detect "speech", but that's not super useful, so I migrated that functionality to mfa segment_vad and had the transcription-based segmentation take over mfa segment in 3.0.0a4: https://montreal-forced-aligner.readthedocs.io/en/latest/changelog/changelog_3.0.html#a4.
That did the trick; I was on the previous stable release (2.2.17). Coincidentally, it also fixed my issue with the output directory.
Thank you, closing the issue for now.
Is your feature request related to a problem? Please describe.
I'd like to try the Segment Transcript utility. I am dealing with very long audio files (between 25 and 90 minutes) that include multiple speakers, and I already have the associated text in the form of a script (.txt file).
I've read the docs and tutorials, but this segment utility seems under-documented, specifically regarding the format of the input corpus. It's unclear to me how to format my transcript (and the directory structure).
For pre-segmented audio/text it's pretty clear that I should be following the prescribed corpus format. But in my case, I am hoping to use the segment utility to obtain this format in the first place.
Should I simply dump my whole text in a single .lab file and treat everything as a single speaker?
Describe the solution you'd like
A concrete example using mfa segment to split audio/text.
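Based on the maintainer's reply above (one .lab transcript next to each audio file, all speakers collapsed), the layout can be produced with a few lines of stdlib Python. This is a sketch with placeholder paths and text, not an official MFA recipe:

```python
from pathlib import Path

# Sketch: lay out a corpus for "mfa segment" with one .lab transcript
# per audio file, as described above. Paths and contents are placeholders.
def write_corpus(corpus_dir, transcripts):
    """transcripts: mapping of audio file stem -> full transcript text.
    Writes <stem>.lab next to where <stem>.mp3 would live."""
    corpus = Path(corpus_dir)
    corpus.mkdir(parents=True, exist_ok=True)
    for stem, text in transcripts.items():
        (corpus / f"{stem}.lab").write_text(text, encoding="utf-8")

write_corpus("project-dir", {
    "full_file": "This is a sample sentence. This is a sentence from Speaker 2."
})
```

After dropping the matching full_file.mp3 into project-dir, the directory matches the structure shown earlier in the thread.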