MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License

csv output gloms words together [BUG] #521

Closed coreymillerrev closed 1 year ago

coreymillerrev commented 2 years ago

Debugging checklist

[x] Have you updated to latest MFA version? 2.0.6
[x] Have you tried rerunning the command with the --clean flag?

Describe the issue

When I run with --output_format=csv, some words that are separated by a space in the text wind up glommed together on the same row. For example, "bonjour j'organise" in the text became "bonjourj'organise" in the csv output.

For Reproducing your issue

mfa align testinput1 french_mfa french_mfa aligntest1csvoutput --clean --output_format=csv

testinput1 contains ftelpv29_chunk1.txt and ftelpv29_chunk1.wav (attached).
aligntest1csvoutput contains ftelpv29_chunk1.csv (attached).

Row 3 of the .csv has the glommed-together word.
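For anyone who wants to check their own output for the same symptom, here is a minimal Python sketch. It is not part of MFA. It flags any output label that matches two or more consecutive transcript words joined without a space. The helper names (`split_glommed`, `find_glommed`, `labels_from_csv`) are hypothetical, and the `Label` column name is an assumption about the csv layout, so adjust it to match the header of your file.

```python
import csv


def split_glommed(label, words):
    """Return the consecutive transcript words that concatenate to `label`,
    or None if the label is an ordinary single word or has no match."""
    target = label.lower()
    for start in range(len(words)):
        joined = ""
        for end in range(start, len(words)):
            joined += words[end]
            if not target.startswith(joined):
                break  # this starting point cannot produce the label
            if joined == target:
                # A single-word match is a normal label, not a merge.
                return words[start:end + 1] if end > start else None
    return None


def find_glommed(transcript, labels):
    """List (label, merged_words) pairs for labels that look like merged words."""
    words = transcript.lower().split()
    hits = []
    for label in labels:
        parts = split_glommed(label, words)
        if parts is not None:
            hits.append((label, parts))
    return hits


def labels_from_csv(path, label_column="Label"):
    """Read the label column from an alignment csv.
    The column name is an assumption; check your file's header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [row[label_column] for row in csv.DictReader(handle)]


# The example from this report.
print(find_glommed("bonjour j'organise", ["bonjourj'organise"]))
```

For real files, something like `find_glommed(open("ftelpv29_chunk1.txt", encoding="utf-8").read(), labels_from_csv("ftelpv29_chunk1.csv"))` should list the affected rows, assuming the column name matches.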

Please fill out the following:

  1. Corpus structure
    • What language is the corpus in? French
    • How many files/speakers? 1 file, 2 speakers
    • Are you using lab files or TextGrid files for input? lab files
  2. Dictionary
    • Are you using a dictionary from MFA? If so, which one? french_mfa dictionary
    • If it's a custom dictionary, what is the phoneset?
  3. Acoustic model
    • If you're using an acoustic model, is it one downloaded through MFA? If so, which one? french_mfa
    • If it's a model you've trained, what data was it trained on?

Log file

Please attach the log file for the run that encountered an error (by default these will be stored in ~/Documents/MFA).

attached

Desktop (please complete the following information):

Additional context

Add any other context about the problem here.

Attachments:
pretrained_aligner.log (/github.com/MontrealCorpusTools/Montreal-Forced-Aligner/files/9902261/pretrained_aligner.log)
ftelpv29_chunk1.csv

coreymillerrev commented 1 year ago

This appears to be an issue with clitic_markers, not with the json or csv output format. When clitic_markers is set to an empty string (""), the words come out on separate rows as expected.
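One possible way to apply this workaround is sketched below. It rests on two assumptions that this thread does not confirm: that clitic_markers can be set as a top-level key in a YAML config file, and that the file is passed to mfa align with --config_path. The file name no_clitics.yaml is hypothetical. The script only writes the config and prints the command, so check both against the MFA documentation for your version before running it.

```python
from pathlib import Path

# Assumption: MFA reads clitic_markers from a top-level key in a YAML config.
# The file name is hypothetical.
config_path = Path("no_clitics.yaml")
config_path.write_text('clitic_markers: ""\n', encoding="utf-8")

# Same command as in the report, plus the config file.
# Assumption: --config_path is the option that loads the YAML config.
command = [
    "mfa", "align",
    "testinput1", "french_mfa", "french_mfa", "aligntest1csvoutput",
    "--clean", "--output_format=csv",
    "--config_path", str(config_path),
]

# Print the command for review instead of running it.
print(" ".join(command))
```

Note that an empty clitic_markers setting turns off clitic handling altogether, so how words like "j'organise" are looked up in the dictionary may change as well.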