MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License
1.34k stars, 247 forks

[BUG] Tokenizer Fails on CommonVoice Japanese #575

Closed NataliaShmueli closed 1 year ago

NataliaShmueli commented 1 year ago

Debugging checklist

[x] Have you updated to latest MFA version?
[x] Have you tried rerunning the command with the --clean flag?

Describe the issue

The tokenizer failed on the Japanese CommonVoice corpus, and it also failed when I ran it on a single speaker. It only worked once I moved that speaker's recordings into a folder that I named JaTest. This only happens with CommonVoice, so it might be related to the length of the folder name, which was originally dbc3652a5a930b462947cfb0c88dd9ddb3ebe1c0cde73e7a020831c266f57ae464867e65ee452b1dbf2d034a39db03bab2773545ad809e2a2d209ed613492af8

For Reproducing your issue

Please fill out the following:

  1. Corpus structure
    • What language is the corpus in?
    • Japanese
    • How many files/speakers?
    • 1518
    • Are you using lab files or TextGrid files for input?
    • .lab
  2. Dictionary
    • Are you using a dictionary from MFA? If so, which one?
    • N/A
    • If it's a custom dictionary, what is the phoneset?
    • N/A
  3. Acoustic model
    • If you're using an acoustic model, is it one downloaded through MFA? If so, which one?
    • japanese_mfa
    • If it's a model you've trained, what data was it trained on?
    • N/A

Log file

Please attach the log file for the run that encountered an error (by default these will be stored in ~/Documents/MFA).

ja.log

Desktop (please complete the following information):

Additional context

TL;DR: this might be an issue with the length or naming scheme of the folders.

mmcauliffe commented 1 year ago

Yeah, so Windows has a maximum path length of 260 characters (https://learn.microsoft.com/en-us/windows/win32/fileio/naming-a-file#maximum-path-length-limitation), so if you have Common Voice nested in some deep folder structure, you'll hit this. You can move the directory somewhere closer to the drive root (e.g. C:/common_voice_jp) and it should work. I'll think about ways that MFA could get around it, but it is ultimately a Windows issue.

For reference, the path I use for it is D:\Data\speech\model_training_corpora\japanese\common_voice_ja
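
To see how close a corpus is to that limit, one option is to walk the corpus directory and list any files whose absolute path reaches 260 characters. The snippet below is a minimal sketch, not part of MFA: `find_long_paths` and `WINDOWS_MAX_PATH` are hypothetical names, and it only measures the corpus files themselves, not any files a tool writes elsewhere while processing them.

```python
import os

# Classic Windows MAX_PATH value. It counts the terminating NUL character,
# so a path of 260 or more visible characters is already too long.
WINDOWS_MAX_PATH = 260


def find_long_paths(root, limit=WINDOWS_MAX_PATH):
    """Return (length, path) pairs, longest first, for files under `root`
    whose absolute path is at least `limit` characters long."""
    too_long = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.abspath(os.path.join(dirpath, name))
            if len(full) >= limit:
                too_long.append((len(full), full))
    return sorted(too_long, reverse=True)


if __name__ == "__main__":
    # Hypothetical corpus location; replace with your own directory.
    for length, path in find_long_paths(r"C:\common_voice_jp"):
        print(length, path)
```

Passing a lower `limit` (say 200) leaves headroom for longer file names derived from the corpus paths.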

NataliaShmueli commented 1 year ago

Strangely enough, this has never been an issue for training/aligning, I don't think? I checked online for the length and it was only 181 characters at max.

K:\Training_Models\Spoken\Japanese\CommonVoice\cv\ja\1af9f4b197c3b75b95b91661651d490a1ce31d182b462702bc7613842a00146835a16b7d7d28c1e0e8e366c41216e786cf8c155fcbdcaab3f8f7d99b4a9c09fe

NataliaShmueli commented 1 year ago

Adding one more thing, it's refusing to tokenize corpora with Japanese names. I had a dataset folder in Katakana, and renaming it to Romaji made it work. Not a major issue though!
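
If anyone else runs into this, a quick pre-check for non-ASCII folder names might save a failed run. This is only a sketch of the workaround described above (renaming to ASCII), not a fix and not part of MFA; `non_ascii_components` is a hypothetical helper name.

```python
from pathlib import PureWindowsPath


def non_ascii_components(path):
    """Return the components of a Windows-style path that contain
    non-ASCII characters (for example, a folder named in Katakana)."""
    # PureWindowsPath accepts both "\\" and "/" separators and does not
    # touch the file system, so this also runs on non-Windows machines.
    return [part for part in PureWindowsPath(path).parts if not part.isascii()]


if __name__ == "__main__":
    # Hypothetical path used for illustration only.
    print(non_ascii_components("C:/data/コーパス/speaker1"))
```

Any component it reports is a candidate for renaming before running the tokenizer.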