[BUG] Tokenizer Fails on CommonVoice Japanese

NataliaShmueli commented 1 year ago

Debugging checklist

- [x] Have you updated to the latest MFA version?
- [x] Have you tried rerunning the command with the --clean flag?

Describe the issue

The tokenizer failed on Japanese CommonVoice. It also failed when I tried it on a single speaker. When I moved that speaker's test recordings to a folder named JaTest, it worked. This only happens with CommonVoice, so it might be related to the length of the folder name, which was originally dbc3652a5a930b462947cfb0c88dd9ddb3ebe1c0cde73e7a020831c266f57ae464867e65ee452b1dbf2d034a39db03bab2773545ad809e2a2d209ed613492af8

For Reproducing your issue

Please fill out the following:

Corpus structure
- What language is the corpus in?
  - Japanese
- How many files/speakers?
  - 1518
- Are you using lab files or TextGrid files for input?
  - .lab

Dictionary
- Are you using a dictionary from MFA? If so, which one?
  - N/A
- If it's a custom dictionary, what is the phoneset?
  - N/A

Acoustic model
- If you're using an acoustic model, is it one downloaded through MFA? If so, which one?
  - japanese_mfa
- If it's a model you've trained, what data was it trained on?
  - N/A

Log file

Please attach the log file for the run that encountered an error (by default these will be stored in ~/Documents/MFA).

ja.log

Desktop (please complete the following information):

OS: Windows
Version: 10
Any other details about the setup (Cloud, Docker, etc)

Additional context

TL;DR: this might be an issue with the length or naming scheme of the folders.

mmcauliffe commented 1 year ago

Yeah, so Windows has a maximum path length of 260 characters (https://learn.microsoft.com/en-us/windows/win32/fileio/naming-a-file#maximum-path-length-limitation), so if you have Common Voice nested in some deep folder structure, then you'll hit this. You can move the directory somewhere closer to the drive root (e.g. C:/common_voice_jp) and it should work. I'll think about ways that MFA could get around it, but it is ultimately a Windows issue.

For reference, the path I use for it is D:\Data\speech\model_training_corpora\japanese\common_voice_ja
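
To check whether a corpus is affected before moving it, something along these lines could list the files whose absolute path reaches the 260-character limit. This is only a sketch: `find_long_paths` and `WINDOWS_MAX_PATH` are hypothetical names, not part of MFA, and it inspects the corpus files only, not any temporary files a tool might derive from them.

```python
import os

# Documented Windows MAX_PATH; the count includes the terminating NUL.
WINDOWS_MAX_PATH = 260


def find_long_paths(root, limit=WINDOWS_MAX_PATH):
    """Return (length, path) pairs for files under root whose
    absolute path is at least `limit` characters, longest first."""
    too_long = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.abspath(os.path.join(dirpath, name))
            if len(full) >= limit:
                too_long.append((len(full), full))
    return sorted(too_long, reverse=True)
```

Calling `find_long_paths` on the corpus root and getting an empty list would suggest the corpus files themselves are within the limit.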

NataliaShmueli commented 1 year ago

Strangely enough, I don't think this has ever been an issue for training/aligning? I checked the length online and it was only 181 characters at most:

K:\Training_Models\Spoken\Japanese\CommonVoice\cv\ja\1af9f4b197c3b75b95b91661651d490a1ce31d182b462702bc7613842a00146835a16b7d7d28c1e0e8e366c41216e786cf8c155fcbdcaab3f8f7d99b4a9c09fe
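
The arithmetic for that folder can be sketched as follows, assuming the 260-character limit counts the terminating NUL and that one separator precedes the file name. The variable names are illustrative only.

```python
MAX_PATH = 260  # Windows limit; the count includes the terminating NUL

# The speaker folder quoted above, split so the raw string
# does not have to end in a backslash.
speaker_dir = (
    r"K:\Training_Models\Spoken\Japanese\CommonVoice\cv\ja"
    "\\1af9f4b197c3b75b95b91661651d490a1ce31d182b462702bc7613842a00146835"
    "a16b7d7d28c1e0e8e366c41216e786cf8c155fcbdcaab3f8f7d99b4a9c09fe"
)

dir_len = len(speaker_dir)
# Reserve one character for the separator before the file name
# and one for the terminating NUL.
name_budget = MAX_PATH - dir_len - 2

print(dir_len, name_budget)
```

For this folder the directory is 181 characters, which leaves 77 for a file name, so the corpus files themselves should fit. Whether longer paths get built elsewhere during tokenization is not something this calculation can show.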

NataliaShmueli commented 1 year ago

One more thing: it's refusing to tokenize corpora whose folder names are in Japanese. I had a dataset folder named in Katakana, and renaming it to Romaji made it work. Not a major issue though!
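
A quick way to spot such folders ahead of time is to flag path components that contain non-ASCII characters. This is a minimal sketch with a hypothetical helper name, `non_ascii_parts`; it only identifies candidates for renaming and says nothing about why the tokenizer fails on them.

```python
from pathlib import PurePath


def non_ascii_parts(path):
    """Return the components of `path` that contain non-ASCII characters."""
    return [part for part in PurePath(path).parts if not part.isascii()]
```

For example, `non_ascii_parts("corpora/コモンボイス/ja")` returns the Katakana component, while a path like `corpora/JaTest` yields an empty list.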

MontrealCorpusTools / Montreal-Forced-Aligner

[BUG] Tokenizer Fails on CommonVoice Japanese #575