ZachB100 / Piper-Training-Guide-with-Screen-Reader

A guide to help newcomers to the Piper TTS system create voices for NVDA and, down the line, other screen readers.

difficulty starting training #1

Open musicalman opened 1 year ago

musicalman commented 1 year ago

Hi, I tried to train a model using Piper, but am running into trouble. My question is a two-part one.

First, when I try to mount my Google Drive using the notebook, I get the following error:

ValueError: Mountpoint must not already contain files

So I'm wondering if it's broken for anyone else, or if I'm doing something wrong. Maybe I should be reporting this to someone else?

Second, in your guide you said you could train using a CPU, though it would be much slower. Even so, I wouldn't mind giving it a shot, even if it takes a week lol. I don't know how to go about this though, or if it's even possible with my setup (I'm on Windows 10, btw).

Even though I am having trouble, I sincerely thank you for writing the guide and gathering these resources! You obviously put a lot of time and care into it, and I hope a lot of folks find it useful.
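An aside on the mount error above: Colab's Drive mount refuses to attach over a directory that already contains files. A minimal sketch of the pre-mount check (the `google.colab` call is Colab-only and shown here commented out as an assumption about the environment):

```python
import os

def safe_to_mount(mountpoint: str) -> bool:
    """True if Drive can be mounted at this path without the
    "Mountpoint must not already contain files" ValueError:
    the directory must be missing or empty."""
    return not os.path.exists(mountpoint) or not os.listdir(mountpoint)

# In Colab itself you would then guard the mount (Colab-only API):
# from google.colab import drive
# if safe_to_mount("/content/drive"):
#     drive.mount("/content/drive")
```

If the check fails, moving or deleting the stray files under the mountpoint (often `/content/drive`) before mounting is usually enough.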

ZachB100 commented 1 year ago

Hi there, I just tried with the current notebook and I'm not having this issue, so I'm not sure what might be going on there. It's worth noting that right now the notebook appears to be broken: something happened with the fork of Piper it uses. The author did not use the official version on GitHub, as a few requirements needed to be changed for it to work in Colab. Hopefully it can be updated later today, but if not, I will make my own fork and update the link to the notebook in the guide. Stay tuned.

musicalman commented 1 year ago

Hi, As it turns out, I had to reinstall Windows today because of a driver issue that I couldn't figure out how to fix. Afterward, I tried the notebook again and mounting worked, so perhaps something was broken on my previous Windows installation? I can't think of what, though, since my web browser is portable.

Anyhow, I see what you mean by the notebook being broken; I couldn't start training, and I kept getting errors about a shape mismatch or something. Do let me know when you get a working notebook!

One more question: when selecting a voice model to fine-tune, is it recommended to select a voice that sounds closest to your dataset, or are the language and quality all that matter?

ZachB100 commented 1 year ago

Hi again, it looks like the notebook has now been fixed, but I'm about to try it to make sure. You may be getting errors because you selected a pre-trained model at a different quality level than the one you are fine-tuning at. For example, selecting a pre-trained model at low quality but trying to fine-tune at medium quality will not work. In terms of model selection, I'm not really sure. The model you select definitely has a bearing on how the final output turns out, so what I would recommend is to try to find something that performs well on its own without fine-tuning. Hope this helps, and let me know if you have any other questions.
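The quality mismatch described above can be caught before training starts by comparing the audio sections of the two voice configs. A hedged sketch; the `audio.quality`/`audio.sample_rate` layout reflects the Piper `.onnx.json` voice configs of the time (low is 16 kHz, medium and high 22.05 kHz) and may differ in newer releases:

```python
import json

def check_quality_match(pretrained_cfg: str, target_cfg: str) -> bool:
    """Return True when the pre-trained checkpoint's config and the
    intended training config agree on quality and sample rate, i.e.
    fine-tuning one from the other should be shape-compatible."""
    with open(pretrained_cfg) as f:
        a = json.load(f)["audio"]
    with open(target_cfg) as f:
        b = json.load(f)["audio"]
    return (a.get("quality") == b.get("quality")
            and a["sample_rate"] == b["sample_rate"])
```

Running this on the downloaded model's JSON and your own preprocessing output would flag a low-vs-medium mismatch before any GPU time is spent.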

rmcpantoja commented 1 year ago

Hi @musicalman, Colab has nothing to do with any other part of your system; it just runs in your browser. I think you did not execute the cells in order at that moment, hence the mount error. Regarding the errors with the pre-trained model: I have now uploaded more up-to-date models, since Piper was updated recently, and that update added new IPA symbols, which means all previously made models have to be retrained. It should work now. Regarding the selection, yes, for better results it's best to select a base model matching the gender of the voice in your dataset (male/female), but I see there are not many of that type.

musicalman commented 1 year ago

Hi, unfortunately I am still having trouble training a model. I ran the cells in order, and used the default settings for preprocessing and training. I selected medium quality, and then I selected and downloaded the Joe medium (fine-tuned) model for tuning. However, when I attempt to start training, I get the following error:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/content/piper/src/python/piper_train/__main__.py", line 10, in <module>
    from .vits.lightning import VitsModel
  File "/content/piper/src/python/piper_train/vits/lightning.py", line 312
    "{self.test_audio}/{newtag}.wav",
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SyntaxError: invalid syntax

Any ideas what I could be doing wrong or should double-check? I'm stumped.

rmcpantoja commented 1 year ago

Hi @musicalman, I'm very sorry for the inconvenience. It was my mistake while adding improvements to the notebook, but I already fixed it a few hours ago. Now everything should work without problems.
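A note on reading that traceback: Python often reports `SyntaxError: invalid syntax` on a line that looks perfectly fine when an earlier line left a bracket open, so the quoted `"{self.test_audio}/..."` line was not necessarily the broken one. A small self-contained illustration (the two-line snippet being compiled is invented for the demo, not taken from the notebook):

```python
# A valid-looking line can be blamed for a SyntaxError introduced
# earlier: the parser only fails once it reaches a token that cannot
# continue the still-open expression.
broken = (
    'path = os.path.join(audio_dir\n'             # missing ")" here
    'name = "{self.test_audio}/{newtag}.wav"\n'   # parser fails around here
)
try:
    compile(broken, "<demo>", "exec")
    print("compiled fine")
except SyntaxError as err:
    print("SyntaxError reported at line", err.lineno)
```

In cases like this, checking the lines just above the one the traceback points at is usually the fastest way to find the real defect.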

musicalman commented 1 year ago

Hi, yay, training was successful! I'm wondering, though, if some clarification can be given on how to get the best results? Maybe not, since AI is one of those things that lends itself to experimenting.

My dataset consists of 50 minutes of a male speaker reading vocabulary words, definitions and example sentences, 897 utterances in total. The order of utterances goes something like this:

audio1|the word
audio2|the definition
audio3|the example sentence
audio4|maybe another example sentence

This pattern continues more or less throughout, though I got rid of some utterances that I didn't really like the sound of.

So now we get to training. Colab let me train for about 4 hours before kicking me out. I used the default settings for training, and quality was set to medium. What I noticed was that the preview audio files started out sounding horrible but then improved. However, after about an hour or so of training, improvement was slower and less steady. One set of preview files would sound pretty good, but then the next set would sound worse. It felt like three steps forward, two steps back, for the last couple of hours. The model never seemed to learn the full inflection of the narrator. It seems to know that the narrator occasionally inflects a little more than normal, but when it tries, the voice cracks and sounds weird. The voice is also full of pitch jumps and cracks in the middle of sentences, and overall sounds less fluent than any of the Piper models I've tried.

Part of me thinks I just don't have enough data, but I'm also wondering if I should have increased the batch size? Maybe that would help it remember more of the subtler characteristics as it trains. I'm assuming the other settings, such as how many epochs between saved checkpoints and the step interval for generating model samples, only adjust the amount of logging. The main settings I would worry about are batch size and quality, but input on this is welcome if possible!

As you can probably tell, I'm kind of excited to have this working. For what it's worth, the voice in NVDA works surprisingly better than I expected. It isn't pleasant to listen to sometimes, but most things are read clearly, including letters and single words.
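For context on what batch size actually changes: each epoch is roughly ceil(n_train / batch_size) optimizer steps, so a larger batch means fewer but smoother gradient updates per epoch. A quick back-of-the-envelope sketch for the 897-utterance dataset (the 5% validation split is an illustrative assumption, not a confirmed notebook default):

```python
import math

def steps_per_epoch(n_utterances: int, batch_size: int,
                    val_split: float = 0.05) -> int:
    """Optimizer steps per epoch after holding out a validation split.
    The 5% default split is an assumption for illustration."""
    n_train = n_utterances - int(n_utterances * val_split)
    return math.ceil(n_train / batch_size)

for bs in (12, 16, 32):
    print(f"batch_size={bs}: {steps_per_epoch(897, bs)} steps/epoch")
```

This is why checkpoint and sample intervals are indeed only logging knobs, while batch size changes the optimization itself.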

rmcpantoja commented 1 year ago

Hi @musicalman, I think the best thing would be to train it for six hours (about as much as Colab offers, and enough to get something decent) and compare the results. Modifying the batch size may be a solution too; it depends on how many audio clips your dataset has, and for 897 clips I would recommend 12-16. It does seem very strange to me, though: I have a 50-minute dataset with almost 300 clips, so I think your clips are too short, and you should join them so the dataset is more balanced in terms of durations (8-15 seconds per clip).
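The suggestion above, merging many short clips into 8-15 second utterances, can be sketched as a simple greedy grouping over clip durations. This is bookkeeping only; actually concatenating the audio and merging the transcript lines is left out, and the 15-second cap is taken from the recommendation above:

```python
def group_clips(durations, max_len=15.0):
    """Greedily merge consecutive clip durations so each group's total
    stays at or below max_len seconds. Returns lists of clip indices;
    consecutive grouping preserves the word/definition/example order."""
    groups, current, total = [], [], 0.0
    for i, d in enumerate(durations):
        if current and total + d > max_len:
            groups.append(current)
            current, total = [], 0.0
        current.append(i)
        total += d
    if current:
        groups.append(current)
    return groups

# Many 2-4 second vocabulary clips collapse into a few longer utterances:
print(group_clips([3.0, 4.0, 3.5, 6.0, 2.0, 9.0]))
```

Note the last group may still fall short of the 8-second floor; a real pass would fold a too-short trailing group into its predecessor.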