
Piper-Training-Guide-with-Screen-Reader

A guide to help newcomers to the Piper TTS system create voices for NVDA and other screen readers down the line.

Introduction

Welcome to this guide on training your own custom TTS voices using Piper, a fast, local text-to-speech engine optimized for low-end hardware such as the Raspberry Pi. Unlike many of the TTS engines blind people may be familiar with, Piper is based on some of the latest advancements in machine learning for speech synthesis. Under the hood it uses VITS, an end-to-end speech synthesis model, and exports voices to the ONNX runtime. We will be using Piper with NVDA, a free and open-source screen reader for the Windows operating system. Currently this is the only screen reader that is supported, but as the synthesizer develops I'm sure more platforms will be added in the future.

With Piper being based on machine learning, a lot of people may understandably have concerns about its performance. Right now, the quality is what I would characterize as "acceptable" for a screen reader user. This will doubtless improve as updates are made, but I personally think it's good enough for basic web browsing, checking email, reading social media, and so on. The add-on we will be using also supports speeding the voices up considerably without audio degradation; however, due to a bug in the latest release that renders the synth inoperable, we will be using a slightly older version that does not yet implement this functionality. I will update the guide once this is fixed.

Getting Things Ready

Before we have fun training a model, let's get everything set up first so we can test right away. Click here to download the add-on directly. If you would like to learn more or get updates, click here. After installation, a dialog will pop up the next time you restart NVDA, explaining that you do not currently have any voices installed and offering to take you to the Piper samples page, where you can preview and download any voices you would like. Installing a new voice is very easy: go to NVDA's settings and locate the Piper category. There you will find a button to install voices from a local archive; simply choose the one you want and press Enter. Take some time to play around with the different voices that are available and get a feel for how the synth operates, then come back and we'll get to the fun stuff.

Creating Datasets

In any ML task, collecting sufficient data is probably the single most important thing you can do to get a high-quality result. For TTS, this data consists of audio files of a single speaker reading from a script, split at sentence breaks, along with an accompanying text transcript. Piper makes your job easier by keeping the format very simple. For the audio, you can use either 16 kHz or 22.05 kHz mono .wav files at 16-bit resolution. For the text, format it according to the popular LJSpeech convention: the first column contains the name of the audio file, with or without the extension, and the second column contains the text transcript, with the two separated by a pipe character. The format looks like this:

audio1|This is the first sentence.
audio2|This is the second sentence.

Note that unlike LJSpeech, you do not need to repeat the text transcript a second time; that column is reserved for multi-speaker models, where it provides speaker IDs. Note also that it is fine to include the path to your audio files before the file name, but it is not necessary, as Piper will handle all of that for you behind the scenes.
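
If you want to sanity-check your dataset before uploading it, a few lines of Python are enough. This is only a sketch; the folder and file names below are examples, not anything Piper requires.

```python
# Minimal sketch: check an LJSpeech-style transcript against the audio folder.
# "my_dataset" and "metadata.csv" are example names.
import os
import wave

DATASET_DIR = "my_dataset"
METADATA = "metadata.csv"

with open(METADATA, encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        parts = line.rstrip("\n").split("|")
        if len(parts) != 2:
            print(f"Line {line_no}: expected 'file|text', got {len(parts)} fields")
            continue
        name, _text = parts
        wav_path = os.path.join(DATASET_DIR, name if name.endswith(".wav") else name + ".wav")
        if not os.path.exists(wav_path):
            print(f"Line {line_no}: missing audio file {wav_path}")
            continue
        with wave.open(wav_path, "rb") as wav:
            if wav.getnchannels() != 1 or wav.getframerate() not in (16000, 22050):
                print(f"{wav_path}: expected mono audio at 16 or 22.05 kHz")
```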

Tips For Collecting Data

When working with TTS models, the quality of your data is very important. If your recordings have background noise, other people talking, and so on, you will not get a satisfactory result, and the model will have a harder time learning the characteristics of your speaker. Studio-quality recordings are ideal for this kind of work, but even a laptop mic in a relatively low-noise environment should do the trick. For scripts, try to find text with wide phoneme coverage, such as public domain books or Wikipedia articles. I'll link to a few example scripts at the end of this guide to get you started. The amount of data to use is completely up to you, but I would recommend at least five minutes to begin. While that might sound insufficient to people familiar with older TTS systems, such as concatenation-based synthesis, machine learning is quite different: you do not need nearly as much data to produce a high-quality result, and even just an hour of speech will get you a voice that sounds great in most scenarios. Obviously, the more data the better, but don't feel like you have to get everything all at once; you can always go back and retrain the model later.
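
If your recordings aren't already in the format Piper expects, you can convert them before building the transcript. Here is a rough sketch using ffmpeg, which you would need to have installed; the folder names are placeholders.

```python
# Convert arbitrary .wav recordings to mono, 16-bit, 22.05 kHz for Piper.
# Assumes ffmpeg is on your PATH; "raw_recordings" and "my_dataset" are examples.
import subprocess
from pathlib import Path

src_dir = Path("raw_recordings")
dst_dir = Path("my_dataset")
dst_dir.mkdir(exist_ok=True)

for src in sorted(src_dir.glob("*.wav")):
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src),
         "-ac", "1",              # mix down to mono
         "-ar", "22050",          # resample to 22.05 kHz
         "-c:a", "pcm_s16le",     # 16-bit samples
         str(dst_dir / src.name)],
        check=True,
    )
```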

Training Your Model

To train, we will be using a service from Google called Colaboratory (Colab). This is a Jupyter-notebook-based environment that gives you free access to a high-powered GPU in the cloud. Most ML-based TTS models require a high-end machine with an Nvidia GPU for training but can run on lower-end hardware when synthesizing audio. Piper is no different, although it can also train on a CPU; it will just be much slower. Here is a link to a Colab notebook that will allow you to perform the model training. After training is complete, you must export your model to the ONNX runtime for use with the speech engine, and this notebook will allow you to do that. The export notebook also links to another one that lets you test your generated model by typing in text.

Uploading To Drive

Before opening the notebook, zip up your audio files into a folder and upload it to Google Drive. If you are unfamiliar with this process, you may need to do some research online for the platform you're using, as I cannot provide assistance for every environment someone might happen to be running. Note that there is a desktop app that makes this easier by letting you copy files from File Explorer or Finder directly to Google Drive, so you may wish to consider that option. At least on Mac, it is likely that additional hidden files will be created when you zip up your folder. To prevent this, open your folder of audio files, select all of them with Command+A, then right-click and choose the "Compress" option from the menu that appears.
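
If the Finder tip gives you trouble, another option is to build the archive with a small script, which avoids macOS metadata entirely. This is just a sketch, and the names are placeholders.

```python
# Create a clean zip of only the .wav files, with no hidden macOS entries.
# "my_dataset" and "my_dataset.zip" are example names.
import zipfile
from pathlib import Path

dataset_dir = Path("my_dataset")
archive = Path("my_dataset.zip")

with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
    for wav in sorted(dataset_dir.glob("*.wav")):
        zf.write(wav, arcname=wav.name)  # store files flat inside the archive

print("Wrote", archive)
```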

Using The Training Notebook

Installing Software

All of these notebooks are well designed and mostly self-explanatory, so I will give a high-level overview of what you need to do in order to train and test a model. When you open the training notebook, you will need to install some dependencies. Everything is split up by headings, so it's quite easy to navigate, if a little cluttered. At the top you will find a few cells to prevent Colab from disconnecting prematurely, as well as one to check which GPU you have been assigned. It is up to you whether you run these, and it will not adversely affect training if you don't. After this, you will find a cell to mount your Google Drive, which is used to store model checkpoints. Next is a cell that installs all the software Piper needs to run correctly. Simply click the "Run cell" button to the left of the installation section and grant permission for the notebook to run when prompted. It is also wise to save a copy of the notebook in your Drive in case later changes break compatibility; you can do this from the File menu in the menu bar at the top.
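
For reference, the Drive-mounting cell boils down to a couple of lines of Colab-specific Python. You don't need to type this yourself, since the notebook already contains it; it's shown here only for orientation.

```python
# What the "mount Google Drive" cell essentially does.
from google.colab import drive

drive.mount("/content/drive")  # opens an authorization prompt in the browser
# After this, your Drive is available under /content/drive/MyDrive/
```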

While the cell runs, you should see output nearby that looks like a terminal. Under the hood you are essentially using a Linux virtual machine hosted on Google's servers, so if you're familiar with that OS, a lot of the commands will look similar. I can't speak for Windows, but on Mac, each cell's output is contained in a frame next to its section, and before it you will see whether the cell is currently executing or has finished. Wait until the cell has finished running before proceeding to the next section. After installation has completed, you will need to restart the runtime; to do this, find the relevant button in the cell output and click it.

Uploading Files

After getting everything installed, it's time to upload your dataset. Enter the path to your .zip file in the next cell and click run. After this, upload your .csv file containing the transcript: find the upload button located in the output frame of the relevant cell and click it, select the file in your OS file browser, and once you press upload, the file will be saved immediately.
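
Under the hood, those two cells are doing something like the following; the paths are placeholders for wherever you put your archive in Drive.

```python
# Sketch of the upload step: unpack the dataset archive from Drive, then use
# Colab's file picker for the transcript. Paths are examples only.
import zipfile
from google.colab import files

with zipfile.ZipFile("/content/drive/MyDrive/my_dataset.zip") as zf:
    zf.extractall("/content/dataset")

uploaded = files.upload()                  # opens your OS file browser
print("Received:", list(uploaded.keys()))  # e.g. ['metadata.csv']
```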

Pre-Processing And Configuring Training Settings

After everything has been successfully uploaded, it is time to pre-process your dataset. Please look at every parameter carefully before clicking the run cell button. Among other things, you will be asked to choose a name for your model, the language, and the sample rate of your data. For English, I would recommend choosing US English for now, as this affects which pre-trained models are available to fine-tune from, and currently LJSpeech produces the best results in my testing. If your data is UK English, this will lead to some incorrect pronunciation, so feel free to experiment with British English if you wish.
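
Behind those form fields, the notebook runs Piper's pre-processing module. As a rough illustration, the command looks something like this; the exact flags can differ between Piper versions, so rely on the notebook's own fields rather than typing this by hand.

```python
# Illustrative only: the kind of pre-processing command the notebook assembles.
# Paths and the output folder name are placeholders.
!python -m piper_train.preprocess \
  --language en-us \
  --input-dir /content/dataset \
  --output-dir /content/drive/MyDrive/my_model \
  --dataset-format ljspeech \
  --single-speaker \
  --sample-rate 22050
```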

After the pre-processing step has completed, you must configure various training settings for your model. You can leave most of these at their defaults, though you may wish to increase "validation split" to 0.05, depending on the size of your dataset; this controls how much of your data is split off for evaluating Piper's performance as it trains. Another parameter to look at is the quality setting. In all of my tests so far I have set this to medium, but you can experiment with other quality levels if you wish. After clicking the run button on this cell, you will see a drop-down menu in the output to select the model you wish to fine-tune from. Be sure to select a model with the same quality level you chose earlier so that everything works correctly.

Training

All right, finally it's time to train! If you wish, you can load up TensorBoard, a model evaluation dashboard created by Google, to listen to audio samples during training; however, this is currently broken and I would not recommend using it for now. If you did everything right, clicking the run button on the training cell will begin the process. Depending on the number of epochs you have set, training may not stop by itself, but once you have checkpoints saved in Drive you can interrupt it at any time by clicking the run button again. How long to train depends on how much data you have, but I've gotten good results after about two hours or so; feel free to experiment. Note that although there is a way to continue training later, I have not yet figured it out. I will update the guide once I do.
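
For the curious, the training cell ultimately invokes Piper's training module with the settings you chose. A sketch of that invocation is below; the argument names may vary with the Piper version bundled in the notebook, and the checkpoint path is a placeholder for the pre-trained model you selected, so treat this as an illustration rather than something to copy verbatim.

```python
# Illustrative training command; the notebook builds the real one from your
# settings. Paths are placeholders.
!python -m piper_train \
  --dataset-dir /content/drive/MyDrive/my_model \
  --accelerator gpu --devices 1 \
  --batch-size 16 \
  --validation-split 0.05 \
  --quality medium \
  --checkpoint-epochs 1 \
  --max_epochs 10000 \
  --resume_from_checkpoint /content/pretrained.ckpt \
  --precision 32
```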

Exporting and Testing

Model files are saved in the working directory you specified earlier in this process. Under "lightning_logs" you will find a checkpoints folder that contains the most recent checkpoint as training progresses, and in the root of your model folder you will also find a config.json that you will need in order to export the finished voice. Note that these checkpoints are very large, so keep an eye on your trash folder if you are on the free storage plan and empty it as needed.
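
If you ever need to find the newest checkpoint yourself, for example before exporting, a short Colab cell like this will do it; "my_model" is a placeholder for the working directory name you chose during pre-processing.

```python
# Locate the most recently written checkpoint under lightning_logs.
from pathlib import Path

ckpts = sorted(
    Path("/content/drive/MyDrive/my_model/lightning_logs").rglob("*.ckpt"),
    key=lambda p: p.stat().st_mtime,
)
print("Latest checkpoint:", ckpts[-1] if ckpts else "none found yet")
```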

Similar to the training notebook, the export and testing notebooks require you to install software before running them. Both notebooks include an accessibility feature that plays voice prompts when input is needed from you, which is very helpful because it saves you from constantly checking the output. Everything here is quite self-explanatory, but I will briefly talk about creating links to your model files so they can be exported. In Google Drive, you must create shareable links that can be viewed by everyone and pass them into both notebooks; this process differs depending on the platform you are using. On the web, you can right-click a file and choose "Copy link." Make sure you also choose "Manage access" and change the permission to "Anyone with the link." Do this for both your checkpoint and configuration files, then paste the links into the relevant text fields.

Fill in the other parameters as you did for the training notebook, and click run. Please note that currently you must leave the option to create a model card checked, otherwise the process will fail. A model card is simply a text file that records information such as the sample rate and the dataset you used, but you don't need to fill it in if you don't want to. After the model has been created, run the cell below it to download it to your computer. Note that it will take some time for the download to begin, as Colab storage is very slow, but for now this is how the notebook is configured.
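
For context, the export itself comes down to converting the Lightning checkpoint to ONNX and pairing it with its config file. The notebook wraps this for you; the sketch below reflects recent Piper releases, and the paths are placeholders for your own files in Drive.

```python
# Illustrative export step: Lightning checkpoint -> ONNX model plus JSON config.
# Substitute your actual checkpoint file name and paths.
!python -m piper_train.export_onnx \
  /content/drive/MyDrive/my_model/lightning_logs/version_0/checkpoints/last.ckpt \
  /content/my_voice.onnx
!cp /content/drive/MyDrive/my_model/config.json /content/my_voice.onnx.json
```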

If you don't use Windows or NVDA but want to play around with your model, or with any others that people share with you or that you find elsewhere, open up the testing notebook. It is very similar to the export notebook: you install the software and copy a link to your .tar file into the box provided. Just like the model exporter, voice prompts can be enabled to make the process a little easier. You also have control over the speech rate and a few parameters relating to variation, so feel free to play around with these as long as you like. Note that although the input text box is not multi-line, you can paste any text you wish from the clipboard and Piper will read it for you.
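
At its core, the testing notebook feeds the text you type to Piper and plays back the result. If you want to do the same thing manually, either in a Colab cell or on your own machine with the piper command-line tool (installable on many systems via the piper-tts Python package), it looks roughly like this; the file names are examples.

```python
# Quick smoke test with the piper command-line tool: synthesize one sentence
# to a .wav file using the exported model (its .onnx.json must sit beside it).
!echo "Hello from my custom Piper voice." | piper --model my_voice.onnx --output_file test.wav
```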

Conclusion

So that's it, you've trained your first Piper model! While the process may seem quite involved at first glance, as you go through it I think you'll find it's not as hard as you might imagine. If you have any questions or wish to make corrections to the guide, please don't hesitate to submit an issue or PR and I will get back to you ASAP. Thanks for reading, and I can't wait to see what you do with this technology. Happy training!