DanRuta / xva-trainer

UI app for training TTS/VC machine learning models for xVASynth, with several audio pre-processing tools, and dataset creation/management.

xVATrainer

xVATrainer is the companion app to xVASynth, the AI text-to-speech app using video game voices. xVATrainer is used for creating the voice models for xVASynth, and for curating and pre-processing the datasets used for training these models. With this tool, you can provide new voices for mod authors to use in their projects.

v1.0 Showcase/overview:

<img src="https://img.youtube.com/vi/PXv_SeTWk2M/0.jpg" alt="xVASynth YouTube demo" width="480" height="360" border="10" />

Links: Steam Nexus Discord Patreon

Check the description on the Nexus page for the most up-to-date information.

---

There are three main components to xVATrainer:

Dataset annotation

The main screen of xVATrainer contains a dataset explorer, which gives you an easy way to view, analyse, and adjust the data samples in your dataset. It also provides recording capabilities, so if you need to record a dataset of your own voice, you can do it straight through the app, in the correct format.

Tools

There are several data pre-processing tools included in xVATrainer, covering almost any data preparation work you may need to do to get your datasets ready for training. There is no fixed step-by-step order in which they need to be run, so long as your datasets end up as 22050Hz mono wav files of clean speech audio, up to about 10 seconds in length, each with an associated transcript file containing that audio file's transcript. Depending on the sources your data comes from, pick whichever tools you need to bring your dataset into that format.
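The target format above (22050Hz, mono, up to ~10 seconds per clip) can be verified programmatically. The following is a minimal sketch, not part of xVATrainer, using only Python's standard-library `wave` module; the function name and limits are illustrative assumptions:

```python
import wave

def check_wav(path, max_seconds=10.0):
    """Return a list of problems with a single wav file (empty if it
    matches the 22050Hz / mono / <=10s target format). Sketch only."""
    problems = []
    with wave.open(str(path), "rb") as w:
        if w.getframerate() != 22050:
            problems.append(f"sample rate is {w.getframerate()}, expected 22050")
        if w.getnchannels() != 1:
            problems.append(f"{w.getnchannels()} channels, expected mono")
        duration = w.getnframes() / w.getframerate()
        if duration > max_seconds:
            problems.append(f"duration {duration:.1f}s exceeds {max_seconds}s")
    return problems
```

Running this over a dataset folder before training is a quick way to catch files that slipped through the pre-processing tools.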

Trainer

xVATrainer contains AI model training for the FastPitch1.1 (with a custom-modified training set-up) and HiFi-GAN models (the xVASynth "v2" models). The training follows a multi-stage approach optimized especially for maximum transfer learning (fine-tuning) quality. The generated models are exported in the format required by xVASynth, ready to use for generating audio.

Batch training is also supported, allowing you to queue up any number of datasets to train, with cross-session persistence. The training panel shows a cmd-like textual log of training progress, a tensorboard-like visual graph of the most relevant metrics, and a task manager-like set of system resource graphs.
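The "cross-session persistence" of the batch queue means the queue survives app restarts. As an illustration of the concept only (this is not xVATrainer's actual implementation, and the class and file names are hypothetical), such a queue can be sketched as a small JSON file kept in sync with every change:

```python
import json
from pathlib import Path

class PersistentQueue:
    """Toy batch-training queue persisted to disk, so that restarting
    the app resumes from where it left off. Illustration only."""
    def __init__(self, path):
        self.path = Path(path)
        # Reload any queue left over from a previous session
        self.items = json.loads(self.path.read_text()) if self.path.exists() else []

    def _save(self):
        self.path.write_text(json.dumps(self.items))

    def add(self, dataset_name):
        self.items.append(dataset_name)
        self._save()

    def pop_next(self):
        if not self.items:
            return None
        item = self.items.pop(0)
        self._save()
        return item
```

Because the file is rewritten on every mutation, a crash or restart at any point loses at most the in-flight item.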

You don't need any programming or machine learning experience. The only required input is starting/pausing/stopping the training sessions; everything else is automated.

Setting up the development environment

Note: these installation instructions are for Windows. For Linux, use the `requirements_linux_py3_10_6.txt` file instead.

Create the environment using virtualenv and Python 3.10 (pre-requisite: Python 3.10): `virtualenv envXVATrainerP310 --python=C:\Users\Dan\AppData\Local\Programs\Python\Python310\python.exe`

Activate your environment (do this every time you launch a new terminal to work with xVATrainer): `envXVATrainerP310\Scripts\activate`

Install PyTorch v2.0.x with CUDA (pre-requisite: CUDA drivers from NVIDIA). Get the v2.0 link from the PyTorch website; for example (may be outdated): `pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118`

Install the remaining dependencies through pip: `pip install -r reqs_gpu.txt`

Copy the folders from `./lib/_dev` into the environment (into `envXVATrainerP310/Lib/site-packages`), overwriting as necessary. These are library files that needed custom modifications/bug fixes to integrate with everything else.
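The copy-and-overwrite step above can be done by hand in Explorer, or with a short script such as this sketch (the paths are taken from the step above; the function name is illustrative):

```python
import shutil
from pathlib import Path

def overlay(src_root, dst_root):
    """Copy every folder under src_root into dst_root, overwriting any
    existing files (like pasting with 'replace' in Explorer)."""
    for entry in Path(src_root).iterdir():
        if entry.is_dir():
            shutil.copytree(entry, Path(dst_root) / entry.name, dirs_exist_ok=True)

# Example usage for this setup step:
# overlay("./lib/_dev", "envXVATrainerP310/Lib/site-packages")
```

`dirs_exist_ok=True` (Python 3.8+) merges into existing package folders instead of failing when the destination already exists.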

Make sure that you're using `librosa==0.8.1` (check with `pip list`; uninstall and re-install with this version if not).

Contributing

If you'd like to help improve xVATrainer, get in touch (e.g. via an Issue, though Discord is best) and let me know. The main areas of interest for community contributions are the following (though let me know your own ideas!):

A current issue/bug is that I can't get HiFi-GAN to train with `num_workers>0`: training always cuts out after a deterministic amount of time, maybe because the training loop runs inside a secondary thread (FastPitch works fine, though). Help with this would be especially welcome.

In a similar vein (and maybe caused by the same issue), I can only set the tools' maximum number of multi-processing workers to just under half the total CPU thread count; otherwise a tool can only run once, after which the app needs restarting before it can be used again.