A PyTorch implementation of "Towards Achieving Robust Universal Neural Vocoding". Audio samples can be found here, a Colab demo can be found here, and an accompanying Tacotron implementation can be found here.
Ensure you have Python 3.6 and PyTorch 1.7 or greater installed. Then install the package with:

```shell
pip install univoc
```
```python
import torch
import soundfile as sf
from univoc import Vocoder

# download pretrained weights (and optionally move to GPU)
vocoder = Vocoder.from_pretrained(
    "https://github.com/bshall/UniversalVocoding/releases/download/v0.2/univoc-ljspeech-7mtpaq.pt"
).cuda()

# load log-Mel spectrogram from file or from tts (see https://github.com/bshall/Tacotron for an example)
mel = ...

# generate waveform
with torch.no_grad():
    wav, sr = vocoder.generate(mel)

# save output
sf.write("path/to/save.wav", wav, sr)
```
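The `mel = ...` placeholder above stands for a log-Mel spectrogram tensor. A minimal sketch of shaping one for the vocoder, using a random array as a stand-in for a real spectrogram (the 80-bin count and the `(batch, frames, mels)` layout are assumptions; check the output of `preprocess.py` for the exact shape this model expects):

```python
import numpy as np
import torch

NUM_MELS = 80   # assumption: 80 mel bins, a common default
FRAMES = 100    # number of spectrogram frames (any length)

# Stand-in for a real log-Mel spectrogram, e.g. one loaded with
# np.load(...) from the preprocessing output or produced by a TTS model.
mel = np.random.randn(FRAMES, NUM_MELS).astype(np.float32)

# Add a batch dimension before passing the tensor to vocoder.generate(...)
mel = torch.from_numpy(mel).unsqueeze(0)  # shape: (1, FRAMES, NUM_MELS)
```

Move the tensor to the same device as the vocoder (e.g. `mel.cuda()`) before generating.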
Clone the repository and install the requirements:

```shell
git clone https://github.com/bshall/UniversalVocoding
cd ./UniversalVocoding
pip install -r requirements.txt
```

Download and extract the LJSpeech dataset:

```shell
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar -xvjf LJSpeech-1.1.tar.bz2
```

Preprocess the dataset:

```shell
python preprocess.py in_dir=path/to/LJSpeech-1.1 out_dir=datasets/LJSpeech-1.1
```

Train the model:

```shell
python train.py checkpoint_dir=ljspeech dataset_dir=datasets/LJSpeech-1.1
```
Pretrained weights for the 10-bit LJ-Speech model are available here.
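The "10-bit" in the model name refers to the resolution at which the waveform is quantized: each sample is companded and mapped to one of 1024 discrete classes. A minimal sketch of 10-bit mu-law companding, the standard scheme for this kind of autoregressive vocoder (a per-sample illustration, not the repo's exact implementation):

```python
import math

BITS = 10
LEVELS = 2 ** BITS  # 1024 discrete classes
MU = LEVELS - 1     # companding constant, mu = 1023

def mulaw_encode(x: float) -> int:
    """Map a sample in [-1, 1] to an integer class in [0, LEVELS - 1]."""
    # mu-law compression: log-scale the magnitude, keep the sign
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    # rescale from [-1, 1] to [0, MU] and round to the nearest class
    return int((y + 1) / 2 * MU + 0.5)

def mulaw_decode(c: int) -> float:
    """Invert the companding: class index back to a sample in [-1, 1]."""
    y = 2 * c / MU - 1
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)
```

The companding spends more of the 1024 levels near zero, where speech energy concentrates, so a 10-bit class index round-trips small-amplitude samples with much less audible error than plain linear quantization would.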