encodec-pytorch

[!IMPORTANT] This is an unofficial implementation of the paper High Fidelity Neural Audio Compression in PyTorch.

The LibriTTS960h 24khz encodec checkpoint and disc checkpoint is release in https://huggingface.co/zkniu/encodec-pytorch/tree/main

I hope we can get together to do something meaningful and rebuild encodec in this repo.

Introduction

This repository is based on encodec and EnCodec_Trainer.

Based on the EnCodec_Trainer, I have made the following changes:

support multi-gpu training.
support AMP training (you need to reduce learning rate and scale vq epsilon from 1e-5 to 1e-3, the reason you can check issue 8)
- Couldn't work, so don't use amp
support hydra configuration management.
align the loss functions and hyperparameters.
support warmup scheduler in training.
support the test script to test the model.
support tensorboard to monitor the training process.
support 48khz and stereo models, thanks @leoauri in https://github.com/ZhikangNiu/encodec-pytorch/pull/22.
support slurm training, thanks @leoauri. in https://github.com/ZhikangNiu/encodec-pytorch/pull/22.
support loss balancer, thanks @leoauri. in https://github.com/ZhikangNiu/encodec-pytorch/pull/22.
You can find all the training scripts in scripts folder

Enviroments

The code is tested on the following environment:

Python 3.9
PyTorch 2.0.0 / PyTorch 1.13
GeForce RTX 3090 x 4 / V100-16G x 8 / A40 x 3 / A100 x 1

In order to you can run the code, you can install the environment by the help of requirements.txt.

Usage

Training

1. Prepare dataset

I use the librispeech as the train datasets and use the datasets/generate_train_file.py generate train csv which is used in the training process. You can check the datasets/generate_train_file.py and customAudioDataset.py to understand how to prepare your own dataset. Also you can use ln -s to link the dataset to the datasets folder.

[Optional] Docker image

I provide a dockerfile to build a docker image with all the necessary dependencies.

Building the image
```
docker build -t encodec:v1 .
```

Using the image

# CPU running
docker run encodec:v1 <command> # you can add some parameters, such as -tid
# GPU running
docker run --gpus=all encodec:v1 <command>

2. Train

You can use the following command to train the model using multi gpu:

CUDA_VISIBLE_DEVICES=0,1,2,3 python train_multi_gpu.py \
                    distributed.torch_distributed_debug=False \
                    distributed.find_unused_parameters=True \
                    distributed.world_size=4 \
                    common.save_interval=2 \
                    common.test_interval=2 \
                    common.max_epoch=100 \
                    datasets.tensor_cut=100000 \
                    datasets.batch_size=8 \
                    datasets.train_csv_path=YOUR TRAIN DATA.csv \
                    lr_scheduler.warmup_epoch=20 \
                    optimization.lr=5e-5 \
                    optimization.disc_lr=5e-5 \

Note:

if you set a small datasets.tensor_cut, you can set a large datasets.batch_size to speed up the training process.
when you are training on your own dataset, I suggest you need to choose a moderate-length audio, because If you train your encodec with 1 senconds tensorcut in a small dataset and the encodec model dosen't perform well.
if you encounter bug about RuntimeError(f"Mismatch in number of params: ours is {len(params)}, at least one worker has a different one."). You can use a small datasets.tensor_cut to solve this problem.
if your torch version is lower 1.8, you need to check the default value of torch.stft(return_complex) in the audio_to_mel.py
if you encounter bug about multi-gpu training, you can try to set distributed.torch_distributed_debug=True to get more message about this problem.

the single gpu training method is similar to the multi-gpu training method, you only need to set the distributed.data_parallel=False parameter to the command, like this:

    python train_multi_gpu.py distributed.data_parallel=False
                        common.save_interval=5 \
                        common.max_epoch=100 \
                        datasets.tensor_cut=72000 \
                        datasets.batch_size=4 \
                        datasets.train_csv_path=YOUR TRAIN DATA.csv \
                        lr_scheduler.warmup_epoch=10 \
                        optimization.lr=5e-5 \
                        optimization.disc_lr=5e-5 \

the loss is not converged to zero, but the model can be used to compress and decompress the audio. you can use the compression.sh to test your model in every log_interval epoch.
the original paper dataset is larger than 17000h, but I only use LibriTTS960h to train the model, so the model is not good enough. If you want to train a better model, you can use the larger dataset.
The code is not well tested, so there may be some bugs. If you encounter any problems, you can open an issue or contact me by email.
When I add AMP training, I found the RVQ loss always be nan, and I use L2 norm to normalized quantize and x, like the code -> actually, it's unstable.
```
    quantize = F.normalize(quantize)  
    commit_loss = F.mse_loss(quantize.detach(), x)
```
When you try to use amp training, you need to reduce learning rate and scale vq epsilon from 1e-5 to 1e-3, the reason you can check issue 8
I suggest you need to focus on the generator loss, the commit loss it could be not converge, you can check some objective metrics about pesq, stoi.

Slurm

Usage will depend on your cluster setup, but see scripts/train.sbatch for an example. This uses a container with the dependencies installed. Run sbatch scripts/train.sbatch from the repository root to use.

Test

I have add a shell script to compress and decompress the audio by different bandwidth, you can use the compression.sh to test your model.

The script can be used as follows:

sh compression.sh INPUT_WAV_FILE [MODEL_NAME] [CHECKPOINT]

INPUT_WAV_FILE is the wav file you want to test
MODEL_NAME is the model name, default is encodec_24khz,support encodec_48khz, my_encodec,encodec_bw
CHECKPOINT is the checkpoint path, when your MODEL_NAME is my_encodec,you can point out the checkpoint

if you want to test the model at a specific bandwidth, you can use the following command:

python main.py -r -b [bandwidth] -f [INPUT_FILE] [OUTPUT_WAV_FILE] -m [MODEL_NAME] -c [CHECKPOINT]

main.py from the encodec , you can use the -h to check the help information.

Acknowledgement

Thanks to the following repositories:

LICENSE

The code is same as encodec LICENSE.

ZhikangNiu / encodec-pytorch

readme