This is an unofficial implementation of the paper High Fidelity Neural Audio Compression in PyTorch.

> [!IMPORTANT]
> The LibriTTS960h 24kHz EnCodec checkpoint and discriminator checkpoint are released at https://huggingface.co/zkniu/encodec-pytorch/tree/main
>
> I hope we can get together to do something meaningful and rebuild EnCodec in this repo.
This repository is based on encodec and EnCodec_Trainer.
Based on EnCodec_Trainer, I have made the following changes:
The code is tested on the following environment:
To run the code, you can install the environment with the help of `requirements.txt`.
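A typical installation, assuming a Python environment is already active:

```bash
pip install -r requirements.txt
```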
I use LibriSpeech as the training dataset and use `datasets/generate_train_file.py` to generate the train CSV used in the training process. You can check `datasets/generate_train_file.py` and `customAudioDataset.py` to understand how to prepare your own dataset.

You can also use `ln -s` to link the dataset into the `datasets` folder, as sketched below.
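Concretely, preparation might look like the following; the corpus path is a placeholder and the script's exact arguments are an assumption, so check `datasets/generate_train_file.py` for the real interface:

```bash
# Link your audio corpus into the datasets folder (path is an example)
ln -s /path/to/LibriSpeech datasets/LibriSpeech

# Generate the train CSV consumed by the training scripts
python datasets/generate_train_file.py
```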
I provide a Dockerfile to build a Docker image with all the necessary dependencies.

```bash
docker build -t encodec:v1 .
```

```bash
# CPU running
docker run encodec:v1 <command> # you can add some parameters, such as -tid

# GPU running
docker run --gpus=all encodec:v1 <command>
```
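For example, a training run inside the container might be launched like this, assuming the image's default working directory is the repository root (the training command and its flags are covered in the next section; `-tid` detaches the container with a TTY):

```bash
docker run --gpus=all -tid encodec:v1 python train_multi_gpu.py distributed.data_parallel=False
```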
You can use the following command to train the model on multiple GPUs:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python train_multi_gpu.py \
    distributed.torch_distributed_debug=False \
    distributed.find_unused_parameters=True \
    distributed.world_size=4 \
    common.save_interval=2 \
    common.test_interval=2 \
    common.max_epoch=100 \
    datasets.tensor_cut=100000 \
    datasets.batch_size=8 \
    datasets.train_csv_path=YOUR_TRAIN_DATA.csv \
    lr_scheduler.warmup_epoch=20 \
    optimization.lr=5e-5 \
    optimization.disc_lr=5e-5
```
Note:

- If you set a small `datasets.tensor_cut`, you can set a large `datasets.batch_size` to speed up the training process.
- If you encounter `RuntimeError(f"Mismatch in number of params: ours is {len(params)}, at least one worker has a different one.")`, you can use a small `datasets.tensor_cut` to solve this problem.
- Newer PyTorch versions require `return_complex` to be set explicitly, which affects the `torch.stft(return_complex)` call in `audio_to_mel.py`.
- You can set `distributed.torch_distributed_debug=True` to get more messages about distributed-training problems.

If you want to train the model on a single GPU, you can add the `distributed.data_parallel=False` parameter to the command, like this:
```bash
python train_multi_gpu.py distributed.data_parallel=False \
    common.save_interval=5 \
    common.max_epoch=100 \
    datasets.tensor_cut=72000 \
    datasets.batch_size=4 \
    datasets.train_csv_path=YOUR_TRAIN_DATA.csv \
    lr_scheduler.warmup_epoch=10 \
    optimization.lr=5e-5 \
    optimization.disc_lr=5e-5
```
You can also use `compression.sh` to test your model at every `log_interval` epoch.

The loss can become `nan` during training. I use an L2 norm to normalize `quantize` and `x` in the commitment loss, like the code below; actually, it is still unstable:

```python
quantize = F.normalize(quantize)
commit_loss = F.mse_loss(quantize.detach(), x)
```
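For context, here is a self-contained sketch of that commitment-loss computation; the tensor shapes and the normalization dimension are illustrative assumptions, not taken from the repository:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: (batch, channels, time)
x = torch.randn(4, 128, 100)         # encoder output
quantize = torch.randn(4, 128, 100)  # quantized representation

# L2-normalize the quantized vectors (dim=1 assumed to be channels)
quantize = F.normalize(quantize, dim=1)

# Commitment loss: pull the encoder output toward the detached
# quantized representation
commit_loss = F.mse_loss(quantize.detach(), x)
print(commit_loss.item())
```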
For SLURM clusters, usage will depend on your setup, but see `scripts/train.sbatch` for an example; it uses a container with the dependencies installed. Run `sbatch scripts/train.sbatch` from the repository root to use it.
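For readers new to SLURM, such a batch script usually looks roughly like the sketch below; the resource numbers and the launch line are illustrative assumptions, so defer to `scripts/train.sbatch` for the actual version:

```bash
#!/bin/bash
#SBATCH --job-name=encodec-train
#SBATCH --nodes=1
#SBATCH --gpus-per-node=4
#SBATCH --time=48:00:00

# Illustrative launch line; the real script runs inside a container
srun python train_multi_gpu.py \
    distributed.world_size=4 \
    datasets.train_csv_path=YOUR_TRAIN_DATA.csv
```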
I have added a shell script to compress and decompress audio at different bandwidths; you can use `compression.sh` to test your model.
The script can be used as follows:
```bash
sh compression.sh INPUT_WAV_FILE [MODEL_NAME] [CHECKPOINT]
```
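For instance (the file and checkpoint names here are hypothetical):

```bash
sh compression.sh test.wav my_encodec checkpoints/my_encodec.pt
```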
`MODEL_NAME` defaults to `encodec_24khz`; `encodec_48khz`, `my_encodec`, and `encodec_bw` are also supported. If you use `my_encodec`, you need to point out the checkpoint yourself.

If you want to test the model at a specific bandwidth, you can use the following command:
```bash
python main.py -r -b [bandwidth] -f [INPUT_FILE] [OUTPUT_WAV_FILE] -m [MODEL_NAME] -c [CHECKPOINT]
```
`main.py` is adapted from the encodec repository; you can use the `-h` flag to check the help information.
Thanks to the following repositories:

- encodec
- EnCodec_Trainer
The license is the same as the encodec LICENSE.