facebookresearch / AudioDec

An Open-source Streaming High-fidelity Neural Audio Codec
Other
445 stars 21 forks source link

Creative Commons License This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec

Highlights

Abstract

A good audio codec for live applications such as telecommunication is characterized by three key properties: (1) compression, i.e. the bitrate that is required to transmit the signal should be as low as possible; (2) latency, i.e. encoding and decoding the signal needs to be fast enough to enable communication without or with only minimal noticeable delay; and (3) reconstruction quality of the signal. In this work, we propose an open-source, streamable, and real-time neural audio codec that achieves strong performance along all three axes: it can reconstruct highly natural sounding 48 kHz speech signals while operating at only 12 kbps and running with less than 6 ms (GPU)/10 ms (CPU) latency. An efficient training paradigm is also demonstrated for developing such neural audio codecs for real-world scenarios. [paper] [demo]

Two modes of AudioDec

  1. AutoEncoder (symmetric AudioDec, symAD)
    1-1. Train an AutoEncoder-based codec model from scratch with only metric loss(es) for the first 200k iterations.
    1-2. Fix the encoder, projector, quantizer, and codebook, and train the decoder with the discriminators for the following 500k iterations.
  2. AutoEncoder + Vocoder (AD v0,1,2) (recommended!)
    2-1. Extract the stats (global mean and variance) of the codes extracted by the trained Encoder.
    2-2. Train the vocoder with the trained Encoder and stats for 500k iterations.

NEWS

Requirements

This repository is tested on Ubuntu 20.04 using a V100 and the following settings.

Folder architecture

Run real-time streaming encoding/decoding demo

  1. Please download the whole exp folder and put it in the AudioDec project directory.
  2. Get the list of all I/O devices
    $ python -m sounddevice
  3. Run the demo
    
    # The LibriTTS model is recommended for arbitrary microphones because of the robustness of microphone channel mismatches.
    # Set up the I/O devices according to the list of I/O devices

w/ GPU

$ python demoStream.py --tx_cuda 0 --rx_cuda 0 --input_device 1 --output_device 4 --model libritts_v1

w/ CPU

$ python demoStream.py --tx_cuda -1 --rx_cuda -1 --input_device 1 --output_device 4 --model libritts_sym

The input and out audios will be dumped into input.wav and output.wav


## Run codec demo with files
1. Please download the whole [exp](https://github.com/facebookresearch/AudioDec/releases/download/pretrain_models_v02/exp.zip) folder and put it in the AudioDec project directory.  
2. Run the demo  
```bash
## VCTK 48000Hz models
$ python demoFile.py --model vctk_v1 -i xxx.wav -o ooo.wav

## LibriTTS 24000Hz model
$ python demoFile.py --model libritts_v1 -i xxx.wav -o ooo.wav

Training and testing the whole AudioDec pipeline

  1. Prepare the training/validation/test utterances and put them in three different folders
    ex: corpus/train, corpus/dev, and corpus/test
  2. Modify the paths (ex: /mnt/home/xxx/datasets) in
    submit_codec_vctk.sh
    config/autoencoder/symAD_vctk_48000_hop300.yaml
    config/statistic/symAD_vctk_48000_hop300_clean.yaml
    config/vocoder/AudioDec_v1_symAD_vctk_48000_hop300_clean.yaml
  3. Assign corresponding analyzer and stats in config/statistic/symAD_vctk_48000_hop300_clean.yaml
    config/vocoder/AudioDec_v1_symAD_vctk_48000_hop300_clean.yaml
  4. Follow the usage instructions in submit_codec_vctk.sh to run the training and testing
    
    # stage 0: training autoencoder from scratch
    # stage 1: extracting statistics
    # stage 2: training vocoder from scratch
    # stage 3: testing (symAE)
    # stage 4: testing (AE + Vocoder)

Run stages 0-4

$ bash submit_codec.sh --start 0 --stop 4 \ --autoencoder "autoencoder/symAD_vctk_48000_hop300" \ --statistic "stati/symAD_vctk_48000_hop300_clean" \ --vocoder "vocoder/AudioDec_v1_symAD_vctk_48000_hop300_clean"


## Training and testing only the AutoEncoder
1. Prepare the training/validation/test utterances and modify the paths 
2. Follow the usage instructions in **submit_autoencoder.sh** to run the training and testing
```bash
# Train AutoEncoder from scratch
$ bash submit_autoencoder.sh --stage 0 \
--tag_name "autoencoder/symAD_vctk_48000_hop300"

# Resume AutoEncoder from previous iterations
$ bash submit_autoencoder.sh --stage 1 \
--tag_name "autoencoder/symAD_vctk_48000_hop300" \
--resumepoint 200000

# Test AutoEncoder
$ bash submit_autoencoder.sh --stage 2 \
--tag_name "autoencoder/symAD_vctk_48000_hop300"
--subset "clean_test"

Pre-trained Models

All pre-trained models can be accessed via exp (only the generators are provided).

AutoEncoder Corpus Fs Bitrate Path
symAD VCTK 48 kHz 24 kbps exp/autoencoder/symAD_c16_vctk_48000_hop320
symAAD VCTK 48 kHz 12.8 kbps exp/autoencoder/symAAD_vctk_48000_hop300
symAD VCTK 48 kHz 12.8 kbps exp/autoencoder/symAD_vctk_48000_hop300
symAD_univ VCTK 48 kHz 12.8 kbps exp/autoencoder/symADuniv_vctk_48000_hop300
symAD LibriTTS 24 kHz 6.4 kbps exp/autoencoder/symAD_libritts_24000_hop300
Vocoder Corpus Fs Path
AD v0 VCTK 48 kHz exp/vocoder/AudioDec_v0_symAD_vctk_48000_hop300_clean
AD v1 VCTK 48 kHz exp/vocoder/AudioDec_v1_symAD_vctk_48000_hop300_clean
AD v2 VCTK 48 kHz exp/vocoder/AudioDec_v2_symAD_vctk_48000_hop300_clean
AD_univ VCTK 48 kHz exp/vocoder/AudioDec_v3_symADuniv_vctk_48000_hop300_clean
AD v1 LibriTTS 24 kHz exp/vocoder/AudioDec_v1_symAD_libritts_24000_hop300_clean

Bonus Track: Denoising

  1. It is easy to perform denoising by just updating the encoder using noisy-clean pairs (keeping the decoder/vocoder the same).
  2. Prepare the noisy-clean corpus and follow the usage instructions in submit_denoise.sh to run the training and testing
    
    # Update the Encoder for denoising
    $ bash submit_denoise.sh --stage 0 \
    --tag_name "denoise/symAD_vctk_48000_hop300"

Denoise

$ bash submit_denoise.sh --stage 2 \ --encoder "denoise/symAD_vctk_48000_hop300" --decoder "vocoder/AudioDec_v1_symAD_vctk_48000_hop300_clean" --encoder_checkpoint 200000 --decoder_checkpoint 500000 --subset "noisy_test"

Stream demo w/ GPU

$ python demoStream.py --tx_cuda 0 --rx_cuda 0 --input_device 1 --output_device 4 --model vctk_denoise

Codec demo w/ files

$ python demoFile.py -i xxx.wav -o ooo.wav --model vctk_denoise


## Citation

If you find the code helpful, please cite the following article.

@INPROCEEDINGS{10096509, author={Wu, Yi-Chiao and Gebru, Israel D. and Marković, Dejan and Richard, Alexander}, booktitle={ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, title={{A}udio{D}ec: An Open-Source Streaming High-Fidelity Neural Audio Codec}, year={2023}, doi={10.1109/ICASSP49357.2023.10096509}}



## References
The AudioDec repository is developed based on the following repositories.

- [kan-bayashi/ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)
- [r9y9/wavenet_vocoder](https://github.com/r9y9/wavenet_vocoder)
- [jik876/hifi-gan](https://github.com/jik876/hifi-gan)
- [lucidrains/vector-quantize-pytorch](https://github.com/lucidrains/vector-quantize-pytorch)
- [chomeyama/SiFiGAN](https://github.com/chomeyama/SiFiGAN)

## License
The majority of "AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec" is licensed under CC-BY-NC, however, portions of the project are available under separate license terms: https://github.com/kan-bayashi/ParallelWaveGAN, https://github.com/lucidrains/vector-quantize-pytorch, https://github.com/jik876/hifi-gan, https://github.com/r9y9/wavenet_vocoder, and https://github.com/chomeyama/SiFiGAN are licensed under the MIT license.

## FQ&A
1. **Have you compared AudioDec with Encodec?**  
 Please refer to the [discussion](https://github.com/facebookresearch/AudioDec/issues/1).
2. **Have you compared AudioDec with other non-neural-network codecs such as Opus?**  
Since this paper focuses on providing a well-developed streamable neural codec implementation with an efficient training paradigm and modularized architecture, we only compared AudioDec with SoundStream.
3. **Can you also release the pre-trained discriminators?**  
For many applications such as denoising, updating only the encoder achieves almost the same performance as updating the whole model. For applications involving decoder updating such as binaural rending, it might be better to design specific discriminators for that application. Therefore, we release only the generators.
4. **Can AudioDec encode/decode multi-channel signals?**  
Yes, you can train a MIMO model by changing the input_channels and output_channels in the config. One lesson I learned in training a MIMO model is that although the generator is MIMO, reshaping the generator output signal to mono for the following discriminator will markedly improve the MIMO audio quality.