This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
A good audio codec for live applications such as telecommunication is characterized by three key properties: (1) compression, i.e. the bitrate required to transmit the signal should be as low as possible; (2) latency, i.e. encoding and decoding must be fast enough to enable communication with little or no noticeable delay; and (3) reconstruction quality of the signal. In this work, we propose an open-source, streamable, and real-time neural audio codec that achieves strong performance along all three axes: it can reconstruct highly natural-sounding 48 kHz speech signals while operating at only 12 kbps and running with less than 6 ms (GPU)/10 ms (CPU) latency. An efficient training paradigm is also demonstrated for developing such neural audio codecs for real-world scenarios. [paper] [demo]
This repository is tested on Ubuntu 20.04 using a V100 and the following settings.
```bash
# List the available I/O devices
$ python -m sounddevice
# The LibriTTS model is recommended for arbitrary microphones because it is more robust to microphone channel mismatches.
# Set --input_device and --output_device according to the list of I/O devices
$ python demoStream.py --tx_cuda 0 --rx_cuda 0 --input_device 1 --output_device 4 --model libritts_v1
$ python demoStream.py --tx_cuda -1 --rx_cuda -1 --input_device 1 --output_device 4 --model libritts_sym
```
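The two streaming calls above differ only in the device indices and the model name. As a minimal sketch, the command line can be assembled in Python before launching it. `build_stream_command` is a hypothetical helper, not part of the repository; it uses only the flags shown above, and the default of `-1` for `--tx_cuda`/`--rx_cuda` assumes that this value selects the CPU.

```python
import shlex


def build_stream_command(input_device, output_device, model, tx_cuda=-1, rx_cuda=-1):
    """Return the demoStream.py call as an argument list (hypothetical helper)."""
    return [
        "python", "demoStream.py",
        "--tx_cuda", str(tx_cuda),  # assumption: -1 selects the CPU
        "--rx_cuda", str(rx_cuda),
        "--input_device", str(input_device),
        "--output_device", str(output_device),
        "--model", model,
    ]


cmd = build_stream_command(input_device=1, output_device=4, model="libritts_sym")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` from the AudioDec project directory.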
## Run codec demo with files
1. Please download the whole [exp](https://github.com/facebookresearch/AudioDec/releases/download/pretrain_models_v02/exp.zip) folder and put it in the AudioDec project directory.
2. Run the demo
```bash
## VCTK 48000Hz models
$ python demoFile.py --model vctk_v1 -i xxx.wav -o ooo.wav
## LibriTTS 24000Hz model
$ python demoFile.py --model libritts_v1 -i xxx.wav -o ooo.wav
```
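The comments in the file demo pair `vctk_v1` with 48000 Hz audio and `libritts_v1` with 24000 Hz audio. Assuming the input file should match the sampling rate of the chosen model, the following sketch reads the rate from a PCM WAV header with the standard `wave` module and returns the corresponding model name. `pick_model` and `MODEL_BY_SAMPLE_RATE` are hypothetical names, not part of the repository.

```python
import wave

# Sampling rates taken from the demo comments; the mapping is illustrative only.
MODEL_BY_SAMPLE_RATE = {48000: "vctk_v1", 24000: "libritts_v1"}


def pick_model(wav_path):
    """Return the demo model name whose sampling rate matches the WAV file."""
    with wave.open(wav_path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
    if sample_rate not in MODEL_BY_SAMPLE_RATE:
        raise ValueError(f"no demo model listed for {sample_rate} Hz")
    return MODEL_BY_SAMPLE_RATE[sample_rate]
```

For example, `pick_model("xxx.wav")` gives the value to pass to `--model`. The `wave` module reads only uncompressed PCM WAV files.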
Assign the corresponding analyzer and stats in **config/statistic/symAD_vctk_48000_hop300_clean.yaml**.
```bash
# stage 0: training autoencoder from scratch
# stage 1: extracting statistics
# stage 2: training vocoder from scratch
# stage 3: testing (symAE)
# stage 4: testing (AE + Vocoder)
$ bash submit_codec.sh --start 0 --stop 4 \
--autoencoder "autoencoder/symAD_vctk_48000_hop300" \
--statistic "statistic/symAD_vctk_48000_hop300_clean" \
--vocoder "vocoder/AudioDec_v1_symAD_vctk_48000_hop300_clean"
```
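The stage comments and the `--start 0 --stop 4` call above suggest that `submit_codec.sh` runs a contiguous range of stages. The sketch below only lists which stages a given range would cover. It assumes that `--stop` is inclusive, and `stages_in_range` is a hypothetical helper, not part of the repository.

```python
# Stage descriptions copied from the comments in the usage example.
STAGES = {
    0: "training autoencoder from scratch",
    1: "extracting statistics",
    2: "training vocoder from scratch",
    3: "testing (symAE)",
    4: "testing (AE + Vocoder)",
}


def stages_in_range(start, stop):
    """List the stages from start to stop, assuming stop is inclusive."""
    if start > stop:
        raise ValueError("start must not exceed stop")
    if start not in STAGES or stop not in STAGES:
        raise ValueError("stage numbers must be between 0 and 4")
    return [f"stage {n}: {STAGES[n]}" for n in range(start, stop + 1)]


# Example: skip autoencoder training and statistics, run the rest.
for line in stages_in_range(2, 4):
    print(line)
```

Under this assumption, `--start 2 --stop 4` would train the vocoder and then run both tests with an already trained autoencoder.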
## Training and testing only the AutoEncoder
1. Prepare the training/validation/test utterances and modify the paths
2. Follow the usage instructions in **submit_autoencoder.sh** to run the training and testing
```bash
# Train AutoEncoder from scratch
$ bash submit_autoencoder.sh --stage 0 \
--tag_name "autoencoder/symAD_vctk_48000_hop300"
# Resume AutoEncoder from previous iterations
$ bash submit_autoencoder.sh --stage 1 \
--tag_name "autoencoder/symAD_vctk_48000_hop300" \
--resumepoint 200000
# Test AutoEncoder
$ bash submit_autoencoder.sh --stage 2 \
--tag_name "autoencoder/symAD_vctk_48000_hop300" \
--subset "clean_test"
```
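All pre-trained models are available in the [exp](https://github.com/facebookresearch/AudioDec/releases/download/pretrain_models_v02/exp.zip) folder (only the generators are provided).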
AutoEncoder | Corpus | Fs | Bitrate | Path |
---|---|---|---|---|
symAD | VCTK | 48 kHz | 24 kbps | exp/autoencoder/symAD_c16_vctk_48000_hop320 |
symAAD | VCTK | 48 kHz | 12.8 kbps | exp/autoencoder/symAAD_vctk_48000_hop300 |
symAD | VCTK | 48 kHz | 12.8 kbps | exp/autoencoder/symAD_vctk_48000_hop300 |
symAD_univ | VCTK | 48 kHz | 12.8 kbps | exp/autoencoder/symADuniv_vctk_48000_hop300 |
symAD | LibriTTS | 24 kHz | 6.4 kbps | exp/autoencoder/symAD_libritts_24000_hop300 |
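The bitrates in the table above can be sanity-checked from the frame rate and the number of bits per frame. The sketch below is a rough illustration, not code from the repository. It assumes that the `hop300`/`hop320` suffix in each path is the hop size in samples, that every codebook holds 1024 entries (10 bits per index), and that the models use 8 codebooks (16 for the `c16` variant). These values are assumptions chosen because they reproduce the listed bitrates.

```python
import math


def bitrate_kbps(sample_rate_hz, hop_size, num_codebooks, codebook_size=1024):
    """Bitrate of a frame-based quantizer: frames per second times bits per frame."""
    frames_per_second = sample_rate_hz / hop_size
    bits_per_frame = num_codebooks * math.log2(codebook_size)
    return frames_per_second * bits_per_frame / 1000


print(bitrate_kbps(48000, 300, 8))   # VCTK, hop 300
print(bitrate_kbps(48000, 320, 16))  # VCTK, c16, hop 320
print(bitrate_kbps(24000, 300, 8))   # LibriTTS, hop 300
```

Under these assumptions the three calls give 12.8, 24.0, and 6.4 kbps, matching the Bitrate column.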
Vocoder | Corpus | Fs | Path |
---|---|---|---|
AD v0 | VCTK | 48 kHz | exp/vocoder/AudioDec_v0_symAD_vctk_48000_hop300_clean |
AD v1 | VCTK | 48 kHz | exp/vocoder/AudioDec_v1_symAD_vctk_48000_hop300_clean |
AD v2 | VCTK | 48 kHz | exp/vocoder/AudioDec_v2_symAD_vctk_48000_hop300_clean |
AD_univ | VCTK | 48 kHz | exp/vocoder/AudioDec_v3_symADuniv_vctk_48000_hop300_clean |
AD v1 | LibriTTS | 24 kHz | exp/vocoder/AudioDec_v1_symAD_libritts_24000_hop300_clean |
```bash
# Update the Encoder for denoising
$ bash submit_denoise.sh --stage 0 \
--tag_name "denoise/symAD_vctk_48000_hop300"
$ bash submit_denoise.sh --stage 2 \
--encoder "denoise/symAD_vctk_48000_hop300" \
--decoder "vocoder/AudioDec_v1_symAD_vctk_48000_hop300_clean" \
--encoder_checkpoint 200000 \
--decoder_checkpoint 500000 \
--subset "noisy_test"
$ python demoStream.py --tx_cuda 0 --rx_cuda 0 --input_device 1 --output_device 4 --model vctk_denoise
$ python demoFile.py -i xxx.wav -o ooo.wav --model vctk_denoise
```
## Citation
If you find the code helpful, please cite the following article.
```
@INPROCEEDINGS{10096509,
  author={Wu, Yi-Chiao and Gebru, Israel D. and Marković, Dejan and Richard, Alexander},
  booktitle={ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={{A}udio{D}ec: An Open-Source Streaming High-Fidelity Neural Audio Codec},
  year={2023},
  doi={10.1109/ICASSP49357.2023.10096509}}
```
## References
The AudioDec repository is developed based on the following repositories.
- [kan-bayashi/ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)
- [r9y9/wavenet_vocoder](https://github.com/r9y9/wavenet_vocoder)
- [jik876/hifi-gan](https://github.com/jik876/hifi-gan)
- [lucidrains/vector-quantize-pytorch](https://github.com/lucidrains/vector-quantize-pytorch)
- [chomeyama/SiFiGAN](https://github.com/chomeyama/SiFiGAN)
## License
The majority of "AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec" is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: https://github.com/kan-bayashi/ParallelWaveGAN, https://github.com/lucidrains/vector-quantize-pytorch, https://github.com/jik876/hifi-gan, https://github.com/r9y9/wavenet_vocoder, and https://github.com/chomeyama/SiFiGAN are licensed under the MIT license.
## FAQ
1. **Have you compared AudioDec with Encodec?**
Please refer to the [discussion](https://github.com/facebookresearch/AudioDec/issues/1).
2. **Have you compared AudioDec with other non-neural-network codecs such as Opus?**
Since this paper focuses on providing a well-developed streamable neural codec implementation with an efficient training paradigm and modularized architecture, we only compared AudioDec with SoundStream.
3. **Can you also release the pre-trained discriminators?**
For many applications such as denoising, updating only the encoder achieves almost the same performance as updating the whole model. For applications that involve updating the decoder, such as binaural rendering, it might be better to design discriminators specific to that application. Therefore, we release only the generators.
4. **Can AudioDec encode/decode multi-channel signals?**
Yes, you can train a MIMO model by changing the input_channels and output_channels in the config. One lesson I learned in training a MIMO model is that although the generator is MIMO, reshaping the generator output signal to mono for the following discriminator will markedly improve the MIMO audio quality.
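One possible reading of "reshaping the generator output signal to mono" is to fold the channel axis into the batch axis, so that the discriminator sees each channel as a separate single-channel signal. The pure-Python sketch below illustrates only the shape change on nested lists. `fold_channels_into_batch` is a hypothetical helper, and this may differ from what the AudioDec training code actually does.

```python
def fold_channels_into_batch(batch):
    """Turn a (batch, channels, time) nested list into (batch * channels, 1, time)."""
    return [[channel] for item in batch for channel in item]


# Two stereo items with three samples each: shape (2, 2, 3).
stereo = [
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    [[0.7, 0.8, 0.9], [1.0, 1.1, 1.2]],
]
mono = fold_channels_into_batch(stereo)

# Shape after folding: (4, 1, 3).
print(len(mono), len(mono[0]), len(mono[0][0]))
```

With PyTorch tensors, the same fold for a contiguous tensor `y` of shape (batch, channels, time) could be written as `y.reshape(-1, 1, y.size(-1))`.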