xiaoli1996 closed this issue 2 years ago
Is the final bitrate of the VQ-VAE 800bps codec the sum of the pitch bitrate and the content bitrate?
Edited:
We use the entropy of a simple English phoneme classifier: if the average entropy over a sequence was high, the classifier struggled to identify clear phones, and therefore the sequence was considered dirty. The full procedure is described here: https://hal.archives-ouvertes.fr/hal-03070411/document.
The full list of filenames is available in the repo: https://github.com/facebookresearch/speech-resynthesis/tree/main/datasets/LibriLight
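For readers who want to try the entropy-based filtering idea described above, here is a minimal sketch. It assumes you already have per-frame phoneme posteriors from some classifier. The function names and the threshold value are hypothetical and are not taken from the paper or the repo, so treat this as an illustration rather than the actual pipeline.

```python
import math
from typing import Sequence


def frame_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy, in bits, of one frame's phoneme posterior."""
    # Zero-probability classes contribute nothing to the entropy.
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def mean_sequence_entropy(posteriors: Sequence[Sequence[float]]) -> float:
    """Average per-frame entropy over a whole sequence."""
    if not posteriors:
        raise ValueError("sequence has no frames")
    return sum(frame_entropy(frame) for frame in posteriors) / len(posteriors)


def is_dirty(posteriors: Sequence[Sequence[float]], threshold_bits: float) -> bool:
    """Flag a sequence whose average entropy exceeds a (hypothetical) threshold."""
    return mean_sequence_entropy(posteriors) > threshold_bits


# Toy posteriors over 4 phoneme classes, 3 frames each.
confident = [[1.0, 0.0, 0.0, 0.0]] * 3      # classifier is certain
confused = [[0.25, 0.25, 0.25, 0.25]] * 3   # classifier is maximally unsure

print(mean_sequence_entropy(confident), is_dirty(confident, threshold_bits=1.0))
print(mean_sequence_entropy(confused), is_dirty(confused, threshold_bits=1.0))
```

The threshold of 1.0 bit is only there to make the toy example work. A real cutoff would have to be chosen by looking at the entropy distribution over the dataset.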
- The vqvae encoder was trained on 6K hours from libri-light, built from the small+medium sections of the audio.
- The final bitrate of vqvae+f0 is 865bps: 800bps for the vqvae stream and 65bps for the f0 stream.
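To make the bitrate arithmetic above concrete, here is a small sketch. The two per-stream numbers (800 and 65) come from this thread. The `stream_bitrate` helper uses the usual formula for a stream of discrete codes (codes per second times bits per code), and the frame rate and codebook size in its example are hypothetical values chosen for illustration, not the configuration used in the paper.

```python
import math
from typing import Dict


def stream_bitrate(codes_per_second: float, codebook_size: int) -> float:
    """Bitrate in bps of a discrete code stream with uniform-length codes."""
    return codes_per_second * math.log2(codebook_size)


def total_bitrate(streams_bps: Dict[str, float]) -> float:
    """Total bitrate is simply the sum of the per-stream bitrates."""
    return sum(streams_bps.values())


# Numbers quoted in this thread: content (vqvae) stream plus f0 stream.
print(total_bitrate({"vqvae": 800, "f0": 65}))  # 865

# Hypothetical stream, NOT the paper's setup: 50 codes/s, 1024-entry codebook.
print(stream_bitrate(codes_per_second=50, codebook_size=1024))  # 500.0
```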
Thank you very much.
> - The vqvae encoder was trained on 6K hours from libri-light, built from the small+medium sections of the audio.
> - The final bitrate of vqvae+f0 is 865bps: 800bps for the vqvae stream and 65bps for the f0 stream.

In the MUSHRA subjective results in your paper, does the VQ-VAE 800bps model exclude the F0 encoder and speaker encoder?
Re libri-light - see the edited comment above with more details.
Yes, the vqvae model evaluated in our MUSHRA experiment is without F0 and speaker encoders.
Our MUSHRA experiments evaluated our method as an ultra-lightweight speech codec. Therefore, we compared against the method described in this paper: https://arxiv.org/pdf/1910.06464.pdf
> Yes, the vqvae model evaluated in our MUSHRA experiment is without F0 and speaker encoders.
Thank you again
In the paper, "The VQ-VAE model employs the HiFiGAN decoder trained on the LibriLight dataset to match the amount of data reported in [34]." How many hours of LibriLight were used in the training?