For VQ-VAE coding at 800 bps, were the "small" and "medium" subsets (about 6000 hours of audio) used for training?

facebookresearch / speech-resynthesis

An official reimplementation of the method described in the INTERSPEECH 2021 paper - Speech Resynthesis from Discrete Disentangled Self-Supervised Representations.

For VQ-VAE coding at 800 bps, were the "small" and "medium" subsets (about 6000 hours of audio) used for training? #6

Closed xiaoli1996 closed 2 years ago

xiaoli1996 commented 2 years ago

In the paper, "The VQ-VAE model employs the HiFiGAN decoder trained on the LibriLight dataset to match the amount of data reported in [34]." How many hours of LibriLight were used in the training?

xiaoli1996 commented 2 years ago

Is the final bitrate of the VQ-VAE 800 bps codec the sum of the pitch bitrate and the content bitrate?

adampolyak commented 2 years ago

Edited:

For the libri-light dataset - we use a specific subset, which we call "clean" in the paper. The dataset was subsampled using the following procedure:

We use the entropy of a simple English phoneme classifier: if the average entropy over a sequence is high, the classifier struggled to identify clear phones, and the sequence is therefore considered dirty. The full procedure is described here: https://hal.archives-ouvertes.fr/hal-03070411/document.

The full list of filenames is available in the repo: https://github.com/facebookresearch/speech-resynthesis/tree/main/datasets/LibriLight
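The entropy-based filtering described above can be sketched roughly as follows. This is only an illustration, not the code used to build the subset: the function names, the toy posteriors, and the threshold of 1.0 nats are all assumptions, and the actual classifier and cutoff are the ones described in the linked document.

```python
import math


def mean_frame_entropy(posteriors):
    """Average Shannon entropy (in nats) of per-frame phoneme posteriors.

    `posteriors` is a list of frames; each frame is a list of class
    probabilities that sums to 1 (hypothetical classifier output).
    """
    total = 0.0
    for frame in posteriors:
        total += -sum(p * math.log(p) for p in frame if p > 0.0)
    return total / len(posteriors)


def is_clean(posteriors, threshold):
    """Keep an utterance when the classifier is confident on average."""
    return mean_frame_entropy(posteriors) <= threshold


# Toy data: 3 frames over 4 phoneme classes (assumed values).
confident = [[0.97, 0.01, 0.01, 0.01]] * 3   # sharp posteriors, low entropy
uncertain = [[0.25, 0.25, 0.25, 0.25]] * 3   # flat posteriors, high entropy

THRESHOLD = 1.0  # hypothetical cutoff in nats, not taken from the paper

print(is_clean(confident, THRESHOLD))
print(is_clean(uncertain, THRESHOLD))
```

With these toy inputs the sharp posteriors fall below the cutoff and the flat ones (entropy log 4) fall above it, which is the behavior the filtering relies on.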

The final bitrate of the vqvae+f0 is 865 bps: 800 bps for vqvae and 65 bps for the f0 stream.
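As a quick check of the arithmetic, the two streams simply add up. The helper below also shows the usual way a discrete stream's bitrate is computed under fixed-length coding (frames per second times bits per code). The frame rate and codebook size in the example are hypothetical values chosen only because they give 800 bps; they are not the configuration reported in the paper.

```python
import math


def stream_bitrate(frames_per_second, codebook_size):
    """Bitrate in bps of a discrete code stream, assuming each code is
    sent with a fixed log2(codebook_size) bits and no entropy coding."""
    return frames_per_second * math.log2(codebook_size)


# Figures stated in this thread.
VQVAE_BPS = 800
F0_BPS = 65
total_bps = VQVAE_BPS + F0_BPS
print(total_bps)

# Hypothetical configuration (assumption): 100 codes/s from a 256-entry codebook.
print(stream_bitrate(100, 256))
```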

xiaoli1996 commented 2 years ago

> The vqvae encoder was trained on 6K hours from libri-light, built from the small+medium sections of the audio.
>
> The final bitrate of the vqvae+f0 is 865. 800bps for vqvae and 65bps for the f0 stream.

Thank you very much.

xiaoli1996 commented 2 years ago

> The vqvae encoder was trained on 6K hours from libri-light, built from the small+medium sections of the audio.
>
> The final bitrate of the vqvae+f0 is 865. 800bps for vqvae and 65bps for the f0 stream.

In the MUSHRA subjective results of your paper, does the VQ-VAE 800 bps model exclude the F0 encoder and the speaker encoder?

adampolyak commented 2 years ago

Re libri-light - see the edited comment above with more details.

Yes, the vqvae model evaluated in our MUSHRA experiment is without F0 and speaker encoders.

Our MUSHRA experiments evaluated our method as an ultra-lightweight speech codec. Therefore, we compared against the method specified in this paper: https://arxiv.org/pdf/1910.06464.pdf

xiaoli1996 commented 2 years ago

> Yes, the vqvae model evaluated in our MUSHRA experiment is without F0 and speaker encoders.

Thank you again