Aria-K-Alethia / BigCodec

Official implementation of the paper "BigCodec: Pushing the Limits of Low-Bitrate Neural Speech Codec"
https://aria-k-alethia.github.io/bigcodec-demo/
MIT License

about the down sample rate #5

Open Liujingxiu23 opened 1 week ago

Liujingxiu23 commented 1 week ago

In the paper, the total downsampling rate is 200. Did you try a larger downsampling rate, for example 640? If so, how was the performance?

Aria-K-Alethia commented 1 week ago

Hi,

I haven't tried it yet, but you can easily try a larger downsampling rate with my released code. Also, please note that increasing the downsampling rate reduces the bitrate, since fewer tokens are produced per second of audio.
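To make the bitrate point concrete, here is a rough sketch of the arithmetic. The 16 kHz sample rate and the single 8192-entry codebook are assumptions used for illustration and are not stated in this thread; the function name is hypothetical and is not part of the released code.

```python
import math


def codec_bitrate_bps(sample_rate_hz, hop_length, codebook_size, num_codebooks=1):
    """Estimate the bitrate of a VQ-based codec in bits per second.

    hop_length is the total downsampling rate: one token frame is
    emitted for every hop_length input samples.
    """
    frame_rate_hz = sample_rate_hz / hop_length          # token frames per second
    bits_per_frame = num_codebooks * math.log2(codebook_size)
    return frame_rate_hz * bits_per_frame


# Assumed settings: 16 kHz audio, one codebook with 8192 entries (13 bits).
for hop in (200, 640):
    bps = codec_bitrate_bps(16000, hop, 8192)
    print(f"hop={hop}: {16000 / hop:.0f} frames/s, {bps:.0f} bps")
```

Under these assumptions, going from a hop of 200 to 640 cuts the frame rate from 80 to 25 frames per second, so each token has to describe 3.2 times as much audio.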

Liujingxiu23 commented 1 week ago

@Aria-K-Alethia Thank you for your reply. Did you test the CER of the reconstructed waveforms? I tried training another single-codec model with a larger downsampling rate, for example 640, and mispronunciation seems to occur sometimes.

Aria-K-Alethia commented 1 week ago

I didn't test CER, but note that BigCodec has the best STOI score, as shown in the paper. I have listened to many samples generated by BigCodec, and I have never encountered mispronunciation.

Liujingxiu23 commented 6 days ago

@Aria-K-Alethia Thank you for your reply! I tested the CER; the value is low and the performance is excellent! Another question: have you used this codec for any downstream task, for example LLM-, diffusion-, or flow-matching-based text-to-speech?

Aria-K-Alethia commented 5 days ago

Glad to hear it! As for your question, the codec should be usable for any downstream task. As long as the codec can reconstruct the speech clearly, the tokens can be assumed to contain all the information in the reconstructed speech.

wincing2 commented 5 days ago

@Liujingxiu23 You mentioned a low CER. What is the downsampling rate of the model you tested, 200 or 640?

Liujingxiu23 commented 2 days ago

@wincing2 The CER of the model with hop length 200 is good. For 640, the CER is high.
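For anyone who wants to repeat this comparison, here is a minimal sketch of how CER can be computed. It assumes you already have a reference transcript and an ASR transcript of the reconstructed audio; the ASR step is not shown, and the function names are hypothetical rather than taken from the BigCodec repository.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length.

    Spaces are removed first, which is one common convention; other
    text normalization choices will change the number.
    """
    ref = reference.replace(" ", "")
    hyp = hypothesis.replace(" ", "")
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


# Toy strings for illustration only, not real evaluation data.
print(cer("the cat sat", "the cat sat"))
print(cer("the cat sat", "the bat sat"))
```

Running the same ASR model and the same text normalization on the outputs of both the hop-200 and hop-640 models keeps the two CER values comparable.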