Liujingxiu23 opened 1 week ago
Hi,
I haven't tried it yet, but you can easily try a bigger downsampling rate with my released code. Also, please note that increasing downsampling rate will reduce the bitrate.
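To make the bitrate trade-off concrete: for a single-codebook codec, the bitrate is the token rate (sample rate divided by the downsampling rate) times the bits per token. A quick sketch below; the 16 kHz sample rate and the 8192-entry codebook size are assumptions for illustration, not confirmed values from this repo:

```python
import math

def codec_bitrate_bps(sample_rate, downsample_rate, codebook_size, num_codebooks=1):
    """Bitrate in bits/s: tokens per second times bits per token."""
    token_rate = sample_rate / downsample_rate          # frames (tokens) per second
    bits_per_token = math.log2(codebook_size)           # bits needed to index the codebook
    return token_rate * bits_per_token * num_codebooks

# Assumed config: 16 kHz audio, 8192-entry codebook, single codebook.
print(codec_bitrate_bps(16000, 200, 8192))  # 80 tokens/s * 13 bits = 1040.0 bps
print(codec_bitrate_bps(16000, 640, 8192))  # 25 tokens/s * 13 bits = 325.0 bps
```

So going from a downsampling rate of 200 to 640 cuts the bitrate by the same factor of 3.2, which is why quality tends to drop.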
@Aria-K-Alethia Thank you for your reply. Did you test the CER of the reconstructed waveforms? I tried to train another single-codebook model with a bigger downsampling rate, for example 640, and it seems that mispronunciations sometimes occur.
I didn't test CER, but note that BigCodec has the best STOI score, as shown in the paper. I have listened to many samples generated by BigCodec, and I never encountered a mispronunciation.
@Aria-K-Alethia Thank you for your reply! I tested the CER; the value is low, so the performance is excellent! Another question: have you used this codec for downstream tasks, for example LLM / diffusion / flow-matching based text-to-speech?
Glad to hear it! As for the question: it should certainly be possible for any downstream task, because as long as the codec can faithfully reconstruct the speech, the tokens can be assumed to contain all the information of the reconstructed speech.
Low CER? What's the downsampling rate of the model you tested, 200 or 640? @Liujingxiu23
@wincing2 The CER of the model with hop length 200 is good. For 640, the CER is high.
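In case it helps others reproduce this check: CER is usually computed by transcribing the reconstructed audio with an ASR model (e.g. Whisper) and comparing the hypothesis against the reference text. A minimal Levenshtein-based CER is sketched below; the ASR step itself is omitted, and libraries like jiwer provide an equivalent off-the-shelf:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: (substitutions + insertions + deletions) / len(reference)."""
    ref, hyp = list(reference), list(hypothesis)
    # Row-by-row dynamic-programming edit distance over characters.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1] / max(len(ref), 1)

print(cer("hello world", "hello world"))  # 0.0
print(cer("hello", "hallo"))              # 0.2 (one substitution out of five chars)
```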
In the paper, the total downsampling rate is 200. Did you try a bigger downsampling rate, for example 640? And how is the performance?