Aria-K-Alethia / BigCodec

Official implementation of the paper "BigCodec: Pushing the Limits of Low-Bitrate Neural Speech Codec"
https://aria-k-alethia.github.io/bigcodec-demo/
MIT License

Questions about details #3

Closed · hbwu-ntu closed this issue 2 months ago

hbwu-ntu commented 2 months ago

Hi @Aria-K-Alethia, thank you for the amazing work. May I ask a few questions:

  1. What is the model size for small-enc in Table 3?
  2. What is the receptive field of the encoder in seconds?
  3. For w/o LSTM, what if you increased the dilation or kernel size of each conv layer to enlarge the receptive field to a reasonable level (such as 200 ms)? I suspect that with a large enough receptive field, the model w/o LSTM may perform similarly to the one with LSTM, as has been observed in some speech enhancement models (UNet vs. CRN). Maybe it is the receptive field, rather than the temporal dependency modeling, that matters?
Aria-K-Alethia commented 2 months ago

Hi,

Thank you for being interested in our work.

  1. About 80M parameters.
  2. The receptive field can be considered infinite, since an LSTM is used (see the sketch after this list).
  3. In the w/o LSTM setting, the receptive field of the CNN is not increased, and we didn't conduct a formal experiment to verify your assumption, but I think it is possible.
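To make point 2 concrete, here is a minimal sketch of the conv-stack + LSTM pattern being discussed. This is not BigCodec's actual architecture: the channel counts, kernel sizes, and strides are illustrative placeholders. Each strided conv sees only a finite window of the waveform, but the LSTM's recurrent state carries information from every earlier frame, which is why the overall receptive field is effectively unbounded.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Illustrative conv stack + LSTM; hyperparameters are placeholders."""
    def __init__(self, channels=64):
        super().__init__()
        # Each strided conv only sees a finite window of the waveform,
        # so the receptive field of this stack is finite and computable.
        self.convs = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=4, stride=2, padding=1),
            nn.ELU(),
            nn.Conv1d(channels, channels, kernel_size=4, stride=2, padding=1),
            nn.ELU(),
        )
        # The LSTM's recurrent state carries information from every earlier
        # frame, so once it is added the effective receptive field of the
        # whole encoder is unbounded toward the past.
        self.lstm = nn.LSTM(channels, channels, batch_first=True)

    def forward(self, wav):                     # wav: (batch, 1, samples)
        feats = self.convs(wav)                 # (batch, channels, frames)
        out, _ = self.lstm(feats.transpose(1, 2))
        return out.transpose(1, 2)              # (batch, channels, frames)

# Usage: encode one second of 16 kHz audio.
enc = TinyEncoder()
frames = enc(torch.randn(1, 1, 16000))
print(frames.shape)  # torch.Size([1, 64, 4000])
```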
hbwu-ntu commented 2 months ago

Hi @Aria-K-Alethia, thank you for the reply. One follow-up question on point 2: what is the receptive field before the RNN layer?

Aria-K-Alethia commented 2 months ago

You can compute it yourself :smiley:
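For readers following along, here is a minimal sketch of that computation, using the standard receptive-field recurrence for stacked strided/dilated 1-D convolutions. The layer configuration below is a hypothetical example, not BigCodec's released hyperparameters; substitute the kernel sizes, strides, and dilations from the actual encoder code.

```python
def receptive_field(layers, sample_rate=16000):
    """Receptive field of a conv stack.

    layers: list of (kernel_size, stride, dilation) tuples, input to output.
    Returns (receptive field in samples, receptive field in seconds).
    """
    rf = 1    # receptive field in input samples
    jump = 1  # distance between adjacent output positions, in input samples
    for kernel, stride, dilation in layers:
        # Each layer widens the window by (kernel - 1) * dilation taps,
        # measured in units of the current hop size.
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf, rf / sample_rate

# Hypothetical example: five strided conv blocks with kernel = 2 * stride.
example = [(4, 2, 1), (4, 2, 1), (8, 4, 1), (8, 4, 1), (10, 5, 1)]
samples, seconds = receptive_field(example)
print(f"receptive field: {samples} samples ({seconds * 1000:.1f} ms)")
# -> receptive field: 726 samples (45.4 ms)
```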