Aria-K-Alethia / BigCodec

Official implementation of the paper "BigCodec: Pushing the Limits of Low-Bitrate Neural Speech Codec"
https://aria-k-alethia.github.io/bigcodec-demo/
MIT License

Questions about details #3

Closed hbwu-ntu closed 3 weeks ago

hbwu-ntu commented 1 month ago

Hi @Aria-K-Alethia thank you for the amazing work. May I ask several questions:

  1. What is the model size for small-enc in Table 3?
  2. What is the receptive field of the encoder in terms of seconds?
  3. For w/o LSTM, what if you increase the dilation or kernel size of each conv layer to enlarge the receptive field to a reasonable level (such as 200 ms)? I suspect that if the receptive field is large enough, the model w/o LSTM may perform similarly to the one with LSTM, as shown by some speech enhancement models (UNet vs. CRN). Maybe the receptive field, rather than temporal dependency, is what matters?
Aria-K-Alethia commented 1 month ago

Hi,

Thank you for being interested in our work.

  1. About 80M
  2. The receptive field can be considered infinite, since an LSTM is used.
  3. In the w/o LSTM setting the receptive field of the CNN is not increased, and we didn't conduct a formal experiment to verify your assumption. But I think it is possible.
hbwu-ntu commented 1 month ago

Hi @Aria-K-Alethia thank you for the reply. One follow-up question on point 2: what is the receptive field before the RNN layer?

Aria-K-Alethia commented 1 month ago

You can compute it by yourself :smiley:
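For readers following along, here is a minimal sketch of the standard analytic receptive-field calculation for a stack of strided/dilated 1-D conv layers. The layer hyperparameters below are illustrative placeholders, not BigCodec's actual encoder configuration:

```python
def conv_stack_receptive_field(layers, sample_rate=16000):
    """Compute the receptive field of a stack of 1-D conv layers.

    layers: list of (kernel_size, stride, dilation) tuples,
            in input-to-output order.
    Returns (receptive field in input samples, receptive field in seconds).
    """
    rf = 1    # receptive field in input samples
    jump = 1  # cumulative stride: distance (in input samples) between output taps
    for kernel_size, stride, dilation in layers:
        rf += (kernel_size - 1) * dilation * jump
        jump *= stride
    return rf, rf / sample_rate

# Hypothetical example: an input conv followed by four downsampling blocks,
# 16 kHz audio. These numbers are made up for illustration.
layers = [(7, 1, 1), (4, 2, 1), (4, 2, 1), (8, 4, 1), (8, 5, 1)]
rf_samples, rf_seconds = conv_stack_receptive_field(layers)
print(f"{rf_samples} samples = {rf_seconds * 1000:.1f} ms")  # 156 samples = 9.8 ms
```

Because each layer's contribution scales with its dilation times the cumulative stride before it, enlarging the dilation of the deeper layers is the cheapest way to push the CNN-only receptive field toward a target like 200 ms (3200 samples at 16 kHz), which is the modification proposed in question 3 above.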