I am curious about the architecture details of the model, especially the convnet:
Does it use any down-sampling convolutions, as in ConvNeXt? Without down-sampling, the output tensor could be as large as (32,000, 512), which is huge. Will that become a problem?
Is the model isotropic, i.e., does each layer output the same number of channels?
I tried my best but could not find these details anywhere.
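To make the size concern concrete, here is a rough back-of-the-envelope sketch. It assumes float32 activations and "same"-style padding (so a stride-s convolution maps length L to ceil(L / s)). The stride schedule (4, 2, 2, 2) is only a hypothetical ConvNeXt-like example, not something taken from this model, and the helper names are mine.

```python
def activation_bytes(length, channels, bytes_per_elem=4):
    """Memory for one (length, channels) activation tensor (float32 assumed)."""
    return length * channels * bytes_per_elem


def downsampled_length(length, strides):
    """Sequence length after a chain of strided convs, assuming 'same'-style padding."""
    for s in strides:
        length = -(-length // s)  # ceiling division
    return length


MIB = 1024 ** 2

# Isotropic case: no down-sampling, so every layer emits (32000, 512).
full = activation_bytes(32_000, 512)

# Hypothetical staged case: stem stride 4, then three stride-2 stages.
short_len = downsampled_length(32_000, (4, 2, 2, 2))
small = activation_bytes(short_len, 512)

print(f"no down-sampling : {full / MIB:.2f} MiB per sample per layer")
print(f"total stride 32  : length {short_len}, {small / MIB:.2f} MiB per sample per layer")
```

Under these assumptions a single (32,000, 512) activation is about 62.5 MiB per sample per layer, before multiplying by batch size and by the number of layers kept for backpropagation, which is why I am asking whether down-sampling is used.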
Hi, @Lupin1998
Thanks for the awesome work!
Thanks!