TzuchengChang / NASS

Noise-Aware Speech Separation with Contrastive Learning
14 stars 6 forks source link

compatibility with the paper #3

Closed MordehayM closed 6 months ago

MordehayM commented 6 months ago

Hi, the paper states that the output shape of the sampler is H^2 x H^2, but it appears that this is not the case in the code here: https://github.com/TzuchengChang/NASS/blob/ab98c434a5b51ff0ebfd7d3a97a9876135fa52dd/speechbrain/speechbrain/lobes/models/networks.py#L36

Thanks

TzuchengChang commented 6 months ago

> Hi, the paper states that the output shape of the sampler is H^2 x H^2, but it appears that this is not the case in the code here:
>
> https://github.com/TzuchengChang/NASS/blob/ab98c434a5b51ff0ebfd7d3a97a9876135fa52dd/speechbrain/speechbrain/lobes/models/networks.py#L36
>
> Thanks

This is a calculated result rather than something explicit in the code. For a convolutional layer with a kernel size of H, each step of the convolution covers a square region of area H^2. As you can see, our code includes two convolutional layers, so the overall range of convolution is H^2 * H^2. The output shape here refers to the representation, rather than the intermediate result.

MordehayM commented 6 months ago

So if I understood you correctly, the output shape of the sampler is not H^2 x H^2? I am confused because the paper states that this is the output shape of the sampler 🤔

TzuchengChang commented 6 months ago

> So if I understood you correctly, the output shape of the sampler is not H^2 x H^2? I am confused because the paper states that this is the output shape of the sampler 🤔

Yes, you're right. The expression in the paper is indeed a bit ambiguous.

TzuchengChang commented 6 months ago

> So if I understood you correctly, the output shape of the sampler is not H^2 x H^2? I am confused because the paper states that this is the output shape of the sampler 🤔

A more accurate statement would be that it is equivalent to the shape that can be processed in each sampling step.

MordehayM commented 6 months ago

Since your convolution layers use no padding and unit stride, the output shape for a tensor with shape [F, L] should be [F-H+1, L-H+1]. Then you select only K bins and pass them through the linear layers.
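A minimal sketch of the shape arithmetic above, assuming "valid" (no-padding) convolutions with unit stride as described in this thread. The concrete values of F, L, and H below are hypothetical placeholders for illustration, not taken from the NASS paper:

```python
def valid_conv_out(shape, kernel):
    """Output spatial shape of a no-padding, unit-stride convolution.

    Each dimension shrinks by (kernel - 1): [F, L] -> [F-H+1, L-H+1].
    """
    f, l = shape
    return (f - kernel + 1, l - kernel + 1)

# Hypothetical sizes, for illustration only.
F, L, H = 64, 100, 8

after_one = valid_conv_out((F, L), H)        # (F - H + 1, L - H + 1)
after_two = valid_conv_out(after_one, H)     # (F - 2H + 2, L - 2H + 2)

print(after_one)  # (57, 93)
print(after_two)  # (50, 86)
```

Stacking the two convolutional layers shrinks each dimension by 2(H-1) in total, which is what makes the resulting shape depend on the input size [F, L] rather than being a fixed H^2 x H^2.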