Open XIEchoAH opened 3 years ago
Frame length is the amount of input audio the model must process before it can output its first chunk of audio. The formula comes from the receptive field of 5 stacked convolutions with kernel size 8 and stride 4: that receptive field must be full before the first LSTM step can be computed. The stride is the overall stride of those 5 stacked convolutions, each with a stride of 4, so 4^5.
Note that with 37 ms of input audio, the model will only output 16 ms, due to the overlap in the output of the transposed convolutions. After that, every time it receives another 16 ms, it outputs another 16 ms of audio.
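As a rough illustration of the receptive field and stride described above, here is a minimal sketch. The function names `receptive_field` and `overall_stride` are made up for this example and are not taken from the repository. The results are sample counts at whatever rate the convolutions operate on, so converting them to milliseconds depends on that rate.

```python
def receptive_field(depth: int, kernel_size: int, stride: int) -> int:
    """Input samples needed so that `depth` stacked convolutions yield one output step.

    Works backwards from a single output step: a layer that produces
    `length` steps needs (length - 1) * stride + kernel_size input steps.
    """
    length = 1
    for _ in range(depth):
        length = (length - 1) * stride + kernel_size
    return length


def overall_stride(depth: int, stride: int) -> int:
    """Input samples between two consecutive output steps of the whole stack."""
    return stride ** depth


if __name__ == "__main__":
    # Values from the explanation above: 5 layers, kernel size 8, stride 4.
    print(receptive_field(depth=5, kernel_size=8, stride=4))  # 2388
    print(overall_stride(depth=5, stride=4))                  # 1024
```

The backward recursion is used here because it mirrors the reasoning "how much input do I need for one output step"; a layer-by-layer forward recurrence gives the same numbers.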
Hi, sorry to ask the same question here again.
I followed your explanation in issue #82, which derived the 37 ms (597 / 16k) latency from the U-Net structure.
But I am still confused about the 16 ms length of the output. The formula given above is 4 (stride) ^ 5 (depth). Could you explain this formula in more detail? How does it actually work?
Thank you in advance for your reply.
According to this doc that I read on calculating receptive field length, 4**5 seems to fit the description of the receptive field length.
------
input_layer: n = 1; r = 1; j = 1; start = 0.5
------
------
conv1: n = -1; r = 4; j = 8; start = 5.0
------
------
conv2: n = -2; r = 16; j = 36; start = 27.0
------
------
conv3: n = -2; r = 64; j = 148; start = 99.0
------
------
conv4: n = -2; r = 256; j = 596; start = 387.0
------
------
conv5: n = -2; r = 1024; j = 2388; start = 1539.0
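For comparison, here is a small sketch of the usual layer-by-layer recurrence for unpadded convolutions, assuming kernel size 8 and stride 4 for every layer as stated in this thread. The names `layer_table`, `jump` and `rf` are chosen for this example only. Under this convention the jump (overall stride) after the fifth layer is 4**5 = 1024, while the receptive field is 2388; the labels in the listing above may follow a different convention.

```python
def layer_table(depth: int = 5, kernel_size: int = 8, stride: int = 4):
    """Return (layer, jump, receptive_field) for each stacked convolution.

    jump: distance, in input samples, between two adjacent outputs of the layer.
    receptive_field: number of input samples that one output of the layer sees.
    """
    jump, rf = 1, 1  # the raw input: each sample sees only itself
    rows = []
    for layer in range(1, depth + 1):
        rf = rf + (kernel_size - 1) * jump  # grows by (k - 1) * previous jump
        jump = jump * stride                # each layer multiplies the jump by its stride
        rows.append((layer, jump, rf))
    return rows


if __name__ == "__main__":
    for layer, jump, rf in layer_table():
        print(f"conv{layer}: jump = {jump}; receptive field = {rf}")
```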
The audio is upsampled by a factor of 4 before being fed to the convolutions. Thus, at the original sample rate, the overall stride is 4^5 / 4 = 256 samples, which gives 256 / 16000 = 16 ms.
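To make the arithmetic concrete, here is a hedged sketch that converts the numbers from this thread into milliseconds and models the streaming behaviour in a simplified way. The constant names and the `output_samples` helper are invented for this example. It assumes a 16 kHz sample rate and ignores any extra lookahead or buffering that a real streaming implementation may add.

```python
import math

SAMPLE_RATE = 16_000        # Hz, as in "597 / 16k" in this thread
RESAMPLE = 4                # upsampling factor applied before the convolutions
RECEPTIVE_FIELD_UP = 2388   # samples at the upsampled rate (4 * 597)
STRIDE_UP = 4 ** 5          # 1024 samples at the upsampled rate

# Convert both quantities back to the original sample rate.
frame_length = math.ceil(RECEPTIVE_FIELD_UP / RESAMPLE)  # 597 samples
stride = STRIDE_UP // RESAMPLE                           # 256 samples

frame_ms = 1000 * frame_length / SAMPLE_RATE   # about 37 ms
stride_ms = 1000 * stride / SAMPLE_RATE        # 16 ms


def output_samples(n_input: int) -> int:
    """Simplified model: samples emitted after receiving `n_input` samples.

    Nothing is emitted until one full frame is available; after that,
    every additional `stride` input samples yields `stride` output samples.
    """
    if n_input < frame_length:
        return 0
    n_frames = (n_input - frame_length) // stride + 1
    return n_frames * stride


if __name__ == "__main__":
    print(frame_length, frame_ms)   # 597 37.3125
    print(stride, stride_ms)        # 256 16.0
    print(output_samples(597))      # 256
    print(output_samples(597 + 256))  # 512
```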
Hello, sorry to disturb you. I read the paper and the code, but I am still confused about the frame length and frame shift of the audio.
In the Training paragraph, it says "With this setup, the causal DEMUCS processes audio has a frame size of 37 ms and a stride of 16 ms."
Why are the frame length and frame shift 37 ms and 16 ms? How are they calculated?
Hoping to hear from you.