Open Mountchicken opened 2 years ago
Hi Qing,
Thanks for the question.
I think either way works. The reason I chose to pad the fbank rather than the waveform was just to explicitly control the input shape (on the time dimension) of the network.
Is there an advantage you think padding the waveform would have?
-Yuan
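To make the "control the input shape on the time dimension" point concrete, here is a minimal sketch of padding or truncating an fbank-like feature matrix to a fixed number of frames. It is not the repository's actual code: the function name `pad_or_truncate`, the zero pad value, and the use of plain Python lists in place of tensors are all assumptions made for illustration.

```python
def pad_or_truncate(fbank, target_len, pad_value=0.0):
    """Return a copy of `fbank` (a list of frames) with exactly `target_len` frames.

    Each frame is a list of per-bin values. Only the time dimension changes;
    the number of frequency bins is left untouched.
    """
    n_frames = len(fbank)
    if n_frames >= target_len:
        # Too long: keep the first `target_len` frames.
        return [frame[:] for frame in fbank[:target_len]]
    # Too short: append constant frames at the end.
    n_bins = len(fbank[0]) if fbank else 0
    padding = [[pad_value] * n_bins for _ in range(target_len - n_frames)]
    return [frame[:] for frame in fbank] + padding


# Hypothetical 2-frame, 2-bin input.
short = [[1.0, 2.0], [3.0, 4.0]]
padded = pad_or_truncate(short, 4)    # 4 frames, last two are all zeros
clipped = pad_or_truncate(short, 1)   # 1 frame
```

Because the target is expressed in frames, the network's input shape is fixed directly, without having to work out how many waveform samples would produce that many frames.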
Hi @YuanGongND Thanks for the prompt reply.
In CV, we resize and pad images of different sizes to form a batch, and in speech, fbank is similar to an image since it is also a 2D tensor, so it is reasonable to pad it. But my concern is that fbank is generated from the waveform. Padding the waveform is like continuing to record a bit more sound after the recording has finished, whereas padding the fbank directly seems less intuitive.
I agree that padding the waveform could be better for the last element of the tensor. But in practice, for a sequence of hundreds of elements, the impact is minor.
And I got another naive question, does it make sense to resize fbank to a target size, just like what is done in CV?
I do not believe so. For the frequency dimension, all samples should already have the same size, so there is no need to resize. For the time dimension, resizing means time warping, which is usually undesired.
-Yuan
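The earlier point, that padding the waveform mainly matters for the last element of the sequence, can be illustrated with a toy comparison. This is only a sketch under stated assumptions: the feature is a per-frame energy (sum of squares) rather than a real fbank, so silence maps to exactly 0, unlike a real log-mel front end. The frame length, hop, and target length are hypothetical values chosen to keep the example small.

```python
def frame_energy(wave, frame_len, hop):
    """Toy feature: sum of squares over each full frame (no edge padding)."""
    if len(wave) < frame_len:
        return []
    n_frames = 1 + (len(wave) - frame_len) // hop
    return [
        sum(x * x for x in wave[i * hop : i * hop + frame_len])
        for i in range(n_frames)
    ]


FRAME_LEN, HOP = 4, 2      # hypothetical values, in samples
TARGET_SAMPLES = 20        # hypothetical target waveform length
wave = [1.0] * 10          # a short constant "recording"

# Option A: zero-pad the waveform, then compute features.
padded_wave = wave + [0.0] * (TARGET_SAMPLES - len(wave))
feats_a = frame_energy(padded_wave, FRAME_LEN, HOP)

# Option B: compute features, then zero-pad them to the same frame count.
feats_b = frame_energy(wave, FRAME_LEN, HOP)
feats_b = feats_b + [0.0] * (len(feats_a) - len(feats_b))

# Indices of frames where the two options disagree.
differing = [i for i, (a, b) in enumerate(zip(feats_a, feats_b)) if a != b]
print(differing)  # [4]
```

In this toy setting the two options differ only in the single frame that straddles the end of the recording (half signal, half padding). All frames before it are identical, and all frames after it are zero either way.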
Hi, for audio of different lengths, the padding operation in the dataset is applied to the fbank. So why not pad the waveform first and then convert it to fbank?