HashiamKadhim closed 5 years ago
@HashiamKadhim I think I was trying to produce as many utterances as possible at the time. There are many possible data preparation strategies. For example, for each mini-batch, we could randomly sample the location of the window (e.g., given a 10 s audio clip and an utterance length of 1 s, an utterance could be [3.1 s, 4.1 s]). However, this might incur some overhead from loading the long audio each time. Instead, I chose to slice the utterances as a pre-processing step.
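For reference, the two strategies described above could be sketched roughly like this. It is only a minimal illustration, not the repository's code: the names `random_window` and `slice_fixed`, the 16 kHz sampling rate, and the non-overlapping hop are all assumptions made for the example.

```python
import random

def random_window(num_samples, win_samples, rng=random):
    """Pick one random window (start, end) inside a signal of num_samples samples."""
    if win_samples > num_samples:
        raise ValueError("window is longer than the signal")
    # randint is inclusive on both ends, so the last valid start is allowed
    start = rng.randint(0, num_samples - win_samples)
    return start, start + win_samples

def slice_fixed(num_samples, win_samples, hop_samples=None):
    """Pre-compute every fixed window once; the hop defaults to non-overlapping."""
    hop = hop_samples or win_samples
    return [(s, s + win_samples)
            for s in range(0, num_samples - win_samples + 1, hop)]

sr = 16000  # assumed sampling rate, not taken from the thread

# Pre-processing strategy: slice a 10 s clip into 1 s utterances up front.
windows = slice_fixed(10 * sr, 1 * sr)

# Per-batch strategy: draw a fresh 1 s window from the same clip each time.
start, end = random_window(10 * sr, 1 * sr)
```

The pre-sliced version pays the loading cost once, while the random version sees more varied windows but has to touch the long audio on every draw.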
I'm just wondering why you use the following lines:
There are many examples where S.shape[1] < 360, which causes repeated frames in utterance_spec. Could this pose a problem?
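To put a number on the concern, here is a small sketch. It assumes, hypothetically, that the lines in question take one fixed-length window from the start of `S` and one from the end, and that 360 is twice that window length (180 frames). The actual lines are not shown here, so the helper `overlap_frames` is illustrative only.

```python
def overlap_frames(n_frames, window=180):
    """Count the frames shared by the first and the last `window` frames.

    Assumption: two windows are taken, S[:, :window] and S[:, -window:].
    """
    if n_frames < window:
        raise ValueError("spectrogram is shorter than a single window")
    # The two windows start to overlap once n_frames drops below 2 * window.
    return max(0, 2 * window - n_frames)

# Hypothetical spectrogram lengths (number of frames, i.e. S.shape[1]).
for n in (400, 360, 300, 180):
    print(n, overlap_frames(n))
```

Under that assumption, a spectrogram of exactly 180 frames would yield two identical utterances, and anything between 180 and 360 frames would yield partially duplicated ones.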