Janghyun1230 / Speaker_Verification

Tensorflow implementation of "Generalized End-to-End Loss for Speaker Verification"
MIT License
349 stars, 104 forks

Why take first and last 180 frames from the utterance spectrogram? #7

Closed HashiamKadhim closed 5 years ago

HashiamKadhim commented 5 years ago

I'm just wondering why you use the following lines:

utterances_spec.append(S[:, :config.tisv_frame])    # first 180 frames of partial utterance
utterances_spec.append(S[:, -config.tisv_frame:])   # last 180 frames of partial utterance

There are many examples where S.shape[1] < 360, which causes repeated frames in utterances_spec. Could this pose a problem?
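To illustrate the concern: a small sketch (not from the repo) showing that when a spectrogram has fewer than 2 × 180 frames, the first-180 and last-180 slices share frames. The array shape and frame counts here are hypothetical:

```python
import numpy as np

TISV_FRAME = 180  # corresponds to config.tisv_frame in the repo

# Hypothetical spectrogram with 250 frames (< 2 * 180), so the two
# slices overlap by 2 * 180 - 250 = 110 frames.
S = np.arange(40 * 250).reshape(40, 250)

first = S[:, :TISV_FRAME]    # first 180 frames of partial utterance
last = S[:, -TISV_FRAME:]    # last 180 frames of partial utterance

overlap = 2 * TISV_FRAME - S.shape[1]
# The tail of `first` and the head of `last` are the same frames:
assert np.array_equal(first[:, -overlap:], last[:, :overlap])
print(overlap)  # 110 shared frames
```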

Janghyun1230 commented 5 years ago

@HashiamKadhim I think I tried to make as many utterances as possible at the time. There can be many data-preparation strategies. For example, for each mini-batch we could randomly sample the location of the window (e.g., given 10s of audio, an utterance could be [3.1s, 4.1s] if the utterance length is 1s). However, this might incur some overhead from loading the long audio each time. Instead, I chose to slice the utterances as a pre-processing step.
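The random-window alternative described above could be sketched as follows. This is a hypothetical helper, not code from the repo; the function name, the `tisv_frame` default, and the mel-bin count are assumptions:

```python
import numpy as np

def random_window(S, tisv_frame=180, rng=None):
    """Sample one partial utterance at a random frame offset
    (a sketch of the per-batch random-window strategy, not the repo's code)."""
    rng = rng or np.random.default_rng()
    n_frames = S.shape[1]
    if n_frames <= tisv_frame:
        # Too short to slide a window: return the whole spectrogram.
        return S
    start = rng.integers(0, n_frames - tisv_frame + 1)
    return S[:, start:start + tisv_frame]

# Hypothetical 40-mel spectrogram, 1000 frames (~10s of audio)
S = np.random.randn(40, 1000)
win = random_window(S)
print(win.shape)  # (40, 180)
```

The trade-off the author points out: this gives more varied training windows but requires loading (or keeping) the full-length audio for every batch, whereas pre-slicing fixed windows pays that cost once during preprocessing.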