WeidiXie / VGG-Speaker-Recognition

Utterance-level Aggregation For Speaker Recognition In The Wild

Why extend audio? #28

Closed happypanda5 closed 5 years ago

happypanda5 commented 5 years ago

In the function load_wav_Predict, why did you need to extend the audio file? (See the code below.)

I cannot think of a reason why one would need to extend the time signal.

import librosa
import numpy as np

def load_wav_Predict(vid_path, sr):
    # Load the audio at the requested sampling rate.
    wav, sr_ret = librosa.load(vid_path, sr=sr)
    assert sr_ret == sr
    # Append a time-reversed copy of the signal to itself.
    extended_wav = np.append(wav, wav[::-1])
    return extended_wav
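
For reference, a minimal check of what the function returns (`sample.wav` and the 16 kHz rate are placeholder values, not anything from this thread):

    wav, _ = librosa.load('sample.wav', sr=16000)
    ext = load_wav_Predict('sample.wav', sr=16000)
    # The extended signal is twice as long, and its second
    # half is the original signal reversed in time.
    assert len(ext) == 2 * len(wav)
    assert np.array_equal(ext[len(wav):], wav[::-1])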
WeidiXie commented 5 years ago

There's no specific reason; it's a simple flip, used as data augmentation.

happypanda5 commented 5 years ago

Okay.

seungwonpark commented 5 years ago

Why did you append the flipped wav for mode='eval'? Applying this kind of data preprocessing during the evaluation phase should not be regarded as data augmentation, in my opinion.
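
For anyone who wants to try the change being suggested here, a minimal sketch of a mode-conditional loader that applies the flip only during training; the function name and `mode` argument are my own, not from the repo:

    import librosa
    import numpy as np

    def load_wav_conditional(vid_path, sr, mode='train'):
        # Load the audio at the requested sampling rate.
        wav, sr_ret = librosa.load(vid_path, sr=sr)
        assert sr_ret == sr
        if mode == 'train':
            # Apply the time-reversal extension only as a
            # training-time augmentation; leave eval audio as-is.
            wav = np.append(wav, wav[::-1])
        return wav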

happypanda5 commented 5 years ago

> Applying this kind of data preprocessing during the evaluation phase should not be regarded as data augmentation, in my opinion.

I agree.

However, if Weidi got excellent results using this data as input, then we might as well use it. Another option is to retrain the network without such input, which I do not have time to do at the moment.

@WeidiXie: I must say that I really like your work. Can you share other trained models?

WeidiXie commented 5 years ago

I did that after reading this paper: https://arxiv.org/abs/1807.08312

Well, this is my first attempt at audio-related tasks, so it's likely that many parts could be improved by doing things more carefully, e.g. the reading pipeline, data augmentation, hyperparameters, etc.

Yes, I can share the other models. At the moment I'm a bit busy with other projects, but I'll try to release them as soon as possible.

Best, Weidi