JunyiPeng00 / SLT22_MultiHead-Factorized-Attentive-Pooling

An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification

Reproduction of MHFA with Baseline.yaml does not converge to the same results as the article #1

Open · theolepage opened this issue 3 weeks ago

theolepage commented 3 weeks ago

Hello,

First of all, thank you very much for your work!

I am trying to reproduce your results by applying MHFA (32 heads) to WavLM Base+, which should reach 0.71% EER on VoxCeleb1-O (Table 4). I haven't made any changes to the source code and I am using the provided Baseline.yaml config file.
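For context, my understanding of the MHFA pooling I am training is sketched below: a minimal PyTorch re-implementation, with the class name, compressed dimension, and embedding size being my own assumptions rather than the repo's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MHFAPooling(nn.Module):
    # Illustrative sketch of Multi-Head Factorized Attentive pooling:
    # layer-weighted keys/values, then per-head softmax attention over time.
    # Defaults assume WavLM Base+ (12 layers + CNN output = 13 hidden
    # states, dim 768); compress_dim and emb_dim are my own assumptions.
    def __init__(self, n_layers=13, feat_dim=768, compress_dim=128,
                 n_heads=32, emb_dim=256):
        super().__init__()
        self.w_k = nn.Parameter(torch.zeros(n_layers))  # layer weights for keys
        self.w_v = nn.Parameter(torch.zeros(n_layers))  # layer weights for values
        self.key_proj = nn.Linear(feat_dim, compress_dim)
        self.val_proj = nn.Linear(feat_dim, compress_dim)
        self.heads = nn.Linear(compress_dim, n_heads)   # one score per head per frame
        self.out = nn.Linear(compress_dim * n_heads, emb_dim)

    def forward(self, x):
        # x: (batch, n_layers, time, feat_dim), stacked transformer hidden states
        wk = F.softmax(self.w_k, dim=0).view(1, -1, 1, 1)
        wv = F.softmax(self.w_v, dim=0).view(1, -1, 1, 1)
        k = self.key_proj((x * wk).sum(dim=1))          # (B, T, D)
        v = self.val_proj((x * wv).sum(dim=1))          # (B, T, D)
        att = F.softmax(self.heads(k), dim=1)           # (B, T, H), softmax over time
        pooled = torch.einsum('bth,btd->bhd', att, v)   # per-head weighted sum
        return self.out(pooled.flatten(start_dim=1))    # (B, emb_dim)
```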

However, the EER is around 1.00% on VoxCeleb1-O when evaluating after 10 epochs. Please refer to Eval_scores_mean_O_All.txt for the exact output of the evaluation script.
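For completeness, the EER I report is computed the standard way from the trial scores; a minimal sketch, assuming scikit-learn (`compute_eer` is my own helper name):

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(scores, labels):
    # EER: operating point where the false-accept rate equals the
    # false-reject rate; returned as a fraction (multiply by 100 for %).
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0
```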

Do you have any idea why I obtain a different result from the one reported in the article?

Thanks in advance.

JunyiPeng00 commented 2 weeks ago

Hello,

Thank you for reaching out and for your interest in using the MHFA method with WavLM Base+. I recently rewrote the whole pipeline with the wespeaker toolkit, which you can find here. The results are as follows:

| Model | AS-Norm | LMFT | QMF | vox1-O-clean EER (%) | vox1-E-clean EER (%) | vox1-H-clean EER (%) |
| --- | --- | --- | --- | --- | --- | --- |
| WavLM Base Plus + MHFA | × | × | | 0.750 | 0.716 | 1.442 |
| WavLM Large + MHFA | × | × | | 0.649 | 0.610 | 1.235 |
theolepage commented 2 weeks ago

Thank you for your reply.

Are the results in Table 4 computed with AS-Norm or any other score normalization, model averaging, or calibration technique? That could explain the discrepancy between the Table 4 results and the one I get with this repository.
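(By AS-Norm I mean adaptive score normalization, where each trial score is standardized against the top-k cohort scores on both the enrollment and test sides; a minimal sketch, with `as_norm` and `top_k=300` as my own illustrative choices:)

```python
import numpy as np

def as_norm(score, enroll_cohort, test_cohort, top_k=300):
    # Adaptive symmetric score normalization: normalize a trial score
    # against the statistics of the top-k most similar cohort scores
    # on each side, then average the two normalized scores.
    e_top = np.sort(enroll_cohort)[-top_k:]
    t_top = np.sort(test_cohort)[-top_k:]
    return 0.5 * ((score - e_top.mean()) / e_top.std()
                  + (score - t_top.mean()) / t_top.std())
```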

In the wespeaker toolkit, have you changed the MHFA code or just the pipeline (LMFT, score normalization/calibration, ...)?