facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Unexpected result on finetuned model Wav2Vec2 (with TIMIT dataset) #3043

Open osddeitf opened 3 years ago

osddeitf commented 3 years ago

❓ Questions and Help

I'd like some help with preparing a dataset for fine-tuning the Wav2Vec2 model.

Long story short

I wanted to build a small personal tool for recognizing English phonemes, and after a few hours of searching I found this repo and the Wav2Vec2 model, which reports strong empirical results on phoneme recognition with the TIMIT dataset.

I fine-tuned it on the TIMIT dataset, but when I run inference with the fine-tuned model on a single .wav file from the dataset, all I get is this:

bcl

or, in some rare cases, this:

axr

or even nothing (empty string) at all.

I've tried audio from other sources as well, single channel and 16 kHz sample rate as suggested (a conversion sketch follows below). What I expected was at least a long sequence of phonemes, with no claim to accuracy, just as an experiment before digging deeper. But as shown above, the output is just WRONG, not merely INACCURATE.
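For reference, this is roughly how audio can be converted to that format (a minimal sketch, assuming torchaudio is installed; the file names are hypothetical):

```python
# A minimal sketch for converting audio to mono 16 kHz,
# assuming torchaudio is installed; file names are hypothetical.
import torchaudio

wav, sr = torchaudio.load("input.wav")       # shape: (channels, samples)
wav = wav.mean(dim=0, keepdim=True)          # downmix to a single channel
if sr != 16000:
    wav = torchaudio.transforms.Resample(sr, 16000)(wav)
torchaudio.save("input_16k_mono.wav", wav, 16000)
```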

What I've tried

I fine-tuned wav2vec following the README at https://github.com/pytorch/fairseq/tree/master/examples/wav2vec, and after about 100 epochs I tried to evaluate it.

Then I found the well-known issue #2651 and tried to use the recognize.py code provided there with my new model. But it wasn't as easy as I thought; my model was somehow not compatible.

After several more hours I finally managed to run my model by modifying recognize.py like this:

from dataclasses import dataclass, field

import torch
from omegaconf import OmegaConf

from fairseq.dataclass.configs import FairseqConfig
from fairseq.models.wav2vec.wav2vec2_asr import Wav2Vec2AsrConfig, Wav2VecCtc

@dataclass
class Wav2Vec2CheckpointConfig(FairseqConfig):
    # default_factory avoids dataclasses' mutable-default error on newer Pythons
    model: Wav2Vec2AsrConfig = field(default_factory=Wav2Vec2AsrConfig)

def load_model(model_path, target_dict):
    w2v = torch.load(model_path)
    # relax struct mode so the merge doesn't reject undeclared keys
    OmegaConf.set_struct(w2v["cfg"], False)
    cfg = OmegaConf.merge(OmegaConf.structured(Wav2Vec2CheckpointConfig), w2v["cfg"])
    model = Wav2VecCtc.build_model(cfg.model, target_dict)
    model.load_state_dict(w2v["model"], strict=True)
    return model
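With that helper in place, inference can be sketched roughly like this (hypothetical paths; greedy CTC decoding with the blank assumed at index 0, as in fairseq's CTC setup):

```python
# A sketch of running the loaded model on one file; paths are hypothetical.
# Assumes mono 16 kHz input and fairseq's usual CTC setup (blank = index 0).
import soundfile as sf
import torch
from fairseq.data import Dictionary

target_dict = Dictionary.load("dict.ltr.txt")
model = load_model("checkpoint_best.pt", target_dict)
model.eval()

wav, sr = sf.read("sample.wav")
source = torch.from_numpy(wav).float().unsqueeze(0)  # (batch=1, samples)
# NOTE: if the model was trained with task.normalize=True, the waveform
# must be layer-normalized first, as fairseq's audio dataset does.

with torch.no_grad():
    logits = model(source=source, padding_mask=None)["encoder_out"]  # (T, 1, V)

pred = logits.argmax(dim=-1).squeeze()   # best token per frame
pred = torch.unique_consecutive(pred)    # collapse CTC repeats
pred = pred[pred != 0]                   # drop the blank token
print(target_dict.string(pred))
```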

The reason is that the newer (Hydra-based) training method doesn't save anything in args, so I needed to reconstruct the configuration from the checkpoint's new cfg property. I even tested this on the good old base model and printed the config to verify it.
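For anyone hitting the same thing, a quick way to check which style a checkpoint uses (the path is hypothetical):

```python
# Hydra-era checkpoints carry their configuration under "cfg",
# while "args" is typically None; path is hypothetical.
import torch

ckpt = torch.load("checkpoint_best.pt", map_location="cpu")
print(ckpt.get("args"))    # old argparse-style config, or None for Hydra checkpoints
print(ckpt["cfg"].keys())  # Hydra config groups, e.g. 'model', 'task', ...
```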

Digging further and printing the model parameters at runtime, I'm fairly sure I'm not missing anything there. The only remaining suspect was incorrect preparation of the fine-tuning dataset, so I tried a couple of different ways:

attempt_1.ltr

bcl sh iy hh ae bcl d y axr bcl d aa r bcl k s uw dx ih eng bcl g r iy s iy w aa sh bcl w aa dx axr bcl aa el y iy axr bcl
bcl hh ih z bcl k ae bcl t ih en w ah s th ih en ae en hh ae bcl g axr bcl d ih en ih z bcl b y uw dx uw f el bcl b uw bcl t s bcl w axr w aa r en ih en bcl sh ae bcl b iy bcl
bcl dh ih r iy z ah en z f axr dh ih s bcl d ay v s iy em bcl d f uw el ih sh bcl en aw bcl
...

attempt_1.wrd

she had your dark suit in greasy wash water all year
his captain was thin and haggard and his beautiful boots were worn and shabby
the reasons for this dive seemed foolish now
...

attempt_1.dict.ltr.txt

aa 1
ae 1
ah 1
aw 1
ay 1
...

attempt_2.ltr

sh iy hh ae d y axr d aa r k s uw dx ih eng g r iy s iy w aa sh w aa dx axr aa el y iy axr |
hh ih z k ae t ih en w ah s th ih en ae en hh ae g axr d ih en ih z b y uw dx uw f el b uw t s w axr w aa r en ih en sh ae b iy |
dh ih r iy z ah en z f axr dh ih s d ay v s iy em d f uw el ih sh en aw |

attempt_2.wrd

bcldowenaesbclemiybcltihbclkehriyihenoyeliyraebclgelaybclkdhaebclbcl
bcldraabclfayvfaaremzahendhahbclbaabclksbclbahfaaryihbclgowawbclbcl
bclbclehelbcldaxreliybclpiybclpelaxraafihenihbclbclksbclkeluwbcldihbcldbcl

dict.ltr.txt

| 1
aa 1
ae 1
ah 1
...

I combined every approach I could think of, like adding | as the first entry of dict.ltr.txt, or converting the words to phonemes (a label-generation sketch follows below), but unfortunately I just couldn't get it to work.
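For concreteness, here is roughly one way to generate the phoneme label files (a sketch assuming TIMIT's .PHN transcripts and a train.tsv manifest already produced by wav2vec_manifest.py; paths and the trailing | separator are assumptions):

```python
# A sketch of building train.ltr and dict.ltr.txt from TIMIT .PHN files,
# given a train.tsv manifest; paths and the trailing "|" are assumptions.
import collections
import os

counts = collections.Counter()
with open("train.tsv") as tsv, open("train.ltr", "w") as ltr:
    root = tsv.readline().strip()                 # first manifest line is the root dir
    for line in tsv:
        rel_path = line.split("\t")[0]
        phn_path = os.path.join(root, rel_path).rsplit(".", 1)[0] + ".PHN"
        with open(phn_path) as f:
            phones = [l.split()[2] for l in f]    # .PHN lines: "<start> <end> <phone>"
        counts.update(phones + ["|"])
        ltr.write(" ".join(phones) + " |\n")

# fairseq dictionaries are "<symbol> <count>" per line, most frequent first
with open("dict.ltr.txt", "w") as d:
    for sym, n in counts.most_common():
        d.write(f"{sym} {n}\n")
```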

I've dug into various issues on this repo (like #2922), but that only left me more confused (even though I managed to run the scripts fine), as described above. After something like a hundred attempts, costing me about a hundred hours of EC2 time on AWS, I still couldn't get a better result.

I would REALLY appreciate any help. And if you wish, I'll gladly buy you a few cups of coffee.


vocaliodmiku commented 3 years ago

Hi there, I've hit the same issue you describe above. I followed the guide from #2922, but the validation WER always stays above 90. Have you made any progress?

osddeitf commented 3 years ago

No progress either, and my work no longer needs this model. Sorry.

elgeish commented 3 years ago

I did it using Hugging Face; feel free to try it yourself: https://github.com/huggingface/transformers/pull/10581 https://github.com/elgeish/transformers/blob/e72e6e5a3fe2547432005d2ffe3208f8d84cbe02/examples/research_projects/wav2vec2/run_asr.py
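For context, the Hugging Face route looks roughly like this (a minimal sketch assuming the transformers library; the model id is one of elgeish's published TIMIT checkpoints and is an assumption here, so substitute your own fine-tuned checkpoint):

```python
# A minimal sketch of CTC inference with Hugging Face transformers;
# the model id is an assumption -- substitute your own checkpoint.
import soundfile as sf
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_id = "elgeish/wav2vec2-base-timit-asr"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

speech, rate = sf.read("sample.wav")  # mono, 16 kHz
inputs = processor(speech, sampling_rate=rate, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```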

stale[bot] commented 3 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!