I'm not sure which part you're saying is different. Could you point out the specific place?
For the index of wav2vec2 transformer encoder here, https://github.com/dhchoi99/NANSY/blob/2440ec77a7f0962a0a335ba7949a29c5798c3224/models/analysis.py#L38 https://github.com/dhchoi99/NANSY/blob/2440ec77a7f0962a0a335ba7949a29c5798c3224/models/analysis.py#L82
I thought using 1 and 12 was correct, since transformers' wav2vec2 transformer encoder implementation prepends the PositionalConvEmbedding output to the tuple of hidden states.
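For reference, here is a minimal sketch (not from this repo) that checks that indexing convention in Hugging Face transformers; the XLSR-53 checkpoint name below is just an example and may differ from what the repo loads:

```python
import torch
from transformers import Wav2Vec2Model

# Example checkpoint (assumption); any wav2vec2 checkpoint exposes the same
# hidden_states layout when output_hidden_states=True.
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53").eval()

wav = torch.randn(1, 16000)  # one second of dummy 16 kHz audio
with torch.no_grad():
    out = model(wav, output_hidden_states=True)

# hidden_states has length num_hidden_layers + 1:
#   hidden_states[0] -> feature projection + PositionalConvEmbedding output,
#                       i.e. the input to the first transformer layer
#   hidden_states[i] -> output of transformer layer i (1-indexed)
print(len(out.hidden_states))      # 25 for the 24-layer XLSR-53 model
print(out.hidden_states[1].shape)  # (1, frames, hidden_size)
print(out.hidden_states[12].shape)
```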
1 and 12 are correct, but in your code you use 1 for Linguistic and 12 for Speaker. It should be 12 for Linguistic and 1 for Speaker. https://github.com/dhchoi99/NANSY/blob/2440ec77a7f0962a0a335ba7949a29c5798c3224/models/analysis.py#L82
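In other words, the fix is just to swap the two indices, something along these lines (a hypothetical sketch with made-up variable names, not the actual repo code):

```python
# Only the index-to-branch mapping matters here.
hidden_states = wav2vec2(wav, output_hidden_states=True).hidden_states

linguistic_input = hidden_states[12]  # 12th transformer layer -> Linguistic
speaker_input = hidden_states[1]      # 1st transformer layer  -> Speaker
```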
What a mistake... I really appreciate you pointing that out. I'll fix that. Thanks!!
Happy to help, let me know if you manage to improve your results :)
Hi, I found the same mistake in the train_torch.py file.
The speaker input should use layer 1 and the linguistic input should use layer 12 (as in the paper). I noticed in your implementation it is the other way around; did you get satisfactory results despite that?