I'm not sure which part you're saying is different. Could you point out the specific place?
For the index of wav2vec2 transformer encoder here, https://github.com/dhchoi99/NANSY/blob/2440ec77a7f0962a0a335ba7949a29c5798c3224/models/analysis.py#L38 https://github.com/dhchoi99/NANSY/blob/2440ec77a7f0962a0a335ba7949a29c5798c3224/models/analysis.py#L82
I thought using 1 and 12 was correct, since transformers' wav2vec2 transformer encoder implementation prepends the PositionalConvEmbedding output to the tuple of hidden states.
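For reference, here is a minimal sketch (not from this repo) that checks that indexing convention in Hugging Face transformers; the XLSR-53 checkpoint name below is just an example and may differ from what the repo loads:

```python
import torch
from transformers import Wav2Vec2Model

# Example checkpoint (assumption); any wav2vec2 checkpoint exposes the same
# hidden_states layout when output_hidden_states=True.
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53").eval()

wav = torch.randn(1, 16000)  # one second of dummy 16 kHz audio
with torch.no_grad():
    out = model(wav, output_hidden_states=True)

# hidden_states has length num_hidden_layers + 1:
#   hidden_states[0] -> feature projection + PositionalConvEmbedding output,
#                       i.e. the input to the first transformer layer
#   hidden_states[i] -> output of transformer layer i (1-indexed)
print(len(out.hidden_states))      # 25 for the 24-layer XLSR-53 model
print(out.hidden_states[1].shape)  # (1, frames, hidden_size)
print(out.hidden_states[12].shape)
```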
1 and 12 are correct, but in your code you use 1 for Linguistic and 12 for Speaker. It should be 12 for Linguistic and 1 for Speaker. https://github.com/dhchoi99/NANSY/blob/2440ec77a7f0962a0a335ba7949a29c5798c3224/models/analysis.py#L82
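In other words, the fix is just to swap the two indices, something along these lines (a hypothetical sketch with made-up variable names, not the actual repo code):

```python
# Only the index-to-branch mapping matters here.
hidden_states = wav2vec2(wav, output_hidden_states=True).hidden_states

linguistic_input = hidden_states[12]  # 12th transformer layer -> Linguistic
speaker_input = hidden_states[1]      # 1st transformer layer  -> Speaker
```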
What a mistake... I really appreciate you pointing that out. I'll fix that. Thanks!!
Happy to help, let me know if you manage to improve your results :)
Hi, I found the same mistake in the train_torch.py file.
The speaker input should use layer 1 and the linguistic input should use layer 12 (as in the paper). I noticed in your implementation it is the other way around; did you get satisfactory results despite that?