Closed: fym0503 closed this issue 2 years ago
Hi,
There are two scenarios:
An alternative way is to train a model with each embedding under different settings (e.g., the last layer vs. the last four layers, with vs. without fine-tuning) and compare model accuracy to decide the final usage of each embedding.
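For concreteness, here is a minimal sketch of the two layer settings mentioned above. It is not the authors' code: it assumes the Hugging Face `transformers` API and uses `xlm-roberta-base` as a stand-in model.

```python
# Sketch (assumed, not from this repo): extracting the last layer
# vs. the average of the last four layers as token embeddings.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base", output_hidden_states=True)
model.eval()  # frozen feature extraction, i.e. the w/o fine-tuning setting

inputs = tokenizer("an example sentence", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple of (num_layers + 1) tensors,
# each of shape (batch, seq_len, hidden_size)
hidden_states = outputs.hidden_states

# Setting 1: last layer only
last_layer = hidden_states[-1]

# Setting 2: average of the last four layers
last_four = torch.stack(hidden_states[-4:]).mean(dim=0)
```

For the w/ fine-tuning setting, you would instead leave the encoder trainable and backpropagate the task loss through it rather than calling `model.eval()` under `torch.no_grad()`.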
By the way, we use the first subtoken as the representation of each token.
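As another assumed sketch (not from the repo), first-subtoken pooling can be done with a fast tokenizer's `word_ids()` mapping, which records which original token each subword piece came from:

```python
# Sketch: keep the hidden state of the first subword piece of each
# original token, so the output has one vector per input token.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

words = ["Transformers", "tokenize", "aggressively"]  # pre-split tokens
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    reps = model(**enc).last_hidden_state[0]  # (seq_len, hidden_size)

token_reps, seen = [], set()
for idx, word_id in enumerate(enc.word_ids()):
    if word_id is not None and word_id not in seen:
        seen.add(word_id)             # first subtoken of this word
        token_reps.append(reps[idx])
token_reps = torch.stack(token_reps)  # one vector per original token
```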
Thanks, your comments are very clear. I have done some similar experiments, and my conclusion is nearly the same.
Hi, I have read your interesting paper and code. My question is: since BERT and XLM-R have many layers, I wonder what kind of embeddings you use. Just the word embeddings, or a mixture of intermediate-layer representations? Did you find a difference between these options? Thanks!