UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Use Word Embeddings for Multi-Lingual Downstream Task #856

Open HariWu1995 opened 3 years ago

HariWu1995 commented 3 years ago

My task is multi-label text classification. Initially, I used a pretrained monolingual BERT to produce word embeddings for my model, and it works well: accuracy ~50% and top-3 accuracy ~80%. Now I need more general word embeddings for transfer learning, so the model can quickly adapt to another language. I have tested mBERT and your pretrained models, and I found that the distribution of cross-lingual similarities from your model is better (my test was on sentence-level embeddings). However, I need word-level embeddings to feed into my model. When I set `output_value='token_embeddings'`, my model performs quite poorly: the best fine-tuned result is ~20% accuracy and ~40% top-3 accuracy. So my question is: are your pretrained models suitable as word-level embeddings for a downstream task? If yes, which version should I use? I am currently using distiluse-v2. Thank you in advance.
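
For reference, a minimal sketch of how token-level embeddings can be pulled out of a Sentence-Transformers model via the `output_value='token_embeddings'` argument mentioned above (the model name and sentences are illustrative):

```python
from sentence_transformers import SentenceTransformer

# Illustrative checkpoint; any Sentence-Transformers model can be loaded here.
model = SentenceTransformer("distiluse-base-multilingual-cased-v2")

sentences = ["This is an example sentence.", "Dies ist ein Beispielsatz."]

# output_value="token_embeddings" returns one tensor per sentence of shape
# (num_wordpiece_tokens, hidden_dim) instead of a single pooled sentence vector.
token_embeddings = model.encode(sentences, output_value="token_embeddings")

for sent, emb in zip(sentences, token_embeddings):
    print(sent, emb.shape)
```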

nreimers commented 3 years ago

The model was not really trained for word-level embeddings. Also, the distiluse model uses a dense 768x512 layer on top of the pooling layer, which might skew the word embeddings. Maybe the other multilingual models work better.
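
As an illustration, printing a loaded model shows its module stack, including the Dense projection mentioned above; the alternative multilingual checkpoint below is only a suggestion for comparison, not something confirmed in this thread:

```python
from sentence_transformers import SentenceTransformer

# For distiluse-base-multilingual-cased-v2 the printed module stack ends with a
# Dense layer projecting the 768-dim pooled output down to 512 dims.
model = SentenceTransformer("distiluse-base-multilingual-cased-v2")
print(model)

# A multilingual model without that extra projection (suggested alternative,
# not verified for this task) for comparison at the token level:
alt_model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
print(alt_model)
```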