UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

the 'make_multilingual.py' result is poor #1164

Open ScottishFold007 opened 2 years ago

ScottishFold007 commented 2 years ago

I have tried the example "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation". I ran the example directly without changing any parameters, but the result looks terrible and I don't know what the problem is. Here are the results in Colab. Can you give me some insight? Thanks!

LaLlorona commented 2 years ago

I have the same problem. I tested it on the en-de and en-ko datasets, and the results are poor. Here is my Colab notebook; I ran it for about 12 hours. As you can see, the correlation does not improve during training, and every sentence similarity is 1.0.

LaLlorona commented 2 years ago

When I changed the teacher model from 'all-distilroberta-v1' to 'bert-base-nli-stsb-mean-tokens', it produced much better results. Currently, I'm trying English-Japanese multilingual learning, and it gives about 80% accuracy on translation and about 70% similarity on the STS test at the 7000th step of the first epoch. I don't know why the default teacher model does not work, but it seems that changing the teacher makes a better student.
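For reference, a minimal sketch of the swap, assuming the teacher_model_name variable from make_multilingual.py (model names are the ones from this thread):

```python
from sentence_transformers import SentenceTransformer

# Swapping the teacher in make_multilingual.py is a one-line change:
teacher_model_name = 'bert-base-nli-stsb-mean-tokens'  # instead of 'all-distilroberta-v1'
teacher_model = SentenceTransformer(teacher_model_name)
```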

nreimers commented 2 years ago

I will have a look on Monday. I think the reason could be the teacher model: the all-* models have a layer that normalizes embeddings to unit length.

Using the paraphrase-* models or disabling this normalization layer for the all-* models could help.
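For example, something like this sketch could strip the normalization layer, assuming the all-* pipeline is Transformer -> Pooling -> Normalize (check with print(model)):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-distilroberta-v1')
print(model)  # typically: Transformer -> Pooling -> Normalize

# Rebuild without the trailing Normalize module so the teacher
# produces unnormalized embeddings for distillation.
teacher = SentenceTransformer(modules=[model[0], model[1]])
```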

mathislucka commented 2 years ago

@nreimers did you have time to check whether the normalization layer is responsible for the poor results? I tried using one of the multi-qa models for multilingual knowledge distillation, and as far as I understand, they also have a normalization layer. How would I disable the normalization layer when loading the model with SentenceTransformer?

mathislucka commented 2 years ago

I tried it this way, but I'm still getting very poor results.

from sentence_transformers import SentenceTransformer

model_name = 'multi-qa-mpnet-base-cos-v1'
model = SentenceTransformer(model_name)

# Keep only the Transformer and Pooling modules, dropping the final
# Normalize module (index 2) from the pipeline.
model = SentenceTransformer(modules=[model[0], model[1]])

save_dir = 'multi-qa-mpnet-base-cos-v1-no-norm'  # placeholder output path
model.save(save_dir)

mathislucka commented 2 years ago

I also tried adding a normalization layer to my student model (xlm-roberta-base). The results remain poor: the translation accuracy falls during training and then stagnates at a very low value.
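For reference, a sketch of what I mean, assuming the standard Transformer + Pooling student setup from make_multilingual.py with models.Normalize() appended:

```python
from sentence_transformers import SentenceTransformer, models

# Build an xlm-roberta-base student with an extra Normalize module,
# so its outputs are unit-length like the all-*/multi-qa teachers'.
word_embedding_model = models.Transformer('xlm-roberta-base', max_seq_length=128)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
normalize = models.Normalize()
student = SentenceTransformer(modules=[word_embedding_model, pooling_model, normalize])
```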

nreimers commented 2 years ago

Sadly, I have not yet been able to investigate. I'm on vacation until Oct 15th; I will have a look then. Does the script work with the paraphrase-* models as teacher?

thuan00 commented 2 years ago

Duplicate of: https://github.com/UKPLab/sentence-transformers/issues/1068. Issues are piling up, huh.

Hi @nreimers, great work, I really appreciate it.

So I was trying to make those msmarco models bilingual (via make_multilingual.py) for research on a Vietnamese QA system.

I successfully distilled two msmarco-cos models into two paraphrase-* models of the same architecture.

No luck, however, for e.g. XLM-R learning from msmarco-bert-base-dot-v5.

Can I have some comments or suggestions on distilling those dot models?

It would be great if, say, XLM-R could learn from the best model, msmarco-bert-base-dot-v5.

I expect it could further improve the results of the experiment below.
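For context, a rough sketch of the distillation setup I'm describing, following make_multilingual.py's ParallelSentencesDataset + MSELoss recipe (the en-vi data path is a placeholder, not from the script):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.datasets import ParallelSentencesDataset

teacher = SentenceTransformer('msmarco-bert-base-dot-v5')
student = SentenceTransformer('xlm-roberta-base')  # plain HF checkpoint + mean pooling

# Parallel (source, translation) pairs; the student is trained to map
# both sides onto the teacher's embedding of the source sentence.
train_data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
train_data.load_data('parallel-sentences/en-vi-train.tsv.gz')

train_dataloader = DataLoader(train_data, shuffle=True, batch_size=32)
train_loss = losses.MSELoss(model=student)  # student mimics teacher embeddings

student.fit(train_objectives=[(train_dataloader, train_loss)],
            epochs=1, warmup_steps=1000)
```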

--

Here's the experiment:

training data and code: https://drive.google.com/drive/folders/11rQYSN3-OIpLdxPVPL_ZstT2hixpTbC-?usp=sharing
training output folder: https://drive.google.com/drive/folders/1lRxqCEpnFR_db-N-F_cgdGyHfdVHp7ss?usp=sharing

details:

I experimented with max_seq_length=256 and 384, and train_max_sentence_length=520 and 800 accordingly, so that there are no overflow tokens, which lead to bad training examples. The results don't vary significantly, maybe due to the small eval dataset.
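Concretely, those two knobs map to something like this sketch (the file path and teacher are placeholders):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.datasets import ParallelSentencesDataset

teacher = SentenceTransformer('multi-qa-mpnet-base-cos-v1')
student = SentenceTransformer('xlm-roberta-base')
student.max_seq_length = 256  # token limit for the student encoder

train_data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
# load_data can drop pairs above a character limit, so no training
# example overflows the encoder's max_seq_length when tokenized:
train_data.load_data('parallel-sentences/en-vi-train.tsv.gz', max_sentence_length=520)
```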

| teacher | student | MRR@10 before | MRR@10 after | MSE*100 |
| --- | --- | --- | --- | --- |
| msmarco-MiniLM-L12-cos-v5 | paraphrase-m-minilm-v2 | 45 | 64 | <0.1 |
| same, removed norm layer | paraphrase-m-minilm-v2 | 45 | 56 | >3 |
| msmarco-distilbert-cos-v5 | paraphrase-xlm-r | 48 | - | |
| msmarco-distilbert-cos-v5 | xlm-r | - | - | |
| multi-qa-mpnet-cos-v1 | paraphrase-m-mpnet-v2 | 51 | 72 | <0.05 |
| *dot models* | | | | |
| multi-qa-minilm-dot-v1 | reimers/m-minilm-v2 | - | 21 | >5 |
| multi-qa-mpnet-dot-v1 | paraphrase-xlm-r | 48 | - | >5 |
| msmarco-bert-base-dot-v5 | paraphrase-xlm-r | 48 | - | >2 |
| msmarco-bert-base-dot-v5 | xlm-r | - | - | >2 |

(-) indicates a bad score, i.e. a failed experiment (for reference, BM25 MRR@10 is about 63).
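For anyone reproducing these numbers, a hedged sketch of how MRR@10 can be computed with the library's InformationRetrievalEvaluator (the queries/corpus/relevance dicts below are placeholder data, not my eval set):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

queries = {'q1': 'example query text'}    # query id -> text (placeholder)
corpus = {'d1': 'example passage text'}   # doc id -> text (placeholder)
relevant_docs = {'q1': {'d1'}}            # query id -> relevant doc ids

ir_evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs,
                                             mrr_at_k=[10], name='en-vi-dev')

# e.g. the distilled student ("paraphrase-m-minilm-v2" above):
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
score = ir_evaluator(model)  # logs MRR@10 among other IR metrics
```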