ScottishFold007 opened this issue 2 years ago
I have the same problem. I tested it on the en-de and en-ko datasets, and the results are poor. Here is my Colab notebook; I ran it for about 12 hours. As you can see, the correlation does not improve during training, and every sentence similarity is 1.0.
When I changed the teacher model from 'all-distilroberta-v1' to 'bert-base-nli-stsb-mean-tokens', the results were much better. Currently I'm training an English-Japanese multilingual model, and it reaches about 80% translation accuracy and about 70% similarity on the STS test at the 7000th step of the first epoch. I don't know why the default teacher model doesn't work well, but changing the teacher seems to produce a much better student.
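For reference, swapping the teacher only requires loading a different model (a sketch; the corresponding variable name in make_multilingual.py may differ):

```python
from sentence_transformers import SentenceTransformer

# 'bert-base-nli-stsb-mean-tokens' is an older model that ends with mean
# pooling and, unlike the all-* models, has no final normalization layer
teacher_model = SentenceTransformer('bert-base-nli-stsb-mean-tokens')
```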
I will have a look on Monday. I think the reason could be the teacher model: the all-* models have a layer that normalizes embeddings to unit length.
Using the paraphrase-* models, or disabling this normalization layer for the all-* models, could help.
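A minimal sketch of one way to disable it, assuming the usual Transformer -> Pooling -> Normalize module layout of the all-* models:

```python
from sentence_transformers import SentenceTransformer, models

model = SentenceTransformer('all-distilroberta-v1')
print(model)  # prints the module list, e.g. Transformer -> Pooling -> Normalize

# Rebuild the model from the same modules, minus any Normalize layer
modules = [m for m in model if not isinstance(m, models.Normalize)]
model = SentenceTransformer(modules=modules)
```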
@nreimers did you have time to check whether the normalization layer is responsible for the poor results? I tried using one of the multi-qa models for multilingual knowledge distillation, and as far as I understand, they also have a normalization layer. How would I disable the normalization layer when loading the model with SentenceTransformer?
I tried it this way, but I'm still getting very poor results:
```python
from sentence_transformers import SentenceTransformer

model_name = 'multi-qa-mpnet-base-cos-v1'
model = SentenceTransformer(model_name)
# keep only the first two modules (Transformer + Pooling),
# dropping the trailing Normalize module
model = SentenceTransformer(modules=[model[0], model[1]])
model.save(save_dir)  # save_dir: my output directory
```
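To check whether the normalization layer is really gone, I look at the embedding norms (a quick sanity check, not part of the training script):

```python
import numpy as np

emb = model.encode(['a quick test sentence'])
print(np.linalg.norm(emb[0]))  # ~1.0 means a Normalize module is still active
```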
I also tried adding a normalization layer to my student model (xlm-roberta-base). The results remain poor: the translation accuracy falls during training and then stagnates at a very low level.
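For reference, this is roughly how such a student can be assembled (a sketch; max_seq_length is illustrative):

```python
from sentence_transformers import SentenceTransformer, models

word_embedding = models.Transformer('xlm-roberta-base', max_seq_length=128)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
# append an explicit normalization module so the student also emits
# unit-length embeddings, mirroring the teacher's output
student = SentenceTransformer(modules=[word_embedding, pooling, models.Normalize()])
```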
Sadly, I was not yet able to investigate. I'm on vacation until Oct 15th and will have a look then. Does the script work with the paraphrase-* models as teacher?
Duplicate: https://github.com/UKPLab/sentence-transformers/issues/1068. Issues are piling up, huh?
Hi @nreimers, great work, I really appreciate it.
So I was trying to make those msmarco models bilingual (with make_multilingual.py) for research on a Vietnamese QA system.
I successfully distilled 2 msmarco-cos models into 2 **paraphrase-*** models of the same architecture.
No luck, however, with e.g. XLM-R learning from msmarco-bert-base-dot-v5.
It would be great if, say, XLM-R could learn from the best model, msmarco-bert-base-dot-v5;
I expect it could further improve the results of the experiment below.
--
training data and code: https://drive.google.com/drive/folders/11rQYSN3-OIpLdxPVPL_ZstT2hixpTbC-?usp=sharing
training output folder: https://drive.google.com/drive/folders/1lRxqCEpnFR_db-N-F_cgdGyHfdVHp7ss?usp=sharing
training data:
eval data:
details:
I experimented with max_seq_length=256 and 384, and train_max_sentence_length=520 and 800 accordingly, so that there are no overflowing tokens, which lead to bad training examples. The results don't vary significantly, maybe due to the small eval dataset.
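For context, the training setup follows the knowledge-distillation recipe from make_multilingual.py; a sketch with a placeholder data path (batch size and warmup steps are illustrative):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.datasets import ParallelSentencesDataset

teacher = SentenceTransformer('msmarco-bert-base-dot-v5')
student = SentenceTransformer('xlm-roberta-base')
student.max_seq_length = 256  # also tried 384

# teacher embeddings of the source sentences become regression targets
# for the student on both the source and the translated sentences
train_data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
train_data.load_data('parallel-sentences-en-vi.tsv',  # placeholder path
                     max_sentence_length=520)         # also tried 800

train_dataloader = DataLoader(train_data, shuffle=True, batch_size=64)
train_loss = losses.MSELoss(model=student)
student.fit(train_objectives=[(train_dataloader, train_loss)],
            epochs=1, warmup_steps=1000)
```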
| teacher | student | MRR@10 before | MRR@10 after | MSE*100 |
| --- | --- | --- | --- | --- |
| msmarco-MiniLM-L12-cos-v5 | paraphrase-m-minilm-v2 | 45 | 64 | <0.1 |
| same, norm layer removed | paraphrase-m-minilm-v2 | 45 | 56 | >3 |
| msmarco-distilbert-cos-v5 | paraphrase-xlm-r | 48 | - | |
| msmarco-distilbert-cos-v5 | xlm-r | - | - | |
| multi-qa-mpnet-cos-v1 | paraphrase-m-mpnet-v2 | 51 | 72 | <0.05 |
| **dot models** | | | | |
| multi-qa-minilm-dot-v1 | reimers/m-minilm-v2 | - | 21 | >5 |
| multi-qa-mpnet-dot-v1 | paraphrase-xlm-r | 48 | - | >5 |
| msmarco-bert-base-dot-v5 | paraphrase-xlm-r | 48 | - | >2 |
| msmarco-bert-base-dot-v5 | xlm-r | - | - | >2 |
(-) indicates a bad score, i.e. a failed experiment (for reference, BM25 reaches an MRR@10 of about 63).
I have tried the example "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation". I ran the example directly without changing any parameters, but the results look terrible and I don't know what the problem is. Here are the results in Colab. Can you give me some insight? Thanks!