ScottishFold007 opened this issue 2 years ago
I have the same problem. I tested it on the en-de and en-ko datasets, and the results are poor. Here is my Colab notebook; I ran it for about 12 hours. As you can see, the correlation does not improve during training, and every sentence similarity is 1.0.
When I changed the teacher model from 'all-distilroberta-v1' to 'bert-base-nli-stsb-mean-tokens', the results were much better. Currently I'm training an English-Japanese multilingual model, and it reaches about 80% translation accuracy and about 70% similarity on the STS test at the 7000th step of the first epoch. I don't know why the default teacher model doesn't work well, but changing the teacher seems to produce a much better student.
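For reference, swapping the teacher only requires loading a different model (a sketch; the corresponding variable name in make_multilingual.py may differ):

```python
from sentence_transformers import SentenceTransformer

# 'bert-base-nli-stsb-mean-tokens' is an older model that ends with mean
# pooling and, unlike the all-* models, has no final normalization layer
teacher_model = SentenceTransformer('bert-base-nli-stsb-mean-tokens')
```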
I will have a look on Monday. I think the reason could be the teacher model: the all-* models have a layer that normalizes embeddings to unit length.
Using the paraphrase-* models, or disabling this normalization layer for the all-* models, could help.
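A minimal sketch of one way to disable it, assuming the usual Transformer -> Pooling -> Normalize module layout of the all-* models:

```python
from sentence_transformers import SentenceTransformer, models

model = SentenceTransformer('all-distilroberta-v1')
print(model)  # prints the module list, e.g. Transformer -> Pooling -> Normalize

# Rebuild the model from the same modules, minus any Normalize layer
modules = [m for m in model if not isinstance(m, models.Normalize)]
model = SentenceTransformer(modules=modules)
```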
@nreimers did you have time to check whether the normalization layer is responsible for the poor results? I tried using one of the multi-qa models for multilingual knowledge distillation, and as far as I understand, they also have a normalization layer. How would I disable the normalization layer when loading the model with SentenceTransformer?
I tried it this way, but I'm still getting very poor results:
```python
from sentence_transformers import SentenceTransformer

model_name = 'multi-qa-mpnet-base-cos-v1'
model = SentenceTransformer(model_name)
# keep only the first two modules (Transformer + Pooling),
# dropping the trailing Normalize module
model = SentenceTransformer(modules=[model[0], model[1]])
model.save(save_dir)  # save_dir: my output directory
```
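To check whether the normalization layer is really gone, I look at the embedding norms (a quick sanity check, not part of the training script):

```python
import numpy as np

emb = model.encode(['a quick test sentence'])
print(np.linalg.norm(emb[0]))  # ~1.0 means a Normalize module is still active
```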
I also tried adding a normalization layer to my student model (xlm-roberta-base). The results remain poor: the translation accuracy falls during training and then stagnates at a very low level.
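For reference, this is roughly how such a student can be assembled (a sketch; max_seq_length is illustrative):

```python
from sentence_transformers import SentenceTransformer, models

word_embedding = models.Transformer('xlm-roberta-base', max_seq_length=128)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
# append an explicit normalization module so the student also emits
# unit-length embeddings, mirroring the teacher's output
student = SentenceTransformer(modules=[word_embedding, pooling, models.Normalize()])
```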
Sadly, I was not yet able to investigate. I'm on vacation until Oct 15th and will have a look then. Does the script work with the paraphrase-* models as teacher?
Duplicate: https://github.com/UKPLab/sentence-transformers/issues/1068. Issues are piling up, huh?
Hi @nreimers, great work, I really appreciate it.
So I was trying to make those msmarco models bilingual (with make_multilingual.py) for research on a Vietnamese QA system.
I successfully distilled 2 msmarco-cos models into 2 **paraphrase-*** models of the same architecture.
No luck, however, with e.g. XLM-R learning from msmarco-bert-base-dot-v5.
It would be great if, say, XLM-R could learn from the best model, msmarco-bert-base-dot-v5;
I expect it could further improve the results of the experiment below.
--
training data and code: https://drive.google.com/drive/folders/11rQYSN3-OIpLdxPVPL_ZstT2hixpTbC-?usp=sharing
training output folder: https://drive.google.com/drive/folders/1lRxqCEpnFR_db-N-F_cgdGyHfdVHp7ss?usp=sharing
training data:
eval data:
details:
I experimented with max_seq_length=256 and 384, and train_max_sentence_length=520 and 800 accordingly, so that there are no overflowing tokens, which lead to bad training examples. The results don't vary significantly, maybe due to the small eval dataset.
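For context, the training setup follows the knowledge-distillation recipe from make_multilingual.py; a sketch with a placeholder data path (batch size and warmup steps are illustrative):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.datasets import ParallelSentencesDataset

teacher = SentenceTransformer('msmarco-bert-base-dot-v5')
student = SentenceTransformer('xlm-roberta-base')
student.max_seq_length = 256  # also tried 384

# teacher embeddings of the source sentences become regression targets
# for the student on both the source and the translated sentences
train_data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
train_data.load_data('parallel-sentences-en-vi.tsv',  # placeholder path
                     max_sentence_length=520)         # also tried 800

train_dataloader = DataLoader(train_data, shuffle=True, batch_size=64)
train_loss = losses.MSELoss(model=student)
student.fit(train_objectives=[(train_dataloader, train_loss)],
            epochs=1, warmup_steps=1000)
```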
| teacher | student | MRR@10 before | MRR@10 after | MSE*100 |
| --- | --- | --- | --- | --- |
| msmarco-MiniLM-L12-cos-v5 | paraphrase-m-minilm-v2 | 45 | 64 | <0.1 |
| same, norm layer removed | paraphrase-m-minilm-v2 | 45 | 56 | >3 |
| msmarco-distilbert-cos-v5 | paraphrase-xlm-r | 48 | - | |
| msmarco-distilbert-cos-v5 | xlm-r | - | - | |
| multi-qa-mpnet-cos-v1 | paraphrase-m-mpnet-v2 | 51 | 72 | <0.05 |
| **dot models** | | | | |
| multi-qa-minilm-dot-v1 | reimers/m-minilm-v2 | - | 21 | >5 |
| multi-qa-mpnet-dot-v1 | paraphrase-xlm-r | 48 | - | >5 |
| msmarco-bert-base-dot-v5 | paraphrase-xlm-r | 48 | - | >2 |
| msmarco-bert-base-dot-v5 | xlm-r | - | - | >2 |
(-) indicates a bad score, i.e. a failed experiment (for reference, BM25 reaches an MRR@10 of about 63).
I have tried the example "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation". I ran the example directly without changing any parameters, but the results look terrible and I don't know what the problem is. Here are the results in Colab. Can you give me some insight? Thanks!