UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

distiluse-base-multilingual-cased has one more dense layer compared to the pool-only model. How is this dense layer added? #186

Open imchenmin opened 4 years ago

imchenmin commented 4 years ago

Hi, thank you for your great work. distiluse-base-multilingual-cased has one more dense layer compared to the pool-only model. How is this dense layer added? We are building a Chinese long-text processing system based on sentence-transformers, and I use the published distiluse-base-multilingual-cased model. It represents Chinese long texts (<510 words) well, but I could not find training code for this model in the examples directory. A few days ago you posted training code for a new multilingual model. We trained for 20 epochs with Chinese corpora in the same format (TED2013-zh-zh, xnli-zh-zh, sts2017-zh-zh translated with Google Translate), but only reached 70% accuracy on our dataset and 0.748 cosine-Pearson in the evaluation step. Also, xlm-roberta consumes more resources than DistilBERT.

nreimers commented 4 years ago

Hi @chenminken The distiluse-base-multilingual-cased model uses model distillation of the universal sentence encoder (USE): https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3

DistilBERT produces output vectors of size 768. To transform them to the 512 dimensions used by USE, a dense layer was added on top of the mean pooling.
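For illustration, here is a minimal sketch of how such a dense layer can be attached after pooling with the sentence-transformers modules API (the exact hyperparameters of the released model, e.g. the activation, are an assumption here):

```python
from torch import nn
from sentence_transformers import SentenceTransformer, models

# DistilBERT backbone: produces 768-dimensional token embeddings
word_embedding_model = models.Transformer('distilbert-base-multilingual-cased')

# Mean pooling over token embeddings -> 768-dimensional sentence embedding
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
)

# Dense layer projecting 768 -> 512 to match the mUSE teacher's embedding size
dense_model = models.Dense(
    in_features=pooling_model.get_sentence_embedding_dimension(),
    out_features=512,
    activation_function=nn.Tanh(),  # assumed activation; check the released model's config
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])
```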

I think the STS results of USE / DistilUSE are not comparable to other models, as USE had the STS data (and its Google Translate translations) in its training setup. So it is likely that USE / DistilUSE overfit on the STS data, which was already seen at training time.

Maybe you can translate the test set of the STS benchmark and see how DistilUSE / your model performs there. I think USE was not trained on the STS benchmark test set, but sadly the USE paper is not specific about which data sets were used, so it is hard to answer that question.
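If you go that route, a quick way to score a translated test set is the built-in similarity evaluator; the file name and column layout below are hypothetical, just to show the shape of the data:

```python
import csv
from sentence_transformers import SentenceTransformer, evaluation

model = SentenceTransformer('distiluse-base-multilingual-cased')

# Hypothetical TSV with translated STS-benchmark test pairs: sentence1 \t sentence2 \t gold score (0-5)
sentences1, sentences2, scores = [], [], []
with open('stsb-test-zh.tsv', encoding='utf-8') as f:
    for row in csv.reader(f, delimiter='\t'):
        sentences1.append(row[0])
        sentences2.append(row[1])
        scores.append(float(row[2]) / 5.0)  # normalize gold scores to [0, 1]

evaluator = evaluation.EmbeddingSimilarityEvaluator(sentences1, sentences2, scores)
print(evaluator(model))  # correlation between cosine similarities and gold scores
```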

Best Nils Reimers

connection-on-fiber-bundles commented 4 years ago

Hey @nreimers, first of all, thanks for the distiluse-base-multilingual-cased model and all the related work! I have some questions: Why do we want to do distillation with multilingual USE in the first place? I checked the USE model on tfhub, and its number of parameters seems to be around 85 million. Sure, that is more than DistilBERT, but not by a lot. Also, does the distilled student model have faster inference than the teacher model? Any experiment results? Finally, when training the distiluse-base-multilingual-cased model, did you use the training methods from the Sentence-BERT paper for some sort of pre-training? In other words, is the distiluse-base-multilingual-cased model related to the Sentence-BERT paper at all?

nreimers commented 4 years ago

Hi @connection-on-fiber-bundles

The DistilUSE model is just a toy example and has only minor advantages compared to using mUSE directly. One difference is that using tfhub and mUSE can be rather complicated and does not work on Windows, for example. I always found it painful to get mUSE up and running. DistilUSE uses PyTorch and can be used easily.

But the reason for the model is something else.

Today, our new paper was released: Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation

See also: https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/multilingual-models.md

There, we describe a method to extend existing sentence embedding methods to new (different) languages. Some code is already available: https://github.com/UKPLab/sentence-transformers/tree/master/examples/training_multilingual

More code & docs will be added over the next few days. The DistilUSE model uses the training process from that paper with teacher=mUSE and student=DistilBERT-multilingual.

The method in the paper allows extending any sentence embedding model to new languages. mUSE is currently limited to 16 languages, but with the provided method & code it can easily be extended to 100 or more languages. The method also works if you only have an English model: it can be converted so that it works for any language you have parallel training data for.
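As a rough illustration of that training process (the teacher model name and the parallel-data file below are placeholders, not the exact setup used for the released model):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Teacher: an existing (e.g. English) sentence embedding model -- placeholder name
teacher_model = SentenceTransformer('bert-base-nli-stsb-mean-tokens')

# Student: multilingual DistilBERT with mean pooling
word_embedding_model = models.Transformer('distilbert-base-multilingual-cased')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
student_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Parallel sentences (source \t translation per line); the student is trained so that
# both the source sentence and its translation map close to the teacher's source embedding
train_data = ParallelSentencesDataset(student_model=student_model, teacher_model=teacher_model)
train_data.load_data('parallel-sentences-en-zh.tsv.gz')  # hypothetical file

train_dataloader = DataLoader(train_data, shuffle=True, batch_size=32)
train_loss = losses.MSELoss(model=student_model)

student_model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=5, warmup_steps=1000)
```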

If mUSE works well for you use-case and the your language is covered, it makes more sense to use mUSE instead of DistilmUSE. However, for various cases, mUSE doesn't work that well or it might be that you require a sentence embedding model for a language that is not covered by mUSE.

Let me know if you have further questions.

Best Nils Reimers