UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

How to extract aligned embeddings for two languages with minimal model size? #634

Closed averkij closed 2 years ago

averkij commented 3 years ago

Hello, folks 👋

🎄 First of all, thank you for your work, and I wish you all the best in the upcoming year.

🌱 I'm solving a parallel text alignment task for a certain pair of languages and building a tool for it. I am using ST models ("distiluse-base-multilingual-cased" and "LaBSE"), and the results are quite good. However, it is challenging to deploy such a tool because of the size of the models.

❓ So, how can I extract from the existing models, or train, aligned embeddings for only two languages (Russian-Chinese in particular) as a new model with a smaller size (say, 50 MB)?

nreimers commented 3 years ago

Hi @averkij You can follow the example here on how to remove layers: https://www.sbert.net/examples/training/distillation/README.html
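For a BERT-style encoder that exposes `auto_model.encoder.layer`, the layer-removal step might look roughly like the sketch below (DistilBERT-based models expose `auto_model.transformer.layer` instead); the model name and the list of kept layers are placeholders, not values from this thread:

```python
# Rough sketch: drop transformer layers from a SentenceTransformer model.
# Assumes a BERT-style encoder (auto_model.encoder.layer); adjust the
# attribute path for other architectures.
import torch.nn as nn
from sentence_transformers import SentenceTransformer

student = SentenceTransformer("your-multilingual-model")  # placeholder name
auto_model = student[0].auto_model            # underlying Hugging Face model

layers_to_keep = [1, 4, 7, 10]                # example: keep 4 of 12 layers
auto_model.encoder.layer = nn.ModuleList(
    [layer for i, layer in enumerate(auto_model.encoder.layer) if i in layers_to_keep]
)
auto_model.config.num_hidden_layers = len(layers_to_keep)

# The reduced student should then be re-trained / distilled against the
# original model, as described in the linked distillation example.
student.save("reduced-model")
```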

The biggest issue is the embedding layer.

The multilingual models have quite large vocabularies, and each entry in the vocabulary has its own embedding. But if you only work with two languages, most of the entries in the vocabulary are not used.

You could identify which entries in the vocabulary you actually need for Russian and Chinese, then remove all others from the embedding layer. You might have to update the tokenizer too, so that tokens you deleted are mapped to UNK; otherwise you get an out-of-index error.

This change will not be that easy, as you need to dig deep into the transformers package, remove the correct rows in the embedding matrix, and change the tokenizer so that it works with your reduced vocabulary.
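To make the idea concrete, a rough sketch of shrinking the embedding matrix is given below; `corpus_sentences` and the model name are placeholders (not from this thread), and rebuilding the tokenizer, the tricky part mentioned above, is only indicated in a comment:

```python
# Rough sketch: keep only the embedding rows used by a Russian/Chinese corpus.
# `corpus_sentences` and the model name are placeholders.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-multilingual-cased"       # placeholder multilingual model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# 1. Find the vocabulary ids the target-language corpus actually uses
used_ids = set(tokenizer.all_special_ids)         # always keep [CLS], [SEP], [UNK], ...
for sentence in corpus_sentences:                 # placeholder: your RU/ZH text corpus
    used_ids.update(tokenizer.encode(sentence, add_special_tokens=False))
kept_ids = sorted(used_ids)

# 2. Copy only those rows into a smaller embedding matrix
old_emb = model.get_input_embeddings()
new_emb = torch.nn.Embedding(len(kept_ids), old_emb.embedding_dim)
new_emb.weight.data = old_emb.weight.data[kept_ids].clone()
model.set_input_embeddings(new_emb)
model.config.vocab_size = len(kept_ids)

# 3. Not shown: the tokenizer must be rebuilt so token strings map into the
#    new, smaller id space and deleted tokens fall back to [UNK]; otherwise
#    the model will raise out-of-index errors, as noted above.
```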

Another nice solution could be to use CharacterBERT (which should soon be integrated into Hugging Face Transformers): https://github.com/helboukkouri/character-bert

Here, the embedding layer is replaced by a character-based CNN. This makes the models much smaller.

averkij commented 2 years ago

Managed to shrink the embedding layer. The process is described in this article.