UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Best practice to domain adapt a Multi Lingual Model? #1503

Closed cm2435 closed 1 year ago

cm2435 commented 2 years ago

Hello!

I'm currently working to create a legal large language model that can produce text embeddings of legal text across a range of languages, primarily English, Spanish, French and German.

I'm trying to research best practice for this.

I was thinking of starting with either

  1. LegalBERT (a model family domain-adapted to legal text, https://arxiv.org/abs/2010.02559) and then following "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation" (https://arxiv.org/abs/2004.09813), as in the first sketch after this list,

or

  2. starting with a multilingual base model already finetuned for semantic similarity, such as 'distiluse-base-multilingual-cased', and doing unsupervised domain adaptation with TSDAE (https://arxiv.org/abs/2104.06979) on a multilingual legal corpus, as in the second sketch after this list,

or

  3. a paper that does multilingual domain adaptation, i.e. adds domain knowledge while preserving the cross-lingual alignment of the embedding space (I'm not familiar with one, but I would happily have one pointed out if anyone knows of one).
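For option 1, the knowledge-distillation recipe from that paper looks roughly like this with the sentence-transformers API. This is a minimal sketch, not a tested recipe: it assumes LegalBERT has already been turned into an English sentence-embedding model (the teacher path below is a placeholder for that), and `parallel-sentences-legal.tsv` stands in for your own tab-separated English/translation pairs.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Teacher: an English sentence-embedding model built on LegalBERT
# (hypothetical path; LegalBERT itself would first need sentence-level fine-tuning).
teacher_model = SentenceTransformer("path/to/legalbert-sentence-model")

# Student: a multilingual encoder that will learn to mimic the teacher's vectors.
word_embedding_model = models.Transformer("xlm-roberta-base", max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
student_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Parallel data: English<TAB>translation pairs (placeholder filename).
# The student is trained so that student(source) ~ student(translation) ~ teacher(source).
train_data = ParallelSentencesDataset(student_model=student_model, teacher_model=teacher_model)
train_data.load_data("parallel-sentences-legal.tsv")
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=64)
train_loss = losses.MSELoss(model=student_model)

student_model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=5,
    warmup_steps=1000,
    output_path="output/legal-multilingual-student",
)
```

Note that MSELoss compares teacher and student embeddings directly, so their dimensions must match; if they don't, a `models.Dense` projection on the student is the usual fix.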
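For option 2, TSDAE-style unsupervised adaptation looks roughly like the following. Again a minimal sketch: it starts from a multilingual BERT checkpoint rather than 'distiluse-base-multilingual-cased', because tying encoder and decoder weights needs an architecture with a causal-LM head (a DistilBERT-based model would instead need a separate `decoder_name_or_path`), and the sentence list is a placeholder for your unlabelled multilingual legal corpus.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import DenoisingAutoEncoderDataset

model_name = "bert-base-multilingual-cased"

# Encoder with CLS pooling, as used in the TSDAE paper.
word_embedding_model = models.Transformer(model_name)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), "cls")
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Unlabelled, in-domain sentences in all target languages (placeholder data).
train_sentences = [
    "The parties agree to the following terms ...",
    "Las partes acuerdan lo siguiente ...",
]

# The dataset adds noise (token deletion) to each sentence; the loss trains the
# encoder to produce an embedding from which a tied decoder can reconstruct it.
train_dataset = DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
train_loss = losses.DenoisingAutoEncoderLoss(
    model, decoder_name_or_path=model_name, tie_encoder_decoder=True
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    scheduler="constantlr",
    optimizer_params={"lr": 3e-5},
    weight_decay=0,
    show_progress_bar=True,
)
```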

My main concern is that if I take a model that is strong at legal understanding and change the objective (LegalBERT to multilingual), the model would undergo catastrophic forgetting of the legal domain knowledge, or, vice versa, that a multilingual model would lose its multilinguality as it gains domain knowledge. And if so, do I need to build a training scheme that optimizes both jointly? (A multi-task sketch follows.)
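On the joint-optimization idea: `fit()` in sentence-transformers accepts several train objectives and alternates batches between them, so one approach is to keep a cross-lingual distillation objective in the mix while TSDAE adapts the model to legal text. A hedged sketch reusing the placeholder names from the snippets above; whether this actually prevents forgetting would have to be checked on a held-out legal STS/retrieval set.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import (DenoisingAutoEncoderDataset,
                                             ParallelSentencesDataset)

# One multilingual student model shared by both objectives.
word_embedding_model = models.Transformer("xlm-roberta-base")
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
student_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Objective A: stay close to the teacher on parallel data (cross-lingual alignment).
teacher_model = SentenceTransformer("path/to/legalbert-sentence-model")  # placeholder
parallel_data = ParallelSentencesDataset(student_model=student_model, teacher_model=teacher_model)
parallel_data.load_data("parallel-sentences-legal.tsv")  # placeholder file
distill_dataloader = DataLoader(parallel_data, shuffle=True, batch_size=64)
distill_loss = losses.MSELoss(model=student_model)

# Objective B: adapt to the legal domain with TSDAE on unlabelled legal sentences.
legal_sentences = ["..."]  # placeholder in-domain corpus
tsdae_dataloader = DataLoader(DenoisingAutoEncoderDataset(legal_sentences),
                              batch_size=8, shuffle=True)
tsdae_loss = losses.DenoisingAutoEncoderLoss(student_model,
                                             decoder_name_or_path="xlm-roberta-base",
                                             tie_encoder_decoder=True)

# fit() draws one batch per objective per step (round-robin by default).
student_model.fit(
    train_objectives=[(distill_dataloader, distill_loss),
                      (tsdae_dataloader, tsdae_loss)],
    epochs=1,
    warmup_steps=1000,
    output_path="output/legal-multilingual-joint",
)
```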

Any help would be hugely appreciated, thanks for reading

wilfoderek commented 1 year ago

Any advances in your research, buddy?

cm2435 commented 1 year ago

@wilfoderek In the subdomain of legal models or for multilingual niche language models as a whole?