Hello! I'm currently working on creating a legal large language model that can produce text embeddings of legal text across a range of languages, primarily English, Spanish, French, and German. I'm trying to research best practice for this. I was thinking of starting with either:
Starting with a multilingual base model that has already been fine-tuned for semantic similarity, such as 'distiluse-base-multilingual-cased', and doing unsupervised domain adaptation using TSDAE (https://arxiv.org/abs/2104.06979) on a multilingual legal corpus,
or
Following a paper on multilingual domain adaptation that adds domain knowledge while preserving the cross-lingual alignment of words in different languages. (I'm not familiar with one, but I would happily have one pointed out if anyone knows of one.)
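For context on option one: TSDAE works by corrupting the input sentence (the paper's best noise is deleting roughly 60% of the tokens) and training the encoder so that a decoder can reconstruct the original sentence from the pooled embedding. A minimal sketch of just the deletion noise, assuming a simple whitespace tokenizer for illustration (the real implementation operates on subword tokens):

```python
import random

def delete_noise(text, del_ratio=0.6, seed=None):
    """TSDAE-style input corruption: randomly delete a fraction of tokens.

    The encoder must produce an embedding from which the original sentence
    can be reconstructed, which is what drives the domain adaptation.
    Uses whitespace tokenization purely for illustration.
    """
    rng = random.Random(seed)
    tokens = text.split()
    if not tokens:
        return text
    kept = [t for t in tokens if rng.random() > del_ratio]
    if not kept:  # always keep at least one token so the input is non-empty
        kept = [rng.choice(tokens)]
    return " ".join(kept)
```

In practice the sentence-transformers library ships this pipeline ready-made (DenoisingAutoEncoderDataset plus DenoisingAutoEncoderLoss), so you would not write the noise function yourself.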
My main concern is catastrophic forgetting. If I take a model that is strong at legal understanding and change the objective (e.g. adapting Legal-BERT to a multilingual objective), would the model forget its legal domain knowledge? Or, vice versa, would a multilingual model lose its multilinguality as it gains domain knowledge? And if so, do I need a training scheme that optimizes these jointly?
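One simple baseline for such a joint scheme is multi-task training that alternates gradient steps between the two objectives, similar in spirit to passing multiple train objectives to sentence-transformers' fit(). A sketch of just the batch scheduling, with hypothetical batch iterables standing in for the legal and multilingual dataloaders:

```python
from itertools import cycle

def round_robin_batches(legal_batches, multilingual_batches, steps):
    """Alternate steps between the two objectives so neither is forgotten:
    legal, multilingual, legal, multilingual, ...

    Yields (objective_name, batch); a real training loop would compute the
    corresponding loss for each batch and step the optimizer.
    """
    sources = cycle([
        ("legal", cycle(legal_batches)),
        ("multilingual", cycle(multilingual_batches)),
    ])
    for _ in range(steps):
        name, batch_iter = next(sources)
        yield name, next(batch_iter)
```

This only illustrates the scheduling idea; other remedies for forgetting (lower learning rates, layer freezing, replay of general-domain data) can be layered on top.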
Any help would be hugely appreciated. Thanks for reading!