cisnlp / simalign

Obtain Word Alignments using Pretrained Language Models (e.g., mBERT)
MIT License

Incorporate LaBSE as a model option #31

Closed kongyurui closed 1 year ago

kongyurui commented 2 years ago

I modified simalign to use LaBSE (or "pvl/labse_bert") as the underlying multilingual model for calculating embeddings. It showed better precision and recall on the alignments than either mBERT or XLM-RoBERTa, and I think it would be a useful additional option for simalign.

creolio commented 2 years ago

Which language pairs (and directions) did you test the embeddings on?

pdufter commented 2 years ago

That's a great suggestion, thanks for the pointer. It seems you have already modified the simalign code? If so, it would be great if you could create a PR. @masoudjs could review it and/or help integrate it.

kongyurui commented 2 years ago

After looking through the library, I realized I didn't need to modify any code. All I needed to do to use LaBSE was:

from simalign import SentenceAligner

myaligner = SentenceAligner(model="pvl/labse_bert", token_type="bpe", matching_methods="a", device="cuda")
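
For anyone who wants to try this, here is a minimal sketch of how an aligner built that way might be used on a tokenized sentence pair. The helper names `make_labse_aligner` and `aligned_word_pairs` are hypothetical and not part of simalign. The sketch assumes that `get_word_aligns` returns a dict of (source index, target index) pairs and that the result for `matching_methods="a"` is stored under the key `"inter"`; check the simalign README for your version if that key is missing.

```python
def make_labse_aligner(device="cpu"):
    """Build a SentenceAligner backed by LaBSE (hypothetical helper)."""
    # Imported lazily so this module loads even where simalign is not installed.
    from simalign import SentenceAligner

    return SentenceAligner(
        model="pvl/labse_bert",
        token_type="bpe",
        matching_methods="a",
        device=device,  # use "cuda" if a GPU is available
    )


def aligned_word_pairs(aligner, src_tokens, trg_tokens, method="inter"):
    """Return aligned (source word, target word) pairs (hypothetical helper).

    Assumes aligner.get_word_aligns returns a dict that maps a method name
    to a list of (source index, target index) tuples.
    """
    alignments = aligner.get_word_aligns(src_tokens, trg_tokens)
    return [(src_tokens[i], trg_tokens[j]) for i, j in alignments[method]]
```

To use it, build the aligner once with `make_labse_aligner(device="cuda")` and then call `aligned_word_pairs(aligner, ["This", "is", "a", "test", "."], ["Das", "ist", "ein", "Test", "."])`. The model weights are downloaded on first use, so the first call can be slow.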