facebookresearch / stopes

A library for preparing data for machine translation research (monolingual preprocessing, bitext mining, etc.) built by the FAIR NLLB team.
https://facebookresearch.github.io/stopes/
MIT License
247 stars 37 forks

Using `stopes` for filtering instead of mining #23

Closed. ZenBel closed this issue 1 year ago

ZenBel commented 1 year ago

Hello,

I have a dataset of English-Italian sentence pairs and I would like to retain only those with a margin score higher than a threshold. What part of the code would I have to change to make `stopes` filter sentence pairs instead of mining them?

Thank you,

Z

Mortimerp9 commented 1 year ago

Hello @ZenBel, sorry for the delay in responding.

The margin score is computed as described in this paper: https://arxiv.org/pdf/1911.04944.pdf

The margin between two candidate sentences x and y is defined as the ratio between the cosine similarity of the two sentence embeddings and the average cosine similarity of each sentence to its nearest neighbors in the other language, taken in both directions.
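
To make that definition concrete, the ratio margin can be written as below. This is a paraphrase of the paper's notation and should be checked against it: cos is cosine similarity, NN_k(x) is the set of the k nearest neighbors of x among sentences in the other language, and k is a hyperparameter.

$$
\mathrm{margin}(x, y) = \frac{\cos(x, y)}{\displaystyle\sum_{z \in \mathrm{NN}_k(x)} \frac{\cos(x, z)}{2k} + \sum_{z \in \mathrm{NN}_k(y)} \frac{\cos(y, z)}{2k}}
$$

The denominator is the mean of the two average neighbor similarities, so a pair scores well above 1 only when x and y are closer to each other than to their other neighbors.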

This means that you need a set of neighbours to compute the margin, and not just a pair of sentences. For filtering as you describe, you would probably not have this, as you only have left/right pairs. In this case, you can still do some filtering based on the LASER embeddings, but not using the margin. Most probably you would want to just compute the cosine similarity between the embeddings of the two sentences in each pair and threshold on that. However, if you have enough data, you could find neighbours within your bitext and compute the margin from those.
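
Both options can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the `stopes` API: it assumes the sentence embeddings have already been computed (for example with LASER) and stored as two aligned arrays of shape `(n_pairs, dim)`. The function names, the value of `k`, and the thresholds are all hypothetical.

```python
import numpy as np


def _normalize(emb: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm so dot products equal cosine similarity."""
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)


def cosine_scores(src_emb: np.ndarray, tgt_emb: np.ndarray) -> np.ndarray:
    """Cosine similarity between the i-th source and i-th target embedding."""
    return np.sum(_normalize(src_emb) * _normalize(tgt_emb), axis=1)


def margin_scores(src_emb: np.ndarray, tgt_emb: np.ndarray, k: int = 4) -> np.ndarray:
    """Ratio margin for each aligned pair, using neighbours found inside the bitext.

    Assumption: the other side of the bitext is the only neighbour pool, so the
    scores are only meaningful when the bitext is reasonably large.
    """
    src, tgt = _normalize(src_emb), _normalize(tgt_emb)
    sim = src @ tgt.T                      # sim[i, j] = cos(src_i, tgt_j)
    k = min(k, sim.shape[0])
    # Average similarity of each source sentence to its k nearest targets.
    src_nn = np.sort(sim, axis=1)[:, -k:].mean(axis=1)
    # Average similarity of each target sentence to its k nearest sources.
    tgt_nn = np.sort(sim, axis=0)[-k:, :].mean(axis=0)
    # cos(x, y) divided by the mean of the two neighbourhood averages.
    return np.diag(sim) / ((src_nn + tgt_nn) / 2)


def filter_pairs(pairs, scores, threshold):
    """Keep the pairs whose score is at least `threshold`."""
    return [pair for pair, score in zip(pairs, scores) if score >= threshold]


# Toy data standing in for real embeddings: the last pair is deliberately mismatched.
toy_src = np.eye(4)
toy_tgt = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 0], [1.0, 0, 0, 0]])
toy_pairs = [("en 0", "it 0"), ("en 1", "it 1"), ("en 2", "it 2"), ("en 3", "it 3")]

kept_by_cosine = filter_pairs(toy_pairs, cosine_scores(toy_src, toy_tgt), threshold=0.5)
kept_by_margin = filter_pairs(toy_pairs, margin_scores(toy_src, toy_tgt, k=2), threshold=1.0)
```

The dense `n × n` similarity matrix is fine for a small dataset. For a large bitext you would want an approximate nearest-neighbour index instead. Neither threshold value here is a recommendation, so tune it on a sample of your own data.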

There is code here: https://github.com/facebookresearch/stopes/blob/main/stopes/modules/bitext/laser_scorer.py that does something pretty close to what you want.