Using `stopes` for filtering instead of mining

Hello @ZenBel, sorry for the delay in responding.

The margin score is computed as described in this paper: https://arxiv.org/pdf/1911.04944.pdf

The margin between two candidate sentences x and y is defined as the ratio between the cosine similarity of the two sentence embeddings and the average cosine similarity of each sentence to its k nearest neighbors, taken in both directions.
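To make the definition concrete, here is a minimal NumPy sketch of the ratio margin. It is not the `stopes` implementation: the function names, the brute-force neighbour search, and the default `k=4` are assumptions for illustration, and the embeddings are assumed to be precomputed arrays with one row per sentence.

```python
import numpy as np


def normalize(emb):
    """L2-normalize rows so that a dot product equals cosine similarity."""
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)


def ratio_margin(x_emb, y_emb, k=4):
    """Ratio margin for every candidate pair (x_i, y_j).

    margin(x, y) = cos(x, y) / ((avg_x + avg_y) / 2)
    where avg_x is the mean cosine of x to its k nearest neighbours on the
    y side, and avg_y is the mean cosine of y to its k nearest neighbours
    on the x side.
    """
    x = normalize(np.asarray(x_emb, dtype=np.float64))
    y = normalize(np.asarray(y_emb, dtype=np.float64))
    sim = x @ y.T  # shape (n_x, n_y), all pairwise cosine similarities

    # k cannot exceed the number of available neighbours on either side.
    k = min(k, sim.shape[0], sim.shape[1])

    # Mean similarity to the k nearest neighbours, in both directions.
    x_avg = np.sort(sim, axis=1)[:, -k:].mean(axis=1)  # shape (n_x,)
    y_avg = np.sort(sim, axis=0)[-k:, :].mean(axis=0)  # shape (n_y,)

    denom = (x_avg[:, None] + y_avg[None, :]) / 2.0
    return sim / denom
```

The brute-force similarity matrix only works for small sets; at scale you would need an approximate nearest-neighbour index instead.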

This means that you need a set of neighbours to compute the margin, not just a pair of sentences. For filtering as you describe, you probably won't have this, since you only have left/right pairs. In that case, you can still filter based on the LASER embeddings, but not with the margin. Most likely you would just compute a cosine score between the embeddings of the two sentences in each pair. However, if you have enough data, you could find neighbours within your bitext.
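As a rough sketch of the pairwise cosine option, assuming you have already embedded both sides of your bitext into two aligned arrays (row i of each array belongs to pair i): the function names and the `threshold=0.8` cutoff are hypothetical, and a sensible threshold would have to be tuned on your own data.

```python
import numpy as np


def pairwise_cosine(src_emb, tgt_emb):
    """Cosine similarity between row i of src_emb and row i of tgt_emb."""
    src = np.asarray(src_emb, dtype=np.float64)
    tgt = np.asarray(tgt_emb, dtype=np.float64)
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    return np.sum(src * tgt, axis=1)


def filter_pairs(pairs, src_emb, tgt_emb, threshold=0.8):
    """Keep the pairs whose cosine score reaches the (hypothetical) threshold."""
    scores = pairwise_cosine(src_emb, tgt_emb)
    return [(pair, float(s)) for pair, s in zip(pairs, scores) if s >= threshold]


# Toy usage with made-up 2-dimensional "embeddings".
pairs = [("hello", "bonjour"), ("cat", "voiture")]
src = np.array([[1.0, 0.0], [0.0, 1.0]])
tgt = np.array([[2.0, 0.0], [1.0, 0.0]])
kept = filter_pairs(pairs, src, tgt)
```

Unlike the margin, this score is not normalized by the neighbourhood of each sentence, so a single global threshold may behave differently across languages or domains.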

There is code here: https://github.com/facebookresearch/stopes/blob/main/stopes/modules/bitext/laser_scorer.py that does something pretty close to what you want.
