Closed ZenBel closed 1 year ago
Hello @ZenBel, sorry for the delay in responding.
The margin score is computed as described in this paper: https://arxiv.org/pdf/1911.04944.pdf
The margin between two candidate sentences x and y is defined as the ratio between the cosine similarity of the two sentence embeddings and the average cosine similarity of their nearest neighbors in both directions.
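For reference, here is that definition written out. The notation is my own shorthand and should be checked against the paper: $k$ is the number of neighbours and $\mathrm{NN}_k(x)$ is the set of the $k$ nearest neighbours of $x$ in the other language. The numerator is the cosine similarity of the pair itself.

$$
\mathrm{margin}(x, y) = \frac{\cos(x, y)}{\displaystyle \sum_{z \in \mathrm{NN}_k(x)} \frac{\cos(x, z)}{2k} + \sum_{z \in \mathrm{NN}_k(y)} \frac{\cos(y, z)}{2k}}
$$

The denominator is simply the mean of the two average neighbour similarities, so a pair scores above 1 when it is more similar than its typical neighbours.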
This means that you need a set of neighbours to compute the margin, and not just a pair of sentences. For filtering as you describe, you would probably not have this, as you only have left/right pairs. In this case, you can still do some filtering based on the LASER embeddings, but not using the margin. Most probably you would want to just compute a cosine score between the embeddings of the two sentences in each pair. However, if you have enough data, you could find neighbours within your bitext.
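To make the two options concrete, here is a rough sketch in plain NumPy. It is not the stopes implementation. It assumes you have already embedded both sides with LASER into two arrays of shape `(n_pairs, dim)` where row `i` of each array belongs to pair `i`; the function names and the default `k=4` are made up for illustration. The margin variant looks for neighbours inside your own bitext and builds a full `n x n` similarity matrix, so it only makes sense for small datasets (for larger ones you would want an approximate nearest-neighbour index instead).

```python
import numpy as np


def _normalize(emb: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)


def cosine_scores(src_emb: np.ndarray, tgt_emb: np.ndarray) -> np.ndarray:
    """Cosine similarity between row i of src_emb and row i of tgt_emb."""
    src, tgt = _normalize(src_emb), _normalize(tgt_emb)
    return np.sum(src * tgt, axis=1)


def ratio_margin_scores(src_emb: np.ndarray, tgt_emb: np.ndarray, k: int = 4) -> np.ndarray:
    """Ratio margin for each aligned pair, with neighbours taken from the bitext itself."""
    src, tgt = _normalize(src_emb), _normalize(tgt_emb)
    sim = src @ tgt.T                  # sim[i, j] = cos(src_i, tgt_j)
    k = min(k, sim.shape[0])           # cannot have more neighbours than sentences
    # Mean similarity of each source sentence to its k nearest targets.
    fwd = np.sort(sim, axis=1)[:, -k:].mean(axis=1)
    # Mean similarity of each target sentence to its k nearest sources.
    bwd = np.sort(sim, axis=0)[-k:, :].mean(axis=0)
    # Pair similarity divided by the average of the two neighbourhood means.
    return np.diag(sim) / ((fwd + bwd) / 2)


def filter_pairs(pairs, scores, threshold):
    """Keep only the pairs whose score is strictly above the threshold."""
    return [pair for pair, score in zip(pairs, scores) if score > threshold]
```

You would then call something like `filter_pairs(pairs, cosine_scores(src_emb, tgt_emb), 0.8)`; the threshold is something to tune on your own data, and values are not comparable between the cosine and margin scores.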
There is code here: https://github.com/facebookresearch/stopes/blob/main/stopes/modules/bitext/laser_scorer.py that does something pretty close to what you want.
Hello,
I have a dataset of English-Italian sentence pairs and I would like to retain only those with a margin score higher than a threshold. What part of the code would I have to change to make stopes filter sentence pairs instead of mining them?
Thank you,
Z