Helsinki-NLP / OpusFilter

OpusFilter - Parallel corpus processing toolkit
MIT License
Language id filter comparison #7

Closed: yvesscherrer closed this issue 3 years ago

yvesscherrer commented 3 years ago

In the language id filter, the comparison should be >= instead of >, I suppose. Otherwise, if I set a threshold of 0 and the score is 0, the sentence is removed:

    return all(conf > threshold for conf, threshold in zip(score, self.thresholds))

svirpioj commented 3 years ago

Confidence zero is used in the cases where the identified language differs from the provided language, so those should be removed. If you don't want anything to be removed, but still want to use the filter, a negative threshold could be used.

I guess it may be that the language identifier picks up the correct language but gives zero confidence to it, and you'd like to keep those. This is a border case for which I can't currently see a good solution.
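
To make the behaviour discussed above concrete, here is a minimal standalone sketch of the strict comparison. It is not the actual OpusFilter class: the function name `accept` and the example scores are made up for illustration, and only the `all(...)` expression is taken from the issue text.

```python
def accept(scores, thresholds):
    """Keep a segment pair only if every confidence strictly exceeds its threshold."""
    return all(conf > threshold for conf, threshold in zip(scores, thresholds))


# Per the comment above, confidence 0 stands for "identified language
# differs from the provided language". The scores here are hypothetical.
mismatch_on_first_side = [0.0, 0.9]

# Threshold 0 on both sides: 0.0 > 0 is False, so the pair is removed.
removed = not accept(mismatch_on_first_side, [0, 0])

# Negative threshold on the first side: 0.0 > -1 is True, so that side
# can no longer cause a removal.
kept = accept(mismatch_on_first_side, [-1, 0])

print(removed, kept)
```

With `>=` instead of `>`, a threshold of 0 would keep mismatched segments too, which is presumably why the strict comparison was left in place.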

yvesscherrer commented 3 years ago

Ok, I see the point. My use case was that I wanted to apply an id filter only on one side of the parallel corpus. In my first try, I chose "English" with threshold 0 for the other side and was surprised to see a lot filtered out. Changing the threshold to -1 did the trick. I'm fine with leaving it this way, but it might be a good idea to document the use case of one-side filtering.
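
The one-sided use case can be sketched roughly as follows. This is only an illustration under stated assumptions: the sentence pairs, the confidence values, and the helper `filter_pairs` are hypothetical and do not come from OpusFilter itself; only the strict `>` comparison mirrors the code quoted in this issue.

```python
# Hypothetical (source, target) pairs with made-up language-id confidences.
# A confidence of 0.0 marks a side whose identified language differed
# from the expected one.
scored_pairs = [
    (("Hyvää huomenta", "Good morning"), (0.95, 0.0)),
    (("Kiitos", "Thank you"), (0.80, 0.99)),
    (("Bonjour", "Hello"), (0.0, 0.97)),
]


def filter_pairs(scored_pairs, thresholds):
    """Return the pairs whose confidences all strictly exceed the thresholds."""
    return [
        pair
        for pair, scores in scored_pairs
        if all(conf > thr for conf, thr in zip(scores, thresholds))
    ]


# Threshold 0 on both sides: a zero confidence on either side removes the pair.
both_sides = filter_pairs(scored_pairs, thresholds=(0, 0))

# Threshold -1 on the target side: only the source side is effectively filtered.
source_only = filter_pairs(scored_pairs, thresholds=(0, -1))

print(len(both_sides), len(source_only))
```

In this toy data, the first pair survives only in the one-sided setup, because its zero target-side confidence still exceeds -1.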

svirpioj commented 3 years ago

Added mention of the negative threshold trick in https://github.com/Helsinki-NLP/OpusFilter/commit/0407a5d86606e217f223bf781438a8c687891c31