yvesscherrer closed 3 years ago
Confidence zero is used in the cases where the identified language differs from the provided language, so those should be removed. If you don't want anything to be removed, but still want to use the filter, a negative threshold could be used.
I guess it may be that the language identifier picks up the correct language but gives zero confidence to it, and you'd like to keep those. This is a border case for which I can't currently see a good solution.
Ok, I see the point. My use case was that I wanted to apply an id filter only on one side of the parallel corpus. In my first try, I chose "English" with threshold 0 for the other side and was surprised to see a lot filtered out. Changing the threshold to -1 did the trick. I'm fine with leaving it this way, but it might be a good idea to document the use case of one-side filtering.
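For illustration, here is a minimal sketch of the one-sided use case. It assumes only the strict comparison quoted in this issue; `accept` and the confidence values are hypothetical stand-ins, not OpusFilter's actual API.

```python
def accept(scores, thresholds):
    """Keep a segment pair only if every score is strictly above its threshold."""
    return all(conf > threshold for conf, threshold in zip(scores, thresholds))


# Hypothetical confidences for the (source, target) sides of one pair.
# The target side gets confidence 0, e.g. because the identified language
# differs from the provided one.
scores = (0.93, 0.0)

# Threshold 0 on the target side still filters the pair out (0.0 > 0 is False).
strict_both = accept(scores, (0.5, 0))

# A negative threshold effectively disables the check on that side.
one_sided = accept(scores, (0.5, -1))

print(strict_both, one_sided)
```

With these made-up values, only the second call keeps the pair, which matches the behaviour described above when the threshold was changed to -1.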
Added mention of the negative threshold trick in https://github.com/Helsinki-NLP/OpusFilter/commit/0407a5d86606e217f223bf781438a8c687891c31
In the language id filter, the comparison should be `>=` instead of `>`, I suppose. Otherwise, if I set a threshold of 0 and the score is 0, the sentence is removed.

    return all(conf > threshold for conf, threshold in zip(score, self.thresholds))
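To make the difference concrete, the following sketch compares the strict and non-strict variants on a zero score. The helper `passes` and its `compare` parameter are assumptions made for this example and do not exist in the library; only the `all(... zip(...))` expression is taken from the line quoted above.

```python
import operator


def passes(scores, thresholds, compare=operator.gt):
    """Return True if every score satisfies compare(score, threshold)."""
    return all(compare(conf, threshold) for conf, threshold in zip(scores, thresholds))


zero_score = [0.0]

# Current behaviour (strict >): a score of 0 fails a threshold of 0.
removed_with_gt = not passes(zero_score, [0])

# Proposed behaviour (>=): the same score would pass.
kept_with_ge = passes(zero_score, [0], compare=operator.ge)

print(removed_with_gt, kept_with_ge)
```

Both flags come out true here: the sentence is removed under `>` and kept under `>=`.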