NorskRegnesentral / skweak

skweak: A software toolkit for weak supervision applied to NLP tasks
MIT License

Missed Annotations #31

Closed: david-waterworth closed this issue 2 years ago

david-waterworth commented 2 years ago

The base annotator filters each annotation with _is_allowed_span: https://github.com/NorskRegnesentral/skweak/blob/fba1037399121d5468187aac746f52cb57bc8d31/skweak/base.py#L88

However, implementations such as TokenConstraintAnnotator perform additional filtering and only yield the longest span. As a result, when the longest span violates _is_allowed_span but a shorter, overlapping span is valid, the shorter span is never considered.
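To make the failure mode concrete, here is a minimal, self-contained sketch that does not use skweak itself. Spans are plain (start, end) token offsets, and the names `overlaps`, `keep_longest` and `longest_then_filter` are hypothetical stand-ins for the annotator's "longest span only" step followed by the base class check, not the library's actual code.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True if the two half-open intervals share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def keep_longest(candidates: List[Span]) -> List[Span]:
    """Greedy longest-first selection of mutually non-overlapping spans."""
    kept: List[Span] = []
    for span in sorted(candidates, key=lambda s: s[1] - s[0], reverse=True):
        if not any(overlaps(span, k) for k in kept):
            kept.append(span)
    return sorted(kept)


def longest_then_filter(candidates: List[Span], blocked: List[Span]) -> List[Span]:
    """Pick the longest spans first, and only afterwards drop disallowed ones.

    `blocked` plays the role of spans from incompatible sources (an assumption
    about what the validity check compares against).
    """
    longest = keep_longest(candidates)
    return [s for s in longest if not any(overlaps(s, b) for b in blocked)]


candidates = [(0, 4), (0, 2)]  # a long span and a shorter overlapping one
blocked = [(3, 5)]             # conflicts with (0, 4) but not with (0, 2)
missed = longest_then_filter(candidates, blocked)
```

In this toy case `missed` is empty: (0, 2) is discarded in favour of (0, 4) before the validity check runs, and (0, 4) is then rejected.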

I think the logic should really be to return the longest valid spans, which means _is_allowed_span needs to be called in the find_spans method rather than in __call__ of the base class.

A workaround seems to be to add the name of the annotator itself to incompatible_sources, and then yield the candidate spans in descending order of length. That way it will return spans that satisfy both constraints.
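The idea behind both the proposed fix and the workaround can be sketched without skweak: visit candidates longest first, and check each one against the incompatible spans and against the spans already accepted. The function `longest_valid_spans` and the `blocked` list below are hypothetical illustrations, not part of the library; treating already accepted spans as blockers is meant to mimic what listing the annotator's own name in incompatible_sources would achieve.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True if the two half-open intervals share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def longest_valid_spans(candidates: List[Span], blocked: List[Span]) -> List[Span]:
    """Return the longest spans that pass the validity check.

    Candidates are visited in descending order of length (ties broken by start
    offset). A candidate is accepted only if it overlaps neither a blocked span
    nor a span accepted earlier, so a shorter valid span can still be chosen
    when the longer one is rejected.
    """
    accepted: List[Span] = []
    ordered = sorted(candidates, key=lambda s: (-(s[1] - s[0]), s[0]))
    for span in ordered:
        if any(overlaps(span, other) for other in blocked + accepted):
            continue
        accepted.append(span)
    return sorted(accepted)


candidates = [(0, 4), (0, 2)]  # a long span and a shorter overlapping one
blocked = [(3, 5)]             # conflicts with (0, 4) but not with (0, 2)
recovered = longest_valid_spans(candidates, blocked)
```

Here `recovered` contains (0, 2): the longer span is rejected by the validity check, and the shorter span is then free to be accepted.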

plison commented 2 years ago

OK, thanks (and sorry for the delayed answer...). It should be fixed now; let me know if you still encounter problems.