NorskRegnesentral / skweak

skweak: A software toolkit for weak supervision applied to NLP tasks
MIT License

Cannot detect values? #21

Closed Fati-Hei closed 3 years ago

plison commented 3 years ago

Could you give a concrete example with some simple sentences we could try out? The code you provide looks fine a priori (apart from the fact that you also need to create a FunctionAnnotator for your st_detector function and run it on your documents).

plison commented 3 years ago

But as far as I can see, this doesn't seem to be a problem with skweak, but with the functions standards_detector and st_detector that you implemented.

For instance, the st_detector function relies on "NS-EN" and the numbers that come after it being separate tokens, which means it won't work on phrases such as "NS-EN12845". Your function is also limited to spans of two tokens (since you only check whether the current token starts with a digit), so it's not surprising that it doesn't recognise the full phrase NS-EN 12845 2020.
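To make the two limitations concrete, here is a minimal sketch of a detector that handles both the fused form and several consecutive number tokens. It is not your original code: it works on a plain list of token strings rather than a spaCy Doc so that it runs on its own, and the names `standard_spans`, `PREFIXES` and the label `"STANDARD"` are made up for illustration. It yields (start, end, label) tuples, which is the shape a skweak labelling function is expected to produce.

```python
# Hypothetical prefix list; the thread only mentions "NS-EN".
PREFIXES = ("NS-EN",)


def standard_spans(tokens, label="STANDARD"):
    """Yield (start, end, label) for a prefix followed by number tokens.

    `tokens` is a list of strings; `end` is exclusive.
    """
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        prefix = next((p for p in PREFIXES if tok.startswith(p)), None)
        if prefix is None:
            i += 1
            continue

        # Whatever follows the prefix inside the same token, e.g. "12845"
        # in "NS-EN12845", or "" when the token is just "NS-EN".
        rest = tok[len(prefix):]
        if rest and not rest[0].isdigit():
            i += 1
            continue

        # Extend the span over every consecutive token that starts with
        # a digit, instead of stopping after the first one.
        end = i + 1
        while end < len(tokens) and tokens[end][:1].isdigit():
            end += 1

        # Only report a match if at least one number was found.
        if rest or end > i + 1:
            yield i, end, label
        i = end


if __name__ == "__main__":
    print(list(standard_spans("See NS-EN 12845 2020 for details".split())))
    print(list(standard_spans("NS-EN12845 applies here".split())))
```

The inner `while` loop is what lets the span grow past two tokens. To use the same logic in skweak you would read `tok.text` from the spaCy tokens and wrap the function in a FunctionAnnotator as mentioned above.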

plison commented 3 years ago

Well, it's simply that the loop you have written in your function does not properly handle two consecutive tokens with numerical values.