aymara / lima

The Libre Multilingual Analyzer, a Natural Language Processing (NLP) C++ toolkit.
http://aymara.github.io/lima/

Wrong entity string output by the BratDumper #121

Closed kleag closed 3 years ago

kleag commented 3 years ago

I have entities that are CVEs (Common Vulnerabilities and Exposures), which look like this: CVE-2018-5391.

I wrote this rule to recognize them:

using modex SpecificEntities-modex.xml,Decoder-modex.xml
using groups DateTime,Numex,Decoder
set defaultAction=>CreateSpecificEntity()

CVE::- <DATE> - <NUMBER>:CVE:

It works: the entities are recognized in the right place. But the BratDumper writes a line that looks like this:

T5 Decoder_CVE 209 222 CVE - 2012 - 6638

with spaces between the tokens. As a result, brat can't match the mention with the string in the text. Is this a bug in BratDumper or an error in my rule?

kleag commented 3 years ago

It turned out not to be a LIMA bug, but an error in my script that converts the CoNLL-U format to brat.
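
For anyone hitting the same symptom, here is a minimal sketch of one way a converter can avoid it. The actual script is not shown in this thread, so the function name, the sample text, and the token list below are all hypothetical. The idea is to take the mention string by slicing the source text with the character offsets, instead of joining the tokens with spaces, so that the string in the brat line has exactly the length of the span.

```python
def brat_text_bound_line(ann_id, label, start, end, text):
    """Build a brat text-bound annotation line (hypothetical helper).

    The mention is sliced from the source text, so it always matches
    the characters covered by the [start, end) offsets.
    """
    mention = text[start:end]
    # brat standoff layout: ID <TAB> TYPE START END <TAB> TEXT
    return f"{ann_id}\t{label} {start} {end}\t{mention}"


# Hypothetical sample text, not taken from the issue.
text = "Fixed CVE-2012-6638 in the kernel."
start = text.index("CVE-2012-6638")
end = start + len("CVE-2012-6638")

# What a naive conversion might do: join the tokens with spaces.
# The result is longer than the span, so brat cannot match it.
tokens = ["CVE", "-", "2012", "-", "6638"]
joined = " ".join(tokens)

line = brat_text_bound_line("T1", "Decoder_CVE", start, end, text)
print(line)
```

In the line quoted above, the offsets 209 and 222 span 13 characters, which is the length of CVE-2012-6638 without spaces, so slicing by offsets would give the string brat expects.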