Open iacopy opened 4 years ago
Hmm, thanks @iacopy! Most of these look like tokenization errors, which then lead to misclassification. Some of them also look like reasonable entities to me. If you can consistently recognise an issue with the tokenization, you can add exceptions to the spaCy tokenizer, or re-tokenize after the fact to fix them.
Yeah, I remember I had some code in the tokenizer to deal with parentheses a bit better, but at some point spaCy changed from the `regex` package to the `re` package, and that code required variable-width lookbehinds, which `re` does not support, so it was commented out. Not sure that's the entirety of the problem, but given how many of these have unbalanced parens, I think it is part of it.
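For anyone unfamiliar with the lookbehind limitation, here is a minimal sketch using only the standard library. The pattern and the sample text are made up for illustration and are not the commented-out tokenizer code; the capturing-group rewrite is just one possible workaround, not necessarily what the tokenizer should do.

```python
import re

# Hypothetical pattern: a ")" preceded by "(" plus one or more word characters.
# The lookbehind has variable width (\w+), so the stdlib `re` module refuses
# to compile it.
try:
    re.compile(r"(?<=\(\w+)\)")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# One possible workaround: capture the prefix in a group instead of asserting
# it in a lookbehind, then read the position of the second group.
text = "IL-2 (interleukin) receptor"  # made-up example text
pattern = re.compile(r"(\(\w+)(\))")
match = pattern.search(text)
close_paren_index = match.start(2)
```

The trade-off is that the capturing-group version consumes the prefix as part of the match, so it is not a drop-in replacement wherever the lookbehind was used to keep the prefix out of the matched span.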
Hi, I'm just reporting problematic named entities I found using `en_core_sci_sm`, to help improve the model. Most of them contain unbalanced brackets.
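In case it helps with triage, a rough sketch of how entity strings with unbalanced brackets could be flagged automatically. The function name and the sample strings are hypothetical and are not taken from the actual model output.

```python
# Map each closing bracket to its opening counterpart.
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def has_balanced_brackets(text: str) -> bool:
    """Return True if every bracket in `text` is properly opened and closed."""
    stack = []
    for ch in text:
        if ch in OPENERS:
            stack.append(ch)
        elif ch in PAIRS:
            # A closer with no opener, or with the wrong opener, is unbalanced.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    # Any opener left on the stack was never closed.
    return not stack


# Made-up entity strings, for illustration only.
candidates = ["interleukin-2 (IL-2", "(IL-2) receptor", "TNF-alpha]", "p53"]
suspicious = [c for c in candidates if not has_balanced_brackets(c)]
```

Entity texts collected from `doc.ents` could be passed through the same check to list the spans worth inspecting.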