clamsproject / aapb-evaluations

Collection of evaluation codebases
Apache License 2.0

More improvements to NEL evaluation #30

Open wricketts opened 10 months ago

wricketts commented 10 months ago

Because

Latest metrics in the 20230824 evaluation are pretty low. This could be due to several reasons:

It could be insightful to add a more fine-grained evaluation for each annotation property, specifically by computing precision, recall, and F1 for (some options)--

If metrics are particularly low for one of these properties compared to the others, that would show where the app could be improved.
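A minimal sketch of what per-property scoring could look like. This is not the evaluation code in this repo; the entity representation (a `span` key plus property names like `label` and `grounding`) and the matching criterion (exact span + property value match) are assumptions for illustration:

```python
def prf(gold, pred):
    """Precision, recall, F1 from sets of gold and predicted items."""
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def per_property_scores(gold_entities, pred_entities, properties):
    """Score each annotation property separately.

    Entities are dicts with a 'span' key plus annotation properties
    (property names here are hypothetical). For a given property, a
    predicted entity counts as correct only if both its span and that
    property's value match a gold entity, so each property gets its
    own P/R/F1 triple.
    """
    scores = {}
    for prop in properties:
        gold_set = {(e["span"], e.get(prop)) for e in gold_entities}
        pred_set = {(e["span"], e.get(prop)) for e in pred_entities}
        scores[prop] = prf(gold_set, pred_set)
    return scores
```

For example, an entity whose span and type label are correct but whose grounding is wrong would score 1.0 F1 on the `label` property and 0.0 on `grounding`, making it easy to see which property drags the aggregate metric down.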

Done when

More fine-grained evaluation is implemented (or we decide it's not necessary).

Additional context

No response

keighrim commented 6 months ago

TIL about this: https://www.semantic-web-journal.net/system/files/swj1671.pdf Maybe we should take a closer look at this framework and consider adopting the standardized metrics it includes.