MantisAI / nervaluate

Full named-entity (i.e., not tag/token) evaluation metrics based on SemEval’13
MIT License

crucial fixes for evaluation #32

Closed aflueckiger closed 3 years ago

aflueckiger commented 4 years ago

I am part of the team behind the shared task CLEF-HIPE-2020, for which I adapted @davidsbatista's evaluation procedure. Since the scorer is now maintained here, it makes sense to open the pull request in this repository.

During our sanity checks, we came across several crucial bugs in the original code. These bugs are severe as they lead to incorrect evaluation results in some cases. Currently, they are not caught by your tests.

Please review this pull request, which fixes the following:

The diff looks messier than it is: I used an automatic code formatter and forgot to deactivate it beforehand, sorry.

Currently, two unit tests fail. I am not sure whether this is related to my changes, as it concerns only two functions. Please verify on your side so that we have a double check.

Happy to answer questions if there are any.

ivyleavedtoadflax commented 3 years ago

Thanks for this @aflueckiger - and sorry it has taken me so long to get to it. I'll review this today.

ivyleavedtoadflax commented 3 years ago

Many thanks for this @aflueckiger, I've updated the tests which were failing due to this:

an over-generated entity with a valid tag should be attributed to the respective tag as FP only and not to all tags as currently done
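The quoted behavior can be sketched in a few lines. This is a hypothetical illustration, not nervaluate's actual internals: a spurious ("over-generated") predicted entity, i.e. one with no overlapping gold entity, is charged as a false positive only under its own tag, rather than under every tag in the tag set.

```python
# Hypothetical sketch of per-tag FP attribution for spurious entities.
# Entities are (tag, start, end) tuples with inclusive token offsets.
from collections import Counter

def spurious_fp_by_tag(true_entities, pred_entities):
    """Count predicted entities that overlap no gold entity as FPs,
    attributed only to the predicted entity's own tag."""
    fp = Counter()
    for tag, start, end in pred_entities:
        overlaps_gold = any(
            t_start <= end and start <= t_end
            for _t_tag, t_start, t_end in true_entities
        )
        if not overlaps_gold:
            fp[tag] += 1  # FP charged to this tag only, not to all tags
    return fp

true = [("PER", 0, 1), ("LOC", 5, 6)]
pred = [("PER", 0, 1), ("ORG", 10, 11)]  # ORG span is over-generated
print(spurious_fp_by_tag(true, pred))  # Counter({'ORG': 1})
```

Under the buggy behavior described above, the spurious ORG span would have inflated the FP counts of PER and LOC as well, skewing their precision.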

Your scorer looks great. If you want to incorporate that functionality into this package, you are very welcome to, and this is something I would support.