ain-soph opened this issue 1 month ago
The original code as well as my additions are indeed confusing. This boils down to the CoNLL format, which is based on token offsets instead of char offsets, if I understand correctly.
Since we have token indices, we have to add one to the end index to take the last token into account.
```python
>>> "Test my ORG".split()[2:2]
[]
>>> "Test my ORG".split()[2:3]
['ORG']
>>> "Test my DOUBLE ORG".split()[2:4]
['DOUBLE', 'ORG']
```
It is baffling that there is still no up-to-date and non-hacky NER evaluation procedure. A huggingface implementation that gets rid of historic assumptions is absolutely desirable. I also think that an overlap ratio is a good idea.
Yet, I don't have the capacity for an HF implementation, since NER has not been my focus recently.
What is the plan with this repo, @ivyleavedtoadflax @davidsbatista?
@aflueckiger thanks for your positive feedback. But I'm still confused about the "last token included" part.

```python
["Test", "my", "DOUBLE", "ORG"]  # len(4)
```

When we want to get `["DOUBLE", "ORG"]`, the slice shall be `[2:4]` for sure, and the entity shall be `{'start': 2, 'end': 4}`. I don't see any reason for it to be `[2:3]`.
Do you mean that in your data the entity is recorded as `{'start': 2, 'end': 3}`? That record format seems strange to me.
Unless I am totally confused, the token indices are not saved as part of the dataset. They are simply an enumeration of the tokens as the text is tokenized in the original format (see here for an example). Thus, we have to add +1 to the end indices; otherwise we end up with empty spans for single-word entities.
```python
# Token IDs
# 0: Test
# 1: my
# 2: DOUBLE
# 3: ORG
>>> "Test my ORG".split()[2:2]
[]
>>> "Test my ORG".split()[2:3]
['ORG']
>>> "Test my DOUBLE ORG".split()[2:4]
['DOUBLE', 'ORG']
```
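In other words, the +1 fix can be sketched as a tiny helper (the name `tokens_for_span` is hypothetical, not part of nervaluate):

```python
def tokens_for_span(tokens, start, end_inclusive):
    # Hypothetical helper (not part of nervaluate): a CoNLL-style span
    # stores the index of the LAST token, while a Python slice excludes
    # its end position, so we add 1 when slicing.
    return tokens[start : end_inclusive + 1]

tokens = "Test my DOUBLE ORG".split()
print(tokens_for_span(tokens, 3, 3))  # single-token entity -> ['ORG']
print(tokens_for_span(tokens, 2, 3))  # multi-token entity -> ['DOUBLE', 'ORG']
```

Without the +1, the single-token case would collapse to the empty slice `tokens[3:3]`.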
I hope it is clearer now. I fully agree that the code is confusing and needs a complete revamp though.
Overlapping Ratio
Currently, `find_overlap` will be True when any single overlap occurs.
https://github.com/MantisAI/nervaluate/blob/df0e695645b9d8cd78017552a5a9cc8734f82bf8/src/nervaluate/evaluate.py#L330
https://github.com/MantisAI/nervaluate/blob/df0e695645b9d8cd78017552a5a9cc8734f82bf8/src/nervaluate/utils.py#L85-L104
However, in most cases, we would hope for an overlapping ratio threshold, something like this:

The current `find_overlap` uses set operations to find overlaps, which seems time-inefficient. The overlap could be obtained directly from the `start` and `end` values. Here's my implementation:
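A minimal sketch of such a threshold check, computed directly from start/end offsets rather than sets (function names and the default threshold are illustrative, not the author's actual implementation or the nervaluate API):

```python
def overlap_ratio(true_start, true_end, pred_start, pred_end):
    # Fraction of the true span covered by the predicted span,
    # computed arithmetically from offsets (end exclusive) -- no sets.
    overlap = max(0, min(true_end, pred_end) - max(true_start, pred_start))
    length = true_end - true_start
    return overlap / length if length > 0 else 0.0

def sufficient_overlap(true_ent, pred_ent, threshold=0.5):
    # Count a prediction as a partial match only when it covers at
    # least `threshold` of the true span. Threshold value is arbitrary.
    return overlap_ratio(true_ent["start"], true_ent["end"],
                         pred_ent["start"], pred_ent["end"]) >= threshold

print(overlap_ratio(2, 4, 3, 4))  # predicted span covers half the true span -> 0.5
```

Compared to building two sets of indices and intersecting them, this is O(1) per pair of spans.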
Last Character excluded
I wonder why we consider the last token included, which is very counter-intuitive. This comes from https://github.com/MantisAI/nervaluate/pull/32. Maybe @aflueckiger could provide an explanation for this? Does your data's `end` include the last character?

I think for most data, the start and end are offsets into the original text string, i.e. `text[start:end]`, which means the last character is excluded. `text[1:3]` and `text[3:5]` don't have any overlap.
https://github.com/MantisAI/nervaluate/blob/df0e695645b9d8cd78017552a5a9cc8734f82bf8/src/nervaluate/evaluate.py#L294-L296

Any support for huggingface Evaluate?
Would the maintainers consider adopting the standard of huggingface Evaluate? That would mean inheriting from `evaluate.Metric` and pushing to the huggingface hub. Afterwards, users could directly call `metric = evaluate.load('{hub_url}')`.
Example: https://huggingface.co/spaces/evaluate-metric/glue/blob/main/glue.py