MantisAI / nervaluate

Full named-entity (i.e., not tag/token) evaluation metrics based on SemEval’13
MIT License
159 stars · 19 forks

[Feature Request] Overlapping Ratio Threshold support #81

Open ain-soph opened 1 month ago

ain-soph commented 1 month ago

Overlapping Ratio

Currently, find_overlap returns True when any overlap occurs, however small.

https://github.com/MantisAI/nervaluate/blob/df0e695645b9d8cd78017552a5a9cc8734f82bf8/src/nervaluate/evaluate.py#L330 https://github.com/MantisAI/nervaluate/blob/df0e695645b9d8cd78017552a5a9cc8734f82bf8/src/nervaluate/utils.py#L85-L104

However, in many cases it would be useful to require a minimum overlap ratio before counting a match. That is, something like this:

pred = {'start':10, 'end':15}
label = {'start':12, 'end':18}

# calculate union and intersection
union = {'start': 10, 'end': 18}
intersection = {'start': 12, 'end': 15}

# calculate ratio: intersection length / union length
ratio = (15 - 12) / (18 - 10)  # 3 / 8 = 0.375

return ratio > threshold

The current find_overlap builds sets of token indices to find overlaps, which seems time-inefficient. The same result can be obtained directly from the start and end values.

Here's my implementation:


def find_overlap(self, true: dict[str, int | str], pred: dict[str, int | str]) -> bool:
    start_max = max(true['start'], pred['start'])
    end_min = min(true['end'], pred['end'])
    if start_max >= end_min:
        return False
    start_min = min(true['start'], pred['start'])
    end_max = max(true['end'], pred['end'])
    overlap_ratio = (end_min - start_max) / (end_max - start_min)
    return overlap_ratio > self.overlap_ratio_threshold
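As a standalone sketch of the same idea (the threshold argument is hypothetical here; in the snippet above it lives on the evaluator as self.overlap_ratio_threshold), using the example spans from earlier:

```python
def find_overlap(true: dict, pred: dict, threshold: float = 0.5) -> bool:
    # Intersection: from the later start to the earlier end.
    start_max = max(true['start'], pred['start'])
    end_min = min(true['end'], pred['end'])
    if start_max >= end_min:
        return False  # no intersection at all
    # Union: from the earlier start to the later end.
    start_min = min(true['start'], pred['start'])
    end_max = max(true['end'], pred['end'])
    # Intersection length over union length (IoU on 1-D spans).
    overlap_ratio = (end_min - start_max) / (end_max - start_min)
    return overlap_ratio > threshold

# The spans from the example above: intersection 3, union 8, ratio 0.375.
print(find_overlap({'start': 10, 'end': 15}, {'start': 12, 'end': 18}, 0.3))  # True
print(find_overlap({'start': 10, 'end': 15}, {'start': 12, 'end': 18}, 0.5))  # False
```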

Last Character excluded

I wonder why the end index includes the last token, which is very counter-intuitive. This comes from https://github.com/MantisAI/nervaluate/pull/32. Maybe @aflueckiger could explain this? Does the end in your data include the last character?

I think for most data, start and end are offsets into the original text string, i.e. text[start:end], which means the last character is excluded: text[1:3] and text[3:5] don't overlap at all. https://github.com/MantisAI/nervaluate/blob/df0e695645b9d8cd78017552a5a9cc8734f82bf8/src/nervaluate/evaluate.py#L294-L296
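A quick interpreter check confirms the exclusive-end reading (the text string here is just an illustration):

```python
# With exclusive end offsets, text[start:end] excludes the character at `end`,
# so spans that merely touch share no characters.
text = "abcdefgh"
print(text[1:3])  # 'bc'
print(text[3:5])  # 'de'

# The index sets behind the two slices are disjoint.
overlap = set(range(1, 3)) & set(range(3, 5))
print(overlap)  # set()
```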

Any support for huggingface Evaluate?

Would the maintainers consider adopting the huggingface Evaluate standard? That would mean inheriting from evaluate.Metric and pushing to the huggingface hub, after which users could directly call metric = evaluate.load('{hub_url}').

Example: https://huggingface.co/spaces/evaluate-metric/glue/blob/main/glue.py

aflueckiger commented 1 month ago

The original code as well as my additions are indeed confusing. This boils down to the CoNLL format, which is based on token offsets instead of char offsets, if I see correctly.

Since we have token indices, we have to add one to the end index to include the last token.

>>> "Test my ORG".split()[2:2]
[]
>>> "Test my ORG".split()[2:3]
['ORG']

>>> "Test my DOUBLE ORG".split()[2:4]
['DOUBLE', 'ORG']

It is baffling that there is still no up-to-date, non-hacky NER evaluation procedure. A huggingface implementation that gets rid of historical assumptions is absolutely desirable. I also think that an overlap ratio is a good idea.

Yet, I don't have the capacity for a HF implementation since NER has not been my focus recently.

What is the plan with this repo, @ivyleavedtoadflax @davidsbatista?

ain-soph commented 1 month ago

@aflueckiger thanks for your positive feedback. But I'm still confused about the "last token included" part.

["Test", "my", "DOUBLE", "ORG"]  # len(4)

When we want to get ["DOUBLE", "ORG"], the slice should be [2:4] for sure, so the entity should be {'start': 2, 'end': 4}.
I don't see any reason for it to be [2:3].

You mean that in your data, the entity is recorded as {'start': 2, 'end': 3}? That record format seems strange to me.

aflueckiger commented 1 month ago

Unless I am totally confused, the token indices are not saved as part of the dataset. They are simply an enumeration of the tokens as the text is tokenized in the original format (see here for an example). Thus, we have to add +1 to the end indices. Otherwise, we end up with empty spans for single-word entities.

# Token IDs
# 0: Test
# 1: my
# 2: DOUBLE
# 3: ORG

>>> "Test my ORG".split()[2:2]
[]
>>> "Test my ORG".split()[2:3]
['ORG']

>>> "Test my DOUBLE ORG".split()[2:4]
['DOUBLE', 'ORG']
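In other words, the +1 converts an inclusive token end index into Python's exclusive slice end. A minimal sketch (the entity dict is illustrative, not nervaluate's internal format):

```python
tokens = "Test my DOUBLE ORG".split()

# Inclusive convention: 'end' is the index of the entity's last token.
entity = {'start': 2, 'end': 3}  # covers DOUBLE and ORG

# Without the adjustment, the slice drops the last token.
print(tokens[entity['start']:entity['end']])      # ['DOUBLE']

# Adding 1 yields Python's exclusive slice end and recovers the full span.
span = tokens[entity['start']:entity['end'] + 1]
print(span)                                       # ['DOUBLE', 'ORG']
```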

I hope it is clearer now. I fully agree that the code is confusing and needs a complete revamp though.