Open ivyleavedtoadflax opened 3 years ago
Hey, thank you very much for the package! We have been using it for a project and this is definitely a problem, not least because it makes the results non-deterministic when the annotations are passed in a different order.
This happens when multiple entities from the ground truth are predicted as a single entity in the prediction, as correctly identified in the issue linked above. We have built a small MWE (minimal working example) to show the problem:
from nervaluate import Evaluator
import pandas as pd

# Two orderings of the same ground truth: entity A spans [0, 12] and
# entity B spans [14, 17].
gt1 = [
    [
        {"start": 0, "end": 12, "label": "A"},
        {"start": 14, "end": 17, "label": "B"},
    ]
]

gt2 = [
    [
        {"start": 14, "end": 17, "label": "B"},
        {"start": 0, "end": 12, "label": "A"},
    ]
]

# A single prediction spanning both ground-truth entities.
pred = [
    [
        {"start": 0, "end": 17, "label": "A"},
    ]
]

classes = ["A", "B"]

if __name__ == "__main__":
    for i, gt in enumerate([gt1, gt2]):
        print(f"Run {i}")
        evaluator = Evaluator(gt, pred, tags=classes)
        results, results_by_tag = evaluator.evaluate()
        df_results = pd.DataFrame(results).T
        int_cols = [
            c for c in df_results.columns if c not in ["precision", "recall", "f1"]
        ]
        df_results[int_cols] = df_results[int_cols].astype(int)
        print(df_results)
        print()
This gives the following output:
Run 0
correct incorrect partial missed spurious possible actual precision recall f1
ent_type 1 0 0 1 0 2 1 1.0 0.50 0.666667
partial 0 0 1 1 0 2 1 0.5 0.25 0.333333
strict 0 1 0 1 0 2 1 0.0 0.00 0.000000
exact 0 1 0 1 0 2 1 0.0 0.00 0.000000
Run 1
correct incorrect partial missed spurious possible actual precision recall f1
ent_type 0 1 0 1 0 2 1 0.0 0.00 0.000000
partial 0 0 1 1 0 2 1 0.5 0.25 0.333333
strict 0 1 0 1 0 2 1 0.0 0.00 0.000000
exact 0 1 0 1 0 2 1 0.0 0.00 0.000000
Since the entities overlap, we would expect that for ent_type both correct and incorrect are 1. But because only one of the ground-truth entities gets compared with the prediction, the first entity is, depending on the order, marked as correct or incorrect, and the other entity is then marked as missed.
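To make the expectation concrete, here is a rough sketch of an order-independent comparison for the ent_type schema (the helper names are hypothetical and this is not nervaluate's actual matching code). It reuses the gt1, gt2 and pred lists from the MWE above and assumes inclusive end offsets:

def overlaps(a, b):
    # True if the two spans share at least one character offset
    # (assuming inclusive "end" offsets, as in the MWE above).
    return a["start"] <= b["end"] and b["start"] <= a["end"]

def expected_ent_type_counts(true_ents, pred_ents):
    counts = {"correct": 0, "incorrect": 0, "missed": 0}
    for t in true_ents:
        # Score every ground-truth entity against any overlapping prediction,
        # so the result cannot depend on the order of the annotations.
        match = next((p for p in pred_ents if overlaps(t, p)), None)
        if match is None:
            counts["missed"] += 1
        elif match["label"] == t["label"]:
            counts["correct"] += 1
        else:
            counts["incorrect"] += 1
    return counts

print(expected_ent_type_counts(gt1[0], pred[0]))  # {'correct': 1, 'incorrect': 1, 'missed': 0}
print(expected_ent_type_counts(gt2[0], pred[0]))  # same counts, independent of order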
Thank you very much for your time!
Best, Giulia
@karzideh, @jantrienes
Hey @giuliabaldini, thanks for taking the time to look at this. This is indeed a problem that needs fixing. If you have time to put together a PR for it, that would be super helpful. In the meantime, I'll add it to our backlog.
Hey @ivyleavedtoadflax, I see this is still an issue. Do you have any suggestions for fixing it?
Hello everyone, I was trying to solve the problem, but I'm not sure what the desired behavior is. I found a way to take both true entities into account, counting the first one as correct and the second one as incorrect (see the sketch after the tables below).
Based on @giuliabaldini's example, these are the results:
Run 0
correct incorrect partial missed spurious possible actual precision recall f1
ent_type 1 1 0 0 0 2 2 0.5 0.5 0.5
partial 0 0 2 0 0 2 2 0.5 0.5 0.5
strict 0 2 0 0 0 2 2 0.0 0.0 0.0
exact 0 2 0 0 0 2 2 0.0 0.0 0.0
Run 1
correct incorrect partial missed spurious possible actual precision recall f1
ent_type 1 1 0 0 0 2 2 0.5 0.5 0.5
partial 0 0 2 0 0 2 2 0.5 0.5 0.5
strict 0 2 0 0 0 2 2 0.0 0.0 0.0
exact 0 2 0 0 0 2 2 0.0 0.0 0.0
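To be concrete, this is roughly the counting I have in mind (a simplified sketch with hypothetical names, not the actual patch): every ground-truth entity that overlaps a prediction is scored against it, even if that prediction was already matched, so the single prediction above contributes twice to actual instead of leaving the second entity as missed.

def overlaps(a, b):
    # Spans share at least one character offset (inclusive "end" offsets).
    return a["start"] <= b["end"] and b["start"] <= a["end"]

def proposed_counts(true_ents, pred_ents):
    schemas = {
        s: {"correct": 0, "incorrect": 0, "partial": 0, "missed": 0, "spurious": 0}
        for s in ("ent_type", "partial", "strict", "exact")
    }
    matched = set()
    for t in true_ents:
        i = next((i for i, p in enumerate(pred_ents) if overlaps(t, p)), None)
        if i is None:
            # No prediction overlaps this entity at all: missed in every schema.
            for s in schemas:
                schemas[s]["missed"] += 1
            continue
        matched.add(i)
        p = pred_ents[i]
        same_type = p["label"] == t["label"]
        same_span = (p["start"], p["end"]) == (t["start"], t["end"])
        schemas["ent_type"]["correct" if same_type else "incorrect"] += 1
        schemas["exact"]["correct" if same_span else "incorrect"] += 1
        schemas["partial"]["correct" if same_span else "partial"] += 1
        schemas["strict"]["correct" if same_type and same_span else "incorrect"] += 1
    # Predictions that never overlapped any ground-truth entity are spurious.
    for i in range(len(pred_ents)):
        if i not in matched:
            for s in schemas:
                schemas[s]["spurious"] += 1
    return schemas

gt = [{"start": 0, "end": 12, "label": "A"}, {"start": 14, "end": 17, "label": "B"}]
pred = [{"start": 0, "end": 17, "label": "A"}]
print(proposed_counts(gt, pred))        # ent_type: correct=1, incorrect=1; strict/exact: incorrect=2
print(proposed_counts(gt[::-1], pred))  # identical counts for the reversed order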
Is this solution acceptable? If so, I will send a PR.
Hi @coffepowered, thanks for your comment. Unfortunately, we're not actively working on nervaluate right now. @infopz, if you can put in a PR, we will review it. Thanks!
https://github.com/davidsbatista/NER-Evaluation/issues/17