MantisAI / nervaluate

Full named-entity (i.e., not tag/token) evaluation metrics based on SemEval’13
MIT License

true_which_overlapped_with_pred does not get updated properly (1) #40

Open ivyleavedtoadflax opened 3 years ago

ivyleavedtoadflax commented 3 years ago

https://github.com/davidsbatista/NER-Evaluation/issues/17

giuliabaldini commented 1 year ago

Hey, thank you very much for the package! We have been using it for a project, and this is definitely a problem; it also makes the results non-deterministic when the annotations are passed in a different order.

This happens when multiple entities from the ground truth are predicted as one in the prediction, as correctly identified in the issue linked above. We have built a small MWE to show the problem:

from nervaluate import Evaluator
import pandas as pd

gt1 = [
    [
        {"start": 0, "end": 12, "label": "A"},
        {"start": 14, "end": 17, "label": "B"},
    ]
]
gt2 = [
    [
        {"start": 14, "end": 17, "label": "B"},
        {"start": 0, "end": 12, "label": "A"},
    ]
]

pred = [
    [
        {"start": 0, "end": 17, "label": "A"},
    ]
]
classes = ["A", "B"]
if __name__ == "__main__":
    for i, gt in enumerate([gt1, gt2]):
        print(f"Run {i}")
        evaluator = Evaluator(gt, pred, tags=classes)
        results, results_by_tag = evaluator.evaluate()
        df_results = pd.DataFrame(results).T
        int_cols = [
            c for c in df_results.columns if c not in ["precision", "recall", "f1"]
        ]
        df_results[int_cols] = df_results[int_cols].astype(int)
        print(df_results)
        print()

This gives the following output:

Run 0
          correct  incorrect  partial  missed  spurious  possible  actual  precision  recall        f1
ent_type        1          0        0       1         0         2       1        1.0    0.50  0.666667
partial         0          0        1       1         0         2       1        0.5    0.25  0.333333
strict          0          1        0       1         0         2       1        0.0    0.00  0.000000
exact           0          1        0       1         0         2       1        0.0    0.00  0.000000

Run 1
          correct  incorrect  partial  missed  spurious  possible  actual  precision  recall        f1
ent_type        0          1        0       1         0         2       1        0.0    0.00  0.000000
partial         0          0        1       1         0         2       1        0.5    0.25  0.333333
strict          0          1        0       1         0         2       1        0.0    0.00  0.000000
exact           0          1        0       1         0         2       1        0.0    0.00  0.000000

Since the prediction overlaps both ground-truth entities, we would expect ent_type to report one correct and one incorrect match. Instead, only one of the ground-truth entities gets compared with the prediction, so depending on the order it is marked as either correct or incorrect, and the other entity is then marked as missed.
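To illustrate why the order matters, here is a simplified sketch (not nervaluate's actual implementation) of a matching loop that stops at the first overlapping true span. Whichever true entity happens to come first in the annotation list decides the outcome; any further overlapping true spans are never credited.

```python
# Simplified illustration of order-dependent matching: the loop returns
# at the FIRST true span that overlaps the prediction, so the outcome
# depends on the order of the ground-truth annotations.

def overlaps(a, b):
    """True if two {'start', 'end'} spans share at least one offset."""
    return a["start"] <= b["end"] and b["start"] <= a["end"]

def first_overlap_match(true_spans, pred):
    """Classify pred against the first overlapping true span only."""
    for t in true_spans:
        if overlaps(t, pred):
            return "correct" if t["label"] == pred["label"] else "incorrect"
    return "spurious"

pred = {"start": 0, "end": 17, "label": "A"}
gt1 = [{"start": 0, "end": 12, "label": "A"},
       {"start": 14, "end": 17, "label": "B"}]
gt2 = list(reversed(gt1))

print(first_overlap_match(gt1, pred))  # 'correct'   (A span is seen first)
print(first_overlap_match(gt2, pred))  # 'incorrect' (B span is seen first)
```

The same prediction is scored differently in the two runs purely because of annotation order, which matches the non-deterministic ent_type rows in the output above.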

Thank you very much for your time!

Best, Giulia

@karzideh, @jantrienes

ivyleavedtoadflax commented 1 year ago

hey @giuliabaldini, thanks for taking the time to look at this. This is indeed a problem that needs fixing. If you have time to put together a PR for it that would be super helpful. In the meantime, I'll add it to our backlog.

coffepowered commented 1 year ago

Hey @ivyleavedtoadflax , I see this is still an issue. Do you have any suggestions to fix it?

infopz commented 1 year ago

Hello everyone, I was trying to solve the problem, but I'm not sure what the desired behavior is. I found a way to take both true entities into account, counting the first one as correct and the second one as incorrect.

Based on @giuliabaldini's example, these are the results:

Run 0
          correct  incorrect  partial  missed  spurious  possible  actual  precision  recall   f1
ent_type        1          1        0       0         0         2       2        0.5     0.5  0.5
partial         0          0        2       0         0         2       2        0.5     0.5  0.5
strict          0          2        0       0         0         2       2        0.0     0.0  0.0
exact           0          2        0       0         0         2       2        0.0     0.0  0.0

Run 1
          correct  incorrect  partial  missed  spurious  possible  actual  precision  recall   f1
ent_type        1          1        0       0         0         2       2        0.5     0.5  0.5
partial         0          0        2       0         0         2       2        0.5     0.5  0.5
strict          0          2        0       0         0         2       2        0.0     0.0  0.0
exact           0          2        0       0         0         2       2        0.0     0.0  0.0

Is this solution acceptable? If so, I will send a PR.
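A hypothetical sketch of the order-independent counting described above (this is an assumption about the approach, not the actual PR): every true span overlapping the prediction is counted, the first label-matching one as correct and the rest as incorrect, so the tallies no longer depend on annotation order.

```python
# Hedged sketch of order-independent ent_type counting: all true spans
# overlapping a prediction are tallied, not just the first one found.

def overlaps(a, b):
    """True if two {'start', 'end'} spans share at least one offset."""
    return a["start"] <= b["end"] and b["start"] <= a["end"]

def count_ent_type(true_spans, pred):
    """Tally ent_type outcomes for all true spans against one prediction."""
    counts = {"correct": 0, "incorrect": 0, "missed": 0}
    matched = False
    for t in true_spans:
        if not overlaps(t, pred):
            counts["missed"] += 1
        elif not matched and t["label"] == pred["label"]:
            counts["correct"] += 1   # first label-matching overlap
            matched = True
        else:
            counts["incorrect"] += 1  # additional overlapping true spans
    return counts

pred = {"start": 0, "end": 17, "label": "A"}
gt = [{"start": 0, "end": 12, "label": "A"},
      {"start": 14, "end": 17, "label": "B"}]

print(count_ent_type(gt, pred))                   # {'correct': 1, 'incorrect': 1, 'missed': 0}
print(count_ent_type(list(reversed(gt)), pred))   # same counts regardless of order
```

Both orderings yield correct=1 and incorrect=1 for ent_type, matching the tables in the comment above.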

ivyleavedtoadflax commented 1 year ago

Hi @coffepowered thanks for your comment. Unfortunately we're not actively working on nervaluate right now. @infopz if you can put in a PR we will review - thanks!