Not comparing the actual correction tokens between hypothesis and reference edits in compare_m2.py

gurunathparasaram commented 6 years ago

In compare_m2.py, the edits for a coder obtained from the extract_edits() function are in the form of (start,end):category.

The extracted edits for the hypothesis and gold corrections are compared in the compareEdits() function (https://github.com/chrisjbryant/errant/blob/fb3196e60ba76c4c3d647ffaec8b36b9c0aa3367/compare_m2.py#L100), which contains the lines below:

# On occasion, multiple tokens at same span.
for h_cat in ref_edits[h_edit]: # Use ref dict for TP
    tp += 1
    # Each dict value [TP, FP, FN]
    if h_cat in cat_dict.keys():
        cat_dict[h_cat][0] += 1
    else:
        cat_dict[h_cat] = [1, 0, 0]

The edits are first matched on their (start,end) span, and then their error categories are checked.
If a hypothesis edit and a reference edit have the same (start,end) and the same error category, the pair is counted as a true positive.
Consider the case below:
Source sentence: With the risk of being genetically disorder , many individuals have done the decision to undergo genetic testing .
Hypothesis sentence: With the risk of being genetically disordered , many individuals have done the decision to undergo genetic testing .
Gold correction: With the risk of having genetic disorders , many individuals have made the decision to undergo genetic testing .
In this case, the hypothesis edit is (6,7):R:NOUN:NUM and the reference edit is (6,7):R:NOUN:NUM. Their (start,end) and error categories are the same, so they are counted as a true positive.
As far as I understand, the actual correction tokens ('disordered' vs 'disorders') are not compared. Does this inflate the number of true positives? Is there a reason for comparing only the start, end and error category of the edits that I am missing?
Would it be better to also compare the corrected tokens in the hypothesis edit and the reference edit before counting a true positive? Thanks.

chrisjbryant commented 6 years ago

Heya,

So I think the information you're missing is that extractEdits produces different outputs depending on the command line args. The default option actually does compare (start, end, correction), using lines 80-83, and so does produce correction scores.

If you use the -ds or -dt flag however, you can switch the scorer into span-based or token-based detection mode, which is more like the situation you described where we only compare (start, end) edits. This is useful if you want to evaluate a system in terms of how many errors it detected, even if it got the correction wrong.
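As a rough illustration of the difference between the two kinds of scoring, here is a minimal sketch. It is not the code in compare_m2.py: the Edit tuple, edit_key() and count_true_positives() are hypothetical names made up for this example, and the detection branch only mimics span-style matching, not the token-based variant. It uses the (6,7) edits from the example in the question.

```python
from collections import namedtuple

# Hypothetical edit record for illustration; ERRANT's internal format may differ.
Edit = namedtuple("Edit", ["start", "end", "cat", "cor"])


def edit_key(edit, detection=False):
    """Return the tuple used to match a hypothesis edit against a reference edit."""
    if detection:
        # Detection-style matching: only the span has to agree.
        return (edit.start, edit.end)
    # Correction-style matching: span and corrected string must both agree.
    return (edit.start, edit.end, edit.cor)


def count_true_positives(hyp_edits, ref_edits, detection=False):
    """Count hypothesis edits whose key also appears among the reference edits."""
    ref_keys = {edit_key(e, detection) for e in ref_edits}
    return sum(1 for e in hyp_edits if edit_key(e, detection) in ref_keys)


hyp = [Edit(6, 7, "R:NOUN:NUM", "disordered")]
ref = [Edit(6, 7, "R:NOUN:NUM", "disorders")]

print(count_true_positives(hyp, ref))                  # 0: corrections differ
print(count_true_positives(hyp, ref, detection=True))  # 1: spans agree
```

With the correction string in the key, 'disordered' vs 'disorders' is not a match; dropping it from the key is what turns the same pair into a true positive.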

Hope that helps!

gurunathparasaram commented 6 years ago

Thanks, Chris. I hadn't understood the token-based method properly (my bad) and expected token-level correction. At first I thought of comparing token-level edits together with their categories (e.g. comparing (start,end,cat,correction)), but I think we can't attribute the error category of a multi-token edit to each individual token (correct me if I am wrong). The span-based method seems better. Thanks for the explanation.

Not comparing the actual correction tokens between hypothesis and reference edits in compare_m2.py #3