Closed gurunathparasaram closed 6 years ago
Heya,
So I think the information you're missing is that `extractEdits` produces different outputs depending on the command line args. The default option actually does compare `(start, end, correction)` using lines 80-83, and so does produce correction scores.
If you use the `-ds` or `-dt` flag, however, you can switch the scorer into span-based or token-based detection mode, which is more like the situation you described, where we only compare `(start, end)` edits. This is useful if you want to evaluate a system in terms of how many errors it detected, even if it got the correction wrong.
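To make the difference between the modes concrete, here is a minimal sketch of how an edit could be reduced to a comparison key in each mode. It is not ERRANT's actual code: the `Edit` tuple, the `edit_keys` helper and the mode names are hypothetical, and attaching an insertion to the token on its right in token mode is an assumption made for illustration.

```python
from typing import List, NamedTuple, Tuple


class Edit(NamedTuple):
    # Hypothetical edit record, used only for this illustration.
    start: int
    end: int
    cat: str
    correction: str


def edit_keys(edit: Edit, mode: str = "correction") -> List[Tuple]:
    """Return the key(s) under which this edit is matched against another edit set."""
    if mode == "correction":
        # Default-style comparison: the correction string is part of the key.
        return [(edit.start, edit.end, edit.correction)]
    if mode == "span":
        # Span-based detection: only the span is compared.
        return [(edit.start, edit.end)]
    if mode == "token":
        # Token-based detection: one key per source token covered by the span.
        # Assumption: an insertion (start == end) is attached to the token on its right.
        if edit.start == edit.end:
            return [(edit.start, edit.start + 1)]
        return [(i, i + 1) for i in range(edit.start, edit.end)]
    raise ValueError(f"unknown mode: {mode}")


# Two edits on the same span that propose different corrections.
hyp = Edit(6, 7, "R:NOUN:NUM", "disordered")
ref = Edit(6, 7, "R:NOUN:NUM", "disorders")

for mode in ("correction", "span", "token"):
    matched = bool(set(edit_keys(hyp, mode)) & set(edit_keys(ref, mode)))
    print(mode, matched)
```

With these two edits, only the correction mode treats the pair as a mismatch; both detection modes count the error as found.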
Hope that helps!
Thanks, Chris. I didn't understand the token-based method properly (my bad) and expected token-level correction. At first, I thought of comparing token-level edits based on categories (like comparing `(start,end,cat,correction)`), but I think we can't attribute the error category of a multi-token edit to each particular token (correct me if I am wrong). The span-based method seems better. Thanks for the explanation.
In `compare_m2.py`, the edits for a coder obtained from the `extract_edits()` function are in the form `(start,end):category`. When the extracted edits for the hypothesis and gold corrections are compared in the `compareEdits()` function (https://github.com/chrisjbryant/errant/blob/fb3196e60ba76c4c3d647ffaec8b36b9c0aa3367/compare_m2.py#L100), the edits are first compared on their `(start,end)` and then checked to see whether their error categories match. If just the `(start,end)` and the error categories of a hypothesis edit and a reference edit are equal, it is counted as a true positive.

Consider the case below:

Source sentence: With the risk of being genetically disorder , many individuals have done the decision to undergo genetic testing .
Hypothesis sentence: With the risk of being genetically disordered , many individuals have done the decision to undergo genetic testing .
Gold correction: With the risk of having genetic disorders , many individuals have made the decision to undergo genetic testing .
In this case, the hypothesis edit is `(6,7):R:NOUN:NUM` and the reference edit is `(6,7):R:NOUN:NUM`. Here their `(start,end)` and error categories are the same, and hence they are counted as a true positive. As far as I understand, we are not comparing the actual correction tokens 'disordered' vs 'disorders', so does this inflate the number of true positives? Is there any reasoning behind comparing just the start, end and error category of the edits that I am missing?
Would it be better if the corrected tokens in the hypothesis edit and the reference edit were also compared before counting it as a true positive? Thanks.
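The concern can be reproduced with a small sketch. This is not the code in `compare_m2.py`: `count_matches`, the `check_correction` switch and the dictionary layout are hypothetical, and only the `(6,7)` edit from the example is included (the other reference edits are left out).

```python
from typing import Dict, Tuple

Span = Tuple[int, int]
# Each span maps to (category, correction); a hypothetical layout for illustration.
EditMap = Dict[Span, Tuple[str, str]]


def count_matches(hyp: EditMap, ref: EditMap, check_correction: bool) -> Tuple[int, int, int]:
    """Count (tp, fp, fn), optionally requiring the correction strings to agree."""
    tp = fp = 0
    for span, (cat, cor) in hyp.items():
        ref_edit = ref.get(span)
        # A match needs the same span and the same error category.
        same = ref_edit is not None and ref_edit[0] == cat
        if same and check_correction:
            # Stricter variant: the proposed correction must also be identical.
            same = ref_edit[1] == cor
        if same:
            tp += 1
        else:
            fp += 1
    # Every reference edit that was not matched is a false negative.
    fn = len(ref) - tp
    return tp, fp, fn


hyp = {(6, 7): ("R:NOUN:NUM", "disordered")}
ref = {(6, 7): ("R:NOUN:NUM", "disorders")}

print(count_matches(hyp, ref, check_correction=False))  # (1, 0, 0)
print(count_matches(hyp, ref, check_correction=True))   # (0, 1, 1)
```

Comparing only span and category counts the edit as a true positive, while also comparing the correction turns it into one false positive and one false negative.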