Difference between paper equations and code

In Equation 7 of the paper, my understanding is that you compute the precision and recall for each n-gram order and then average each of them over the n-gram orders (1 to 4). Only after that do you calculate the F1 score of each operation, and then compute SARI/STAR by averaging the operation scores:

add_precision = (add_precision_1 + add_precision_2 + add_precision_3 + add_precision_4) / 4
add_recall = (add_recall_1 + add_recall_2 + add_recall_3 + add_recall_4) / 4
add_f1 = 2 * add_precision * add_recall / (add_precision + add_recall)

keep_precision = (keep_precision_1 + keep_precision_2 + keep_precision_3 + keep_precision_4) / 4
keep_recall = (keep_recall_1 + keep_recall_2 + keep_recall_3 + keep_recall_4) / 4
keep_f1 = 2 * keep_precision * keep_recall / (keep_precision + keep_recall)

del_precision = (del_precision_1 + del_precision_2 + del_precision_3 + del_precision_4) / 4

sari = (add_f1 + keep_f1 + del_precision) / 3
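The paper-style order of operations above can be sketched in Python as follows. This is only an illustration of the reading described here, not the repository's code: the function names (`f1`, `mean`, `sari_paper_order`) are hypothetical, each argument is assumed to be a list of per-order values for n-gram orders 1 to 4, and returning 0 for F1 when precision and recall are both 0 is an assumption made to avoid division by zero.

```python
def f1(precision, recall):
    # Assumption: define F1 as 0 when precision + recall is 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean(values):
    return sum(values) / len(values)


def sari_paper_order(add_p, add_r, keep_p, keep_r, del_p):
    """Average precision and recall over n-gram orders first, then take F1."""
    add_f1 = f1(mean(add_p), mean(add_r))
    keep_f1 = f1(mean(keep_p), mean(keep_r))
    del_precision = mean(del_p)  # deletion uses precision only
    return (add_f1 + keep_f1 + del_precision) / 3
```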

However, the code follows a different procedure. There, an F1 score is computed for each operation at each n-gram order. These per-order F1 scores are then averaged over the n-gram orders (1 to 4), and the resulting operation scores are summed and divided by 3 (the number of operations) at the end.

add_f1_1 = 2 * add_precision_1 * add_recall_1 / (add_precision_1 + add_recall_1)
add_f1_2 = 2 * add_precision_2 * add_recall_2 / (add_precision_2 + add_recall_2)
add_f1_3 = 2 * add_precision_3 * add_recall_3 / (add_precision_3 + add_recall_3)
add_f1_4 = 2 * add_precision_4 * add_recall_4 / (add_precision_4 + add_recall_4)

add_1 = (add_f1_1 + add_f1_2 + add_f1_3 + add_f1_4) / 4

keep_f1_1 = 2 * keep_precision_1 * keep_recall_1 / (keep_precision_1 + keep_recall_1)
keep_f1_2 = 2 * keep_precision_2 * keep_recall_2 / (keep_precision_2 + keep_recall_2)
keep_f1_3 = 2 * keep_precision_3 * keep_recall_3 / (keep_precision_3 + keep_recall_3)
keep_f1_4 = 2 * keep_precision_4 * keep_recall_4 / (keep_precision_4 + keep_recall_4)

keep_1 = (keep_f1_1 + keep_f1_2 + keep_f1_3 + keep_f1_4) / 4

del_precision = (del_precision_1 + del_precision_2 + del_precision_3 + del_precision_4) / 4

sari = (add_1 + keep_1 + del_precision) / 3
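The code-style order of operations above can be sketched in Python in the same way. Again, this is a hedged illustration rather than the repository's implementation: the names (`f1`, `mean`, `sari_code_order`) are hypothetical, the inputs are assumed to be lists of per-order values for n-gram orders 1 to 4, and the zero-denominator handling is an assumption.

```python
def f1(precision, recall):
    # Assumption: define F1 as 0 when precision + recall is 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean(values):
    return sum(values) / len(values)


def sari_code_order(add_p, add_r, keep_p, keep_r, del_p):
    """Take F1 at each n-gram order first, then average the F1 scores."""
    add_score = mean([f1(p, r) for p, r in zip(add_p, add_r)])
    keep_score = mean([f1(p, r) for p, r in zip(keep_p, keep_r)])
    del_precision = mean(del_p)  # deletion uses precision only
    return (add_score + keep_score + del_precision) / 3
```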

These two procedures are not mathematically equivalent, so they produce different scores. Which is the correct process then? The one in the paper or the one in the code?
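To make the discrepancy concrete, here is a small numeric check for a single operation. The per-order precision and recall values are made up purely for illustration, and the helper names are hypothetical; the F1-is-0-on-zero-denominator rule is an assumption.

```python
def f1(precision, recall):
    # Assumption: define F1 as 0 when precision + recall is 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean(values):
    return sum(values) / len(values)


# Made-up per-order values for n-gram orders 1..4 (illustration only).
precision = [0.9, 0.5, 0.3, 0.1]
recall = [0.1, 0.3, 0.5, 0.9]

# Paper order: average precision and recall, then F1.
f1_of_means = f1(mean(precision), mean(recall))

# Code order: F1 per n-gram order, then average.
mean_of_f1s = mean([f1(p, r) for p, r in zip(precision, recall)])

print(round(f1_of_means, 4), round(mean_of_f1s, 4))
```

With these values the first quantity comes out at roughly 0.45 and the second at roughly 0.2775, so the choice of procedure can change the final score noticeably.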

Thanks for your help and clarification.
