clovaai / CLEval

CLEval: Character-Level Evaluation for Text Detection and Recognition Tasks
MIT License

Increment all stats for end2end evaluation #6

Closed fafa28 closed 3 years ago

fafa28 commented 3 years ago

Problem: CLEval does not always return the same results for the same input. In particular, the values of char_false_pos and recognition_score change between runs.

How to reproduce: Call python script.py -g=<your_gt_file> -s=<your_res_file> --E2E multiple times (~10 runs); the reported char_false_pos and recognition_score will not always be the same.

Output 1:

{'Detection': {'hmean': 0.868764725054997,
               'precision': 0.9373529411764706,
               'recall': 0.809529627367135},
 'Detection_Metadata': {'char_false_pos': 46,
                        'char_miss': 1485,
                        'char_overlap': 54,
                        'num_false_pos': 45,
                        'num_merge': 251,
                        'num_split': 72},
 'EndtoEnd': {'hmean': 0.6786105884346162,
              'precision': 0.7120913190529876,
              'recall': 0.6481368356750152,
              'recognition_score': 0.08695652173913043},
 'EndtoEnd_Metadata': {'char_false_pos': 76.0,
                       'char_miss': 2880.0,
                       'num_false_pos': 45,
                       'num_merge': 251,
                       'num_split': 72}}

Output 2 (for exact same <gt_file> and <res_file>):

{'Detection': {'hmean': 0.868764725054997,
               'precision': 0.9373529411764706,
               'recall': 0.809529627367135},
 'Detection_Metadata': {'char_false_pos': 46,
                        'char_miss': 1485,
                        'char_overlap': 54,
                        'num_false_pos': 45,
                        'num_merge': 251,
                        'num_split': 72},
 'EndtoEnd': {'hmean': 0.6786105884346162,
              'precision': 0.7120913190529876,
              'recall': 0.6481368356750152,
              'recognition_score': 0.8823529411764706},
 'EndtoEnd_Metadata': {'char_false_pos': 16.0,
                       'char_miss': 2880.0,
                       'num_false_pos': 45,
                       'num_merge': 251,
                       'num_split': 72}}

Explanation: The function accumulate_stats() in script.py is supposed to increment (+=) all end2end variables with the results of each sample. However, the variables self.e2e_char_false_positive, self.e2e_recog_score_chars, and self.e2e_recog_score_correct_num are not incremented; instead, they are assigned (=) the results of the latest sample. As a consequence, these three variables always hold only the values of the last sample processed, so a different ordering of the samples produces different end2end results.
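A minimal sketch of the symptom (not the actual CLEval code; the class and field names below are illustrative): an accumulator that assigns instead of incrementing keeps only the stats of the last sample it sees, so the final value depends on iteration order.

```python
class BuggyStats:
    """Illustrative accumulator reproducing the '=' bug described above."""

    def __init__(self):
        self.e2e_char_false_positive = 0

    def accumulate_stats(self, sample):
        # BUG: '=' overwrites the running total with the latest sample
        self.e2e_char_false_positive = sample["char_fp"]


# Two hypothetical per-sample results (values chosen for illustration)
samples = [{"char_fp": 60}, {"char_fp": 16}]

a, b = BuggyStats(), BuggyStats()
for s in samples:
    a.accumulate_stats(s)
for s in reversed(samples):
    b.accumulate_stats(s)

print(a.e2e_char_false_positive)  # 16
print(b.e2e_char_false_positive)  # 60 -- same inputs, different result
```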

Solution: Change accumulate_stats() in script.py so that the variables self.e2e_char_false_positive, self.e2e_recog_score_chars, and self.e2e_recog_score_correct_num are incremented (+=) with the results of each sample.
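A sketch of the corrected accumulator (the variable names are taken from the issue, but the surrounding class and the sample dictionary keys are assumptions, not the actual script.py code). With +=, the totals are the same regardless of sample order:

```python
class FixedStats:
    """Illustrative accumulator with the proposed '+=' fix applied."""

    def __init__(self):
        self.e2e_char_false_positive = 0
        self.e2e_recog_score_chars = 0
        self.e2e_recog_score_correct_num = 0

    def accumulate_stats(self, sample):
        # '+=' adds each sample's counts to the running totals
        self.e2e_char_false_positive += sample["char_fp"]
        self.e2e_recog_score_chars += sample["recog_chars"]
        self.e2e_recog_score_correct_num += sample["recog_correct"]


# Hypothetical per-sample results
samples = [
    {"char_fp": 60, "recog_chars": 100, "recog_correct": 8},
    {"char_fp": 16, "recog_chars": 34, "recog_correct": 30},
]

a, b = FixedStats(), FixedStats()
for s in samples:
    a.accumulate_stats(s)
for s in reversed(samples):
    b.accumulate_stats(s)

# Accumulated totals are now order-independent
assert a.e2e_char_false_positive == b.e2e_char_false_positive == 76
```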