The original ROUGE script supports a scoring option, -f A|B, where A averages over the models and B takes the maximum. PythonRouge should implement the same functionality. The logic should be identical to ROUGE's, so we first need to understand the implementation details: how does it compute the "best" model? Is the maximum taken per metric, or does it pick one of the metrics and use that model for precision, recall, and F1?
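
To make the open question concrete, here is a minimal sketch of the candidate behaviors. It is not taken from the ROUGE script or from PythonRouge: the function names, the `Scores` layout (metric name mapped to a precision/recall/F1 triple, one dict per model, i.e. per reference summary), and the `key_index` parameter are all hypothetical. It also assumes that averaging means averaging the score triples, which is itself a detail to check against the Perl implementation.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical layout: metric name -> (precision, recall, f1) for one model.
Scores = Dict[str, Tuple[float, float, float]]


def average_over_models(per_model: List[Scores]) -> Scores:
    """Option A (assumed): average each value over all models."""
    return {
        metric: tuple(mean(s[metric][i] for s in per_model) for i in range(3))
        for metric in per_model[0]
    }


def best_per_value(per_model: List[Scores]) -> Scores:
    """Option B, interpretation 1: take the maximum of each value independently."""
    return {
        metric: tuple(max(s[metric][i] for s in per_model) for i in range(3))
        for metric in per_model[0]
    }


def best_by_key(per_model: List[Scores], key_index: int = 2) -> Scores:
    """Option B, interpretation 2: for each metric, pick the single model that
    maximizes one chosen value (default: f1) and report that model's whole triple."""
    return {
        metric: max((s[metric] for s in per_model), key=lambda t: t[key_index])
        for metric in per_model[0]
    }
```

The two interpretations of B differ whenever no single model is best on every value: the first can combine precision from one model with recall from another, while the second always reports a triple that came from one model. Which of these (if either) matches ROUGE is exactly what needs to be verified.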