Update the implementation of the SummarizationAccuracy evaluation algorithm so that it uses the Transform/TransformPipeline approach.
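The Transform/TransformPipeline pattern can be sketched as follows. This is a minimal, self-contained illustration, not fmeval's actual implementation: the class bodies, the placeholder scoring logic, and the exact signatures here are assumptions for demonstration; only the names `Transform`, `TransformPipeline`, `MeteorScore`, and `execute_record` come from the PR.

```python
from typing import Any, Dict, List


class Transform:
    """Sketch of a record-level transform: reads input keys from a
    record dict and writes computed output keys back into it."""

    def __init__(self, input_keys: List[str], output_keys: List[str]):
        self.input_keys = input_keys
        self.output_keys = output_keys

    def __call__(self, record: Dict[str, Any]) -> Dict[str, Any]:
        raise NotImplementedError


class MeteorScore(Transform):
    """Stand-in for the real METEOR transform; the scoring logic here
    is a placeholder (exact-match), not the actual METEOR metric."""

    def __call__(self, record: Dict[str, Any]) -> Dict[str, Any]:
        target, model_output = (record[k] for k in self.input_keys)
        record[self.output_keys[0]] = float(target == model_output)
        return record


class TransformPipeline:
    """Applies a sequence of transforms to a single record."""

    def __init__(self, transforms: List[Transform]):
        self.transforms = transforms

    def execute_record(self, record: Dict[str, Any]) -> Dict[str, Any]:
        for transform in self.transforms:
            record = transform(record)
        return record


pipeline = TransformPipeline(
    [MeteorScore(input_keys=["target", "model_output"], output_keys=["meteor"])]
)
out = pipeline.execute_record({"target": "hi", "model_output": "hi"})
assert out["meteor"] == 1.0
```

The appeal of the pattern is that each metric becomes a small, composable unit, and the pipeline can apply the same sequence of transforms to every record in a dataset.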
Add minor new functions such as get_dataset_configs and execute_record, alongside their corresponding unit tests.

Remove deepcopy from Transform.__init__, as it conflicts with Ray serialization and didn't provide much value to begin with (and was potentially unexpected/unintuitive).
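The behavioral difference can be shown with a small sketch. These two classes are hypothetical stand-ins, not fmeval code; they only illustrate why deep-copying constructor arguments is surprising (callers expect shared references) and why it can break when Ray pickles objects whose attributes aren't deep-copyable.

```python
import copy
from typing import List


class TransformWithDeepcopy:
    """Old (removed) behavior: constructor args were deep-copied,
    which surprises callers and can fail for non-copyable objects."""

    def __init__(self, input_keys: List[str]):
        self.input_keys = copy.deepcopy(input_keys)


class TransformNoCopy:
    """New behavior: arguments are stored as-is."""

    def __init__(self, input_keys: List[str]):
        self.input_keys = input_keys


keys = ["model_output"]
assert TransformWithDeepcopy(keys).input_keys is not keys  # silently copied
assert TransformNoCopy(keys).input_keys is keys  # caller's list is shared
```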
The largest diff in this PR is in the unit tests for SummarizationAccuracy. Because a lot of code was edited, deleted, or moved, reading the diff may not be the best way to view the changes. I'd suggest jumping to the file directly and reading through the tests from a fresh perspective, since so much has changed.
The unit tests for get_meteor_score and get_rouge_score have been moved to test_summarization_accuracy_metrics.py and have been adapted to test the MeteorScore and RougeScore transforms. Note that BertScore isn't tested because unit tests that validate numerical values already exist for the BertscoreModel helper model.
Also note that there was some unintended behavior in the original unit tests for get_rouge_score: the same config was passed for every parametrized test case, so the rouge_type was always "rouge2" instead of varying with the test case. This has been fixed in my new unit tests, and as a result, some of the expected values have changed.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.