This PR updates SASR to use `Transform`s and also updates `evaluate_dataset` to take a `Dataset` instead of a `DataConfig`. Because of this change, `compute_and_aggregate_metrics` no longer needs to be a separate function, so I moved its code into `evaluate_dataset`.
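For reviewers unfamiliar with the pattern, here is a minimal, self-contained sketch of the idea: `Transform` and `evaluate_dataset` below are simplified illustrative stand-ins, not the actual signatures in this codebase.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Illustrative only: the real Transform / evaluate_dataset signatures differ.
Record = Dict[str, object]

@dataclass
class Transform:
    """Maps a record to a record, typically adding output keys."""
    fn: Callable[[Record], Record]

    def __call__(self, record: Record) -> Record:
        return self.fn(record)

def evaluate_dataset(dataset: List[Record], transforms: List[Transform]) -> List[Record]:
    """Apply each transform in order to every record.

    Takes the dataset itself rather than a config describing how to load it,
    mirroring the Dataset-instead-of-DataConfig change in this PR. Any metric
    aggregation would happen inline here, analogous to folding
    compute_and_aggregate_metrics into evaluate_dataset.
    """
    for transform in transforms:
        dataset = [transform(record) for record in dataset]
    return dataset

# Usage: a toy "score" transform applied to a one-record dataset.
score = Transform(lambda r: {**r, "score": len(str(r["model_output"]))})
records = evaluate_dataset([{"model_output": "abc"}], [score])
```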
Note that the diff looks scary, but this PR is very heavy on the code deletion side. The main source code to be reviewed is the new implementation of the SASR algo.
Other large-diff files to note:
- `test_util.py`: This seemingly has a lot of changes, but that's just due to refactoring `evaluate_dataset`. The unit tests for `verify_model_determinism` have also been updated, as I've changed its implementation slightly.
- `test_summarization_accuracy_semantic_robustness.py`: Note that a huge chunk of the unit tests have been deleted, as those tests were purely meant for verifying that all of the relevant function calls in the `evaluate` workflow were made (as opposed to concretely testing that specific scores are correct; you'll see that the expected outputs for these tests are all just 0.0). Since there are unit tests for `evaluate_dataset` now, all of these old SASR unit tests can go. Note that if you use the side-by-side diff viewer on GitHub, it will look like I deleted all of the `evaluate_sample` test cases, but I didn't; they just got moved further down in the file.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.