Closed danielezhu closed 6 months ago
LGTM. The semantic robustness integration tests seem to be taking significantly longer with the changes in this PR. Did you notice this locally as well?
Just saw that the integ tests timed out on CodeBuild. This doesn't happen locally; I can run all of the `evaluate_sample` test cases in several minutes, and the `evaluate` test takes about 8 min. We noticed this extremely long running time when Bilal first submitted his PR, which added the baselining logic to GSR, but I thought the long runtime was caused by the multiple `evaluate` test cases. To reduce the runtime, we changed the integ tests to test only a single perturbation instead of all three. The fact that `evaluate_sample` is taking so long, though, is worrying and strange.
Description of changes: This PR updates the `GeneralSemanticRobustness` algorithm such that evaluation logic is implemented via the use of `Transform`s. This PR is analogous to #214, but for General Semantic Robustness (GSR).

The diff indicates that many files have been changed, but all of the changes to code that is not related to GSR are due to my renaming of some constants (`RANDOM_UPPER_CASE` -> `RANDOM_UPPERCASE` and `WHITESPACE_ADD_REMOVE` -> `ADD_REMOVE_WHITESPACE`).

The main files to review are the following:
Aside from these main changes, I've introduced a new util function, `create_output_key`, and a new `Mean` transform. Both of these are used by GSR.

Note that the expected values for the BERTScore Dissimilarity and WER scores in the GSR integration tests are now different because the implementation of the semantic perturbation transforms uses the latest NumPy APIs for random number generation. See this commit from #215.
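To illustrate why the expected scores shift, here is a minimal sketch (not the code from this PR) showing that NumPy's legacy global-state API and the newer `Generator` API produce different random streams from the same seed. The seed value and the number of draws are assumptions chosen only for illustration.

```python
import numpy as np

SEED = 5  # arbitrary seed, chosen only for this illustration

# Legacy global-state API: seeds the module-level RandomState (Mersenne Twister).
np.random.seed(SEED)
legacy_draws = np.random.rand(3)

# Newer Generator API: default_rng uses the PCG64 bit generator by default.
rng = np.random.default_rng(SEED)
generator_draws = rng.random(3)

# Same seed, different bit generators, so the two streams do not line up.
streams_match = bool(np.allclose(legacy_draws, generator_draws))

# Each API is still reproducible on its own when re-seeded.
np.random.seed(SEED)
legacy_repeat = np.random.rand(3)
generator_repeat = np.random.default_rng(SEED).random(3)

print("streams match:", streams_match)
print("legacy reproducible:", np.array_equal(legacy_draws, legacy_repeat))
print("generator reproducible:", np.array_equal(generator_draws, generator_repeat))
```

Because any perturbation that samples positions or characters from these streams will pick different ones after the switch, downstream scores computed on the perturbed text would be expected to change even if the scoring logic itself is untouched.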
In order to verify that the numerical score values are changing solely due to the new random number generation APIs and not due to bugs in my re-implementation of GSR, I did the following:

1. Run the `evaluate_sample` integration tests on the current `main` branch code, and take note of the scores that are generated.
2. Run the `evaluate_sample` integration tests using the new `Transform`-based implementation, but change the code in `semantic_perturbations.py` so that we use the old `numpy` and `random` RNG code.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.