aws / fmeval

Foundation Model Evaluations Library
http://aws.github.io/fmeval
Apache License 2.0

feat: update implementation of GeneralSemanticRobustness to use Transform-based approach #222

Closed danielezhu closed 6 months ago

danielezhu commented 6 months ago

Description of changes: This PR updates the GeneralSemanticRobustness algorithm so that its evaluation logic is implemented using Transforms. This PR is analogous to #214, but for General Semantic Robustness (GSR).
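For readers unfamiliar with the pattern, here is a minimal sketch of the general idea behind a Transform-based evaluation: each step reads some keys from a per-sample record dict and writes new keys, and the algorithm becomes a chain of such steps. This is not the actual fmeval API; all class and function names below are hypothetical and purely illustrative.

```python
from typing import Any, Dict, List

class RecordTransform:
    """Hypothetical base class: reads input_keys from a record and writes output_keys."""

    def __init__(self, input_keys: List[str], output_keys: List[str]):
        self.input_keys = input_keys
        self.output_keys = output_keys

    def __call__(self, record: Dict[str, Any]) -> Dict[str, Any]:
        raise NotImplementedError


class PerturbText(RecordTransform):
    """Writes a perturbed copy of the model input into the record."""

    def __call__(self, record: Dict[str, Any]) -> Dict[str, Any]:
        text = record[self.input_keys[0]]
        record[self.output_keys[0]] = text.swapcase()  # stand-in for a real perturbation
        return record


def run_pipeline(record: Dict[str, Any], steps: List[RecordTransform]) -> Dict[str, Any]:
    """Apply each transform to the record in sequence."""
    for step in steps:
        record = step(record)
    return record


record = run_pipeline(
    {"model_input": "The quick brown fox"},
    [PerturbText(input_keys=["model_input"], output_keys=["perturbed_input"])],
)
print(record["perturbed_input"])
```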

The diff shows that many files have changed, but all of the changes to code unrelated to GSR come from renaming two constants (RANDOM_UPPER_CASE -> RANDOM_UPPERCASE and WHITESPACE_ADD_REMOVE -> ADD_REMOVE_WHITESPACE).

The main files to review are the following:

Aside from these main changes, I've introduced a new util function create_output_key and a new Mean transform. Both of these are used by GSR.
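As a rough illustration of what a Mean transform and a create_output_key helper might do, here is a hedged sketch in the same hypothetical style as above; the real fmeval implementations may differ in signature and behavior.

```python
import statistics
from typing import Any, Dict, List


def create_output_key(transform_name: str, *suffixes: Any) -> str:
    """Hypothetical helper: build a deterministic record key, e.g. 'Mean(wer)'."""
    return f"{transform_name}({','.join(str(s) for s in suffixes)})"


class Mean:
    """Hypothetical Mean transform: averages several numeric record fields
    into a single output field."""

    def __init__(self, input_keys: List[str], output_key: str):
        self.input_keys = input_keys
        self.output_key = output_key

    def __call__(self, record: Dict[str, Any]) -> Dict[str, Any]:
        record[self.output_key] = statistics.mean(record[k] for k in self.input_keys)
        return record


record = {"wer_0": 0.2, "wer_1": 0.4, "wer_2": 0.3}
mean_key = create_output_key("Mean", "wer")
record = Mean(input_keys=["wer_0", "wer_1", "wer_2"], output_key=mean_key)(record)
print(record[mean_key])  # ~0.3
```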

Note that the expected values for the BERTScore Dissimilarity and WER scores in the GSR integration tests are now different because the semantic perturbation transforms use the latest NumPy APIs for random number generation. See this commit from #215.
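To make the source of the numeric drift concrete, the sketch below contrasts NumPy's legacy global-state RNG with the newer Generator API. Even with the same seed, the two produce different random streams, so any score that depends on which characters or whitespace get perturbed can shift without anything being wrong.

```python
import numpy as np

seed = 5

# Legacy global-state API (the style the old implementation relied on).
np.random.seed(seed)
legacy_draws = np.random.uniform(size=3)

# Newer Generator-based API (the style the perturbation transforms use now).
rng = np.random.default_rng(seed)
new_draws = rng.uniform(size=3)

# Same seed, different streams: the perturbed text (and thus BERTScore
# Dissimilarity / WER) can change even though the logic is equivalent.
print(legacy_draws)
print(new_draws)
```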

To verify that the numerical scores are changing solely because of the new random number generation APIs, and not because of bugs in my re-implementation of GSR, I did the following:

  1. Run the evaluate_sample integration tests on the current main branch code and record the scores that are generated.
  2. Run the evaluate_sample integration tests using the new Transform-based implementation, but change the code in semantic_perturbations.py so that it uses the old numpy and random RNG calls.
  3. Check that the scores generated by step 1 match those generated by step 2 (see the sketch after this list).
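A minimal sketch of the step-3 comparison is below; the score values are made-up placeholders, and in practice they would come from the two integration test runs described above.

```python
import math

# Placeholder numbers; the real values come from the two test runs.
scores_main_branch = {"word_error_rate": 0.123, "bertscore_dissimilarity": 0.456}
scores_transform_impl = {"word_error_rate": 0.123, "bertscore_dissimilarity": 0.456}

for name, expected in scores_main_branch.items():
    # Fail loudly if any score drifted beyond floating-point noise.
    assert math.isclose(scores_transform_impl[name], expected, abs_tol=1e-6), name
```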

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

danielezhu commented 6 months ago

LGTM. The semantic robustness integration tests seem to be taking significantly longer with the changes in this PR. Did you notice this locally as well?

Just saw that the integ tests timed out on CodeBuild. This doesn't happen locally; I can run all of the evaluate_sample test cases in several minutes, and the evaluate test takes around 8 minutes. We noticed this extremely long runtime when Bilal first submitted the PR that added the baselining logic to GSR, but I assumed it was caused by having multiple evaluate test cases. To reduce the runtime, we changed the integ tests to cover only a single perturbation instead of all three. The fact that evaluate_sample is taking so long, though, is worrying and strange.