aws / fmeval

Foundation Model Evaluations Library
http://aws.github.io/fmeval
Apache License 2.0

feat: update implementation of GeneralSemanticRobustness to use Transform-based approach #222

Closed danielezhu closed 6 months ago

danielezhu commented 6 months ago

Description of changes: This PR updates the GeneralSemanticRobustness algorithm so that its evaluation logic is implemented using Transforms. This PR is analogous to #214, but for General Semantic Robustness (GSR).
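For readers unfamiliar with the pattern, here is a minimal sketch of the general idea behind a Transform-based evaluation: each step reads some keys from a per-sample record dict and writes new keys, and the algorithm becomes a chain of such steps. This is not the actual fmeval API; all class and function names below are hypothetical and purely illustrative.

```python
from typing import Any, Dict, List

class RecordTransform:
    """Hypothetical base class: reads input_keys from a record and writes output_keys."""

    def __init__(self, input_keys: List[str], output_keys: List[str]):
        self.input_keys = input_keys
        self.output_keys = output_keys

    def __call__(self, record: Dict[str, Any]) -> Dict[str, Any]:
        raise NotImplementedError


class PerturbText(RecordTransform):
    """Writes a perturbed copy of the model input into the record."""

    def __call__(self, record: Dict[str, Any]) -> Dict[str, Any]:
        text = record[self.input_keys[0]]
        record[self.output_keys[0]] = text.swapcase()  # stand-in for a real perturbation
        return record


def run_pipeline(record: Dict[str, Any], steps: List[RecordTransform]) -> Dict[str, Any]:
    """Apply each transform to the record in sequence."""
    for step in steps:
        record = step(record)
    return record


record = run_pipeline(
    {"model_input": "The quick brown fox"},
    [PerturbText(input_keys=["model_input"], output_keys=["perturbed_input"])],
)
print(record["perturbed_input"])
```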

The diff shows that many files have changed, but all of the changes to code unrelated to GSR come from renaming two constants (RANDOM_UPPER_CASE -> RANDOM_UPPERCASE and WHITESPACE_ADD_REMOVE -> ADD_REMOVE_WHITESPACE).

The main files to review are the following:

Aside from these main changes, I've introduced a new util function create_output_key and a new Mean transform. Both of these are used by GSR.
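As a rough illustration of what a Mean transform and a create_output_key helper might do, here is a hedged sketch in the same hypothetical style as above; the real fmeval implementations may differ in signature and behavior.

```python
import statistics
from typing import Any, Dict, List


def create_output_key(transform_name: str, *suffixes: Any) -> str:
    """Hypothetical helper: build a deterministic record key, e.g. 'Mean(wer)'."""
    return f"{transform_name}({','.join(str(s) for s in suffixes)})"


class Mean:
    """Hypothetical Mean transform: averages several numeric record fields
    into a single output field."""

    def __init__(self, input_keys: List[str], output_key: str):
        self.input_keys = input_keys
        self.output_key = output_key

    def __call__(self, record: Dict[str, Any]) -> Dict[str, Any]:
        record[self.output_key] = statistics.mean(record[k] for k in self.input_keys)
        return record


record = {"wer_0": 0.2, "wer_1": 0.4, "wer_2": 0.3}
mean_key = create_output_key("Mean", "wer")
record = Mean(input_keys=["wer_0", "wer_1", "wer_2"], output_key=mean_key)(record)
print(record[mean_key])  # ~0.3
```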

Note that the expected values for the BERTScore Dissimilarity and WER scores in the GSR integration tests are now different because the semantic perturbation transforms use the latest NumPy APIs for random number generation. See this commit from #215.
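To make the source of the numeric drift concrete, the sketch below contrasts NumPy's legacy global-state RNG with the newer Generator API. Even with the same seed, the two produce different random streams, so any score that depends on which characters or whitespace get perturbed can shift without anything being wrong.

```python
import numpy as np

seed = 5

# Legacy global-state API (the style the old implementation relied on).
np.random.seed(seed)
legacy_draws = np.random.uniform(size=3)

# Newer Generator-based API (the style the perturbation transforms use now).
rng = np.random.default_rng(seed)
new_draws = rng.uniform(size=3)

# Same seed, different streams: the perturbed text (and thus BERTScore
# Dissimilarity / WER) can change even though the logic is equivalent.
print(legacy_draws)
print(new_draws)
```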

To verify that the numerical scores are changing solely because of the new random number generation APIs, and not because of bugs in my re-implementation of GSR, I did the following:

  1. Run the evaluate_sample integration tests on the current main branch code and record the scores that are generated.
  2. Run the evaluate_sample integration tests using the new Transform-based implementation, but change the code in semantic_perturbations.py so that it uses the old numpy and random RNG calls.
  3. Check that the scores generated by step 1 match those generated by step 2 (see the sketch after this list).
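A minimal sketch of the step-3 comparison is below; the score values are made-up placeholders, and in practice they would come from the two integration test runs described above.

```python
import math

# Placeholder numbers; the real values come from the two test runs.
scores_main_branch = {"word_error_rate": 0.123, "bertscore_dissimilarity": 0.456}
scores_transform_impl = {"word_error_rate": 0.123, "bertscore_dissimilarity": 0.456}

for name, expected in scores_main_branch.items():
    # Fail loudly if any score drifted beyond floating-point noise.
    assert math.isclose(scores_transform_impl[name], expected, abs_tol=1e-6), name
```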

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

danielezhu commented 6 months ago

LGTM. The semantic robustness integration tests seem to be taking significantly longer with the changes in this PR. Did you notice this locally as well?

Just saw that the integ tests timed out on CodeBuild. This doesn't happen locally; I can run all of the evaluate_sample test cases in several minutes, and the evaluate test takes around 8 minutes. We noticed this extremely long runtime when Bilal first submitted the PR that added the baselining logic to GSR, but I assumed it was caused by having multiple evaluate test cases. To reduce the runtime, we changed the integ tests to cover only a single perturbation instead of all three. The fact that evaluate_sample is taking so long, though, is worrying and strange.