Update the implementation of the SummarizationAccuracy evaluation algorithm so that it uses the Transform/TransformPipeline approach.
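The Transform/TransformPipeline pattern can be sketched as follows. This is a minimal, self-contained illustration, not fmeval's actual implementation: the class bodies, the placeholder scoring logic, and the exact signatures here are assumptions for demonstration; only the names `Transform`, `TransformPipeline`, `MeteorScore`, and `execute_record` come from the PR.

```python
from typing import Any, Dict, List


class Transform:
    """Sketch of a record-level transform: reads input keys from a
    record dict and writes computed output keys back into it."""

    def __init__(self, input_keys: List[str], output_keys: List[str]):
        self.input_keys = input_keys
        self.output_keys = output_keys

    def __call__(self, record: Dict[str, Any]) -> Dict[str, Any]:
        raise NotImplementedError


class MeteorScore(Transform):
    """Stand-in for the real METEOR transform; the scoring logic here
    is a placeholder (exact-match), not the actual METEOR metric."""

    def __call__(self, record: Dict[str, Any]) -> Dict[str, Any]:
        target, model_output = (record[k] for k in self.input_keys)
        record[self.output_keys[0]] = float(target == model_output)
        return record


class TransformPipeline:
    """Applies a sequence of transforms to a single record."""

    def __init__(self, transforms: List[Transform]):
        self.transforms = transforms

    def execute_record(self, record: Dict[str, Any]) -> Dict[str, Any]:
        for transform in self.transforms:
            record = transform(record)
        return record


pipeline = TransformPipeline(
    [MeteorScore(input_keys=["target", "model_output"], output_keys=["meteor"])]
)
out = pipeline.execute_record({"target": "hi", "model_output": "hi"})
assert out["meteor"] == 1.0
```

The appeal of the pattern is that each metric becomes a small, composable unit, and the pipeline can apply the same sequence of transforms to every record in a dataset.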
Add minor new functions such as get_dataset_configs and execute_record, alongside their corresponding unit tests.

Remove deepcopy from Transform.__init__, as it conflicts with Ray serialization and didn't provide much value to begin with (and was potentially unexpected/unintuitive).
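The behavioral difference can be shown with a small sketch. These two classes are hypothetical stand-ins, not fmeval code; they only illustrate why deep-copying constructor arguments is surprising (callers expect shared references) and why it can break when Ray pickles objects whose attributes aren't deep-copyable.

```python
import copy
from typing import List


class TransformWithDeepcopy:
    """Old (removed) behavior: constructor args were deep-copied,
    which surprises callers and can fail for non-copyable objects."""

    def __init__(self, input_keys: List[str]):
        self.input_keys = copy.deepcopy(input_keys)


class TransformNoCopy:
    """New behavior: arguments are stored as-is."""

    def __init__(self, input_keys: List[str]):
        self.input_keys = input_keys


keys = ["model_output"]
assert TransformWithDeepcopy(keys).input_keys is not keys  # silently copied
assert TransformNoCopy(keys).input_keys is keys  # caller's list is shared
```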
The largest diff in this PR is in the unit tests for SummarizationAccuracy. Because a lot of code was edited, deleted, or moved, reading the diff may not be the best way to view the changes. I'd suggest jumping to the file directly and reading through the tests from a fresh perspective, since so much has changed.
The unit tests for get_meteor_score and get_rouge_score have been moved to test_summarization_accuracy_metrics.py and have been adapted to test the MeteorScore and RougeScore transforms. Note that BertScore isn't tested because unit tests that validate numerical values already exist for the BertscoreModel helper model.
Also note that there was some unintended behavior in the original unit tests for get_rouge_score: the same config was passed for every parametrized test case, so the rouge_type was always "rouge2" instead of varying with the test case. This has been fixed in my new unit tests, and as a result, some of the expected values have changed.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.