chore: remove XSUM dataset from example notebook and integration tests

Description of changes: This PR is a follow-up to #191 and removes the last traces of the XSUM dataset from the codebase. The integration tests that used XSUM now use Gigaword, and their expected values have been updated accordingly.

This PR also updates all of the integration tests so that ray.shutdown() is called between the tests for each evaluation algorithm. This cleans up resources between tests, and has reduced the maximum disk usage during testing from ~18 GB to ~6 GB.

Lastly, this PR moves the initialization of the SummarizationAccuracy object in test_summarization_accuracy.py from the top of the file into the test method. This is required because code at the top level of every test file runs at the very start of testing, before any tests are executed. As a result, the BertscoreHelperModel actor created by the SummarizationAccuracy object is also created right at the beginning. The first call to ray.shutdown() then cleans up the BertscoreHelperModel actor, so by the time the summarization accuracy integ test runs, the actor it expects no longer exists and the test fails. Creating the object inside the test method means the actor is created only after the preceding ray.shutdown() calls have already happened.
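The failure mode described above can be reproduced with a small sketch that does not depend on Ray. Everything in it is a hypothetical stand-in: `FakeRuntime` mimics a runtime whose `shutdown()` destroys all actors, and `FakeEvalAlgorithm` mimics an object that creates a helper actor in its constructor. Neither is the Ray API or the fmeval code; the sketch only illustrates why an object built at module level breaks once shutdown is called between tests.

```python
class FakeRuntime:
    """Hypothetical stand-in for a Ray-like runtime (not the Ray API)."""

    def __init__(self):
        self.actors = set()

    def create_actor(self, name):
        self.actors.add(name)
        return name

    def shutdown(self):
        # Like a runtime shutdown, this destroys every live actor.
        self.actors.clear()


class FakeEvalAlgorithm:
    """Hypothetical stand-in for an object that creates a helper actor on construction."""

    def __init__(self, runtime):
        self.runtime = runtime
        self.actor = runtime.create_actor("helper_model")

    def evaluate(self):
        if self.actor not in self.runtime.actors:
            raise RuntimeError("helper actor no longer exists")
        return "ok"


runtime = FakeRuntime()

# Old pattern: built at module level, before any test has run.
module_level_algo = FakeEvalAlgorithm(runtime)


def check_module_level():
    # Relies on an actor that an earlier shutdown may have destroyed.
    return module_level_algo.evaluate()


def check_in_test():
    # New pattern: the object (and its actor) is created inside the test.
    algo = FakeEvalAlgorithm(runtime)
    return algo.evaluate()


def run_suite():
    results = {}
    runtime.shutdown()  # cleanup after an earlier algorithm's tests
    try:
        results["module_level"] = check_module_level()
    except RuntimeError as err:
        results["module_level"] = f"failed: {err}"
    runtime.shutdown()  # cleanup between tests
    results["in_test"] = check_in_test()
    runtime.shutdown()
    return results
```

Calling `run_suite()` shows the module-level object failing after the first shutdown, while the object created inside the test succeeds because its actor is created after that shutdown.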

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

aws / fmeval

chore: remove XSUM dataset from example notebook and integration tests #192