Compare scores against hand-computed ground truth for the downloaded benchmarks.
Fix a bug in mislabeling detection and shortcut detection: the explainers are now initialized with the modified datasets.
Rename `benchmark.train_dataset` and `benchmark.clean_dataset` to `benchmark.base_dataset` in the benchmarks that distinguish between a train dataset and a base dataset. Benchmarks like `ClassDetection` only have `train_dataset`.
Rename the `train_dataset` parameter of `benchmark.process_dataset` to `dataset`, since we now sometimes process `base_dataset` instead of `train_dataset`.
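The renames above can be sketched as follows. This is a minimal illustration of the new naming, not the actual benchmark classes; everything except the `base_dataset` attribute and the `dataset` parameter name is hypothetical.

```python
# Hypothetical sketch of the renamed attribute and parameter.
# Only the names base_dataset and dataset reflect this change;
# the class body is illustrative, not the real library code.

class Benchmark:
    def __init__(self, base_dataset):
        # Previously exposed as train_dataset / clean_dataset in
        # benchmarks that distinguish the two; now unified.
        self.base_dataset = base_dataset

    def process_dataset(self, dataset):
        # Parameter was formerly called train_dataset; renamed because
        # base_dataset may also be passed through this method.
        return [x for x in dataset]

bench = Benchmark(base_dataset=[0, 1, 2])
processed = bench.process_dataset(bench.base_dataset)
```

Callers that passed `train_dataset=` as a keyword argument to `process_dataset` would need to switch to `dataset=`.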
Minor issue:
If the order of `test_shortcut_detection_download_sanity_checks` and `test_shortcut_detection_download` is swapped, they fail. The tests pass in the current version, and also when run independently. This does not affect the validity of our tests.