Compare scores against hand-computed ground truth for the downloaded benchmarks.
Fix a bug in mislabeling detection and shortcut detection: the explainers are now initialized with the modified datasets.
Rename `benchmark.train_dataset` and `benchmark.clean_dataset` to `benchmark.base_dataset` in the benchmarks that distinguish between a train dataset and a base dataset. Benchmarks like `ClassDetection` only have `train_dataset`.
Rename the `train_dataset` parameter of `benchmark.process_dataset` to `dataset`, since we now sometimes process `base_dataset` instead of `train_dataset`.
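The renames above can be sketched as follows. This is a minimal illustration of the new naming, not the actual benchmark classes; everything except the `base_dataset` attribute and the `dataset` parameter name is hypothetical.

```python
# Hypothetical sketch of the renamed attribute and parameter.
# Only the names base_dataset and dataset reflect this change;
# the class body is illustrative, not the real library code.

class Benchmark:
    def __init__(self, base_dataset):
        # Previously exposed as train_dataset / clean_dataset in
        # benchmarks that distinguish the two; now unified.
        self.base_dataset = base_dataset

    def process_dataset(self, dataset):
        # Parameter was formerly called train_dataset; renamed because
        # base_dataset may also be passed through this method.
        return [x for x in dataset]

bench = Benchmark(base_dataset=[0, 1, 2])
processed = bench.process_dataset(bench.base_dataset)
```

Callers that passed `train_dataset=` as a keyword argument to `process_dataset` would need to switch to `dataset=`.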
Minor issue:
If the order of `test_shortcut_detection_download_sanity_checks` and `test_shortcut_detection_download` is swapped, they fail. The tests pass in the current version, and also when run independently. This does not affect the validity of our tests.