Closed — dilyabareeva closed this issue 2 months ago
What do you think about this kind of classification of benchmarks for Quanda:
1. Leave-k-out Counterfactuals
2. ML-Related Tasks
3. Input Dependence Sanity Checks
This would correspond to grouping Model Randomization and TopKOverlap, because both test the explanation scheme's dependence on its inputs. We would then group dataset cleaning with the "localization" benchmarks and call them "ML tasks" — there is probably a better name for that group.
These could be Figures 1a, 1b, and 1c. We can show spider plots, mislabeling-detection curves, or tables in these subfigures, with different datasets color-coded. These subfigures don't need to be presented in the standard academic fashion — maybe there is a better way to organize them, and they could possibly be combined into a single figure. I just think this division makes sense for the paper.
Unrelated idea: the top-k-overlap and top-k-localization metrics also produce different values as k changes. For these benchmarks, we could generate a plot with one line per explainer as k = 1, 2, 3, 4, ...
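A minimal sketch of how such a k-sweep could be computed. The `top_k_overlap` helper and the synthetic attribution scores below are assumptions for illustration, not Quanda's actual API; the resulting list of overlap values per k is what each explainer's line in the proposed plot would show.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: attribution scores over 100 training samples
# from a reference explainer and a second, correlated explainer.
scores_ref = rng.normal(size=100)
scores_b = 0.8 * scores_ref + 0.2 * rng.normal(size=100)

def top_k_overlap(a, b, k):
    """Fraction of shared indices among the top-k attributions of a and b."""
    top_a = set(np.argsort(a)[-k:])
    top_b = set(np.argsort(b)[-k:])
    return len(top_a & top_b) / k

# Sweep k = 1..20; one such curve per explainer gives the line plot.
ks = range(1, 21)
overlaps = [top_k_overlap(scores_ref, scores_b, k) for k in ks]
```

Plotting `overlaps` against `ks` (e.g. with `matplotlib.pyplot.plot`) for each explainer would then produce one line per explainer, as suggested above.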
We need a nice-looking, attention-grabbing Figure 1 that encapsulates the idea of our library. Some features it might have: