AI4LIFE-GROUP / OpenXAI

OpenXAI: Towards a Transparent Evaluation of Model Explanations
https://open-xai.github.io/
MIT License
227 stars 37 forks

Does `evaluator.evaluate(metric="PGI")` actually compute PGU? #28

Open colorlace opened 11 months ago

colorlace commented 11 months ago

As the comment in the code snippet below states, the features in the top-K are kept static. For PGI we want the opposite. We want the features in the top-K to be perturbed and the rest to remain static.

        # keeping features static that are in top-K based on feature mask
        perturbed_samples = original_sample * feature_mask + perturbations * (~feature_mask)

(code snippet from NormalPerturbation.get_perturbed_inputs in explainers/catalog/perturbation_methods.py)
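
For reference, here is a minimal sketch (made-up tensors, not the library's code) of what I would expect the PGI perturbation to look like if feature_mask were True for the top-K features:

        import torch

        # hypothetical example: 5 features, the top-2 marked True in the mask
        original_sample = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])
        perturbations = original_sample + torch.randn(5) * 0.1
        feature_mask = torch.tensor([True, True, False, False, False])

        # for PGI: perturb the top-K features, keep the others static
        expected_pgi_input = original_sample * (~feature_mask) + perturbations * feature_mask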

tazitoo commented 11 months ago

Created a unit test that shows a mismatch between the top-k mask and the location of perturbation in this fork.

danwley commented 11 months ago

> As the comment in the code snippet below states, the features in the top-K are kept static. For PGI we want the opposite. We want the features in the top-K to be perturbed and the rest to remain static.
>
>         # keeping features static that are in top-K based on feature mask
>         perturbed_samples = original_sample * feature_mask + perturbations * (~feature_mask)
>
> (code snippet from NormalPerturbation.get_perturbed_inputs in explainers/catalog/perturbation_methods.py)

Thanks all for pointing out the issues!

@colorlace is correct that for PGI we want to perturb the top-K features. The code in evaluator.py is correct and supports this; it depends solely on what mask is passed in via the input_dict variable.

The convention is that the mask should contain 0s for the top-K features and 1s for non-top-K features (so that perturbed_samples matches original_sample where feature_mask is high).

We have updated the comment to say: "keeping features static where the feature mask is high"
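
For illustration, a minimal example with made-up values (not from the repo's tests) showing how that convention interacts with the snippet above:

        import torch

        original_sample = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])
        perturbations = torch.tensor([9.0, 9.0, 9.0, 9.0, 9.0])

        # convention: False (0) for the top-K features, True (1) for the rest;
        # here the top-2 features are indices 0 and 1
        feature_mask = torch.tensor([False, False, True, True, True])

        # keeping features static where the feature mask is high
        perturbed_samples = original_sample * feature_mask + perturbations * (~feature_mask)
        # -> tensor([9., 9., 3., 4., 5.]): top-K perturbed, the rest unchanged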

danwley commented 11 months ago

> Created a unit test that shows a mismatch between the top-k mask and the location of perturbation in this fork.

@tazitoo thanks for the unit test implementation; it's very helpful. The problem is in the generate_mask function.

We have updated the generate_mask function to set topk features to 0 in the mask and features outside the topk to 1. This way, in the code snippet @colorlace provided, perturbed_sample will be equal to original_sample for features outside the topk.

We have also updated the function to consider absolute value when computing top-k.
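
Roughly, the updated behaviour amounts to the following sketch (illustrative only, assuming a 1-D attribution vector; the actual generate_mask implementation may differ in details):

        import torch

        def generate_mask(attribution: torch.Tensor, top_k: int) -> torch.Tensor:
            """Boolean mask: False for the top-k features (by absolute
            attribution), True for every other feature."""
            # rank features by |attribution| so negative attributions count too
            top_indices = torch.topk(attribution.abs(), top_k).indices
            mask = torch.ones_like(attribution, dtype=torch.bool)
            mask[top_indices] = False  # top-k features get 0 -> they will be perturbed
            return mask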

As for the unit test, it won't pass as currently written, since we actually want:

        assert mask.sum() == len(x) - topk

The mask should be high for features outside the top-k, i.e. it should be True for each feature being 'masked'.

After making that change, the tests pass on my end, including when we have negative feature attributions. Thanks again for pointing out this issue!
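
For completeness, here is roughly what the corrected check looks like (hypothetical values, using the generate_mask sketch above rather than @tazitoo's exact test):

        import torch

        x = torch.tensor([0.2, -0.9, 0.1, 0.7, -0.3])  # example attributions
        topk = 2

        mask = generate_mask(x, topk)

        # high (True) for features outside the top-k ...
        assert mask.sum() == len(x) - topk
        # ... and low (False) for the top-k by absolute value: indices 1 and 3
        assert not mask[1] and not mask[3]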

tazitoo commented 11 months ago

Thanks for the reply. If there was a bug in the mask function, then were the results in the paper and on the leaderboard erroneous? ...it would have resulted in PGI and PGU being inverted...?