Applying Shapley Value Analysis: Visualization Plots and Number of Features to Consider

HellenNamulinda commented 8 months ago

We are considering Shapley Value Analysis for the interpretability of our models. One crucial aspect is determining the optimal visualization plots for Shapley values and deciding the number of features to analyze.

Besides the waterfall plot, summary plot, and beeswarm plot (shared in 902c5a95c4dd1b21d07259a02557817f8bc2208e), additional visualizations such as force plots and heatmaps may provide further insight into the models.

Also, we initially started with 22 features, which is few enough to analyze in full. As more features are integrated, analyzing all of them will become more challenging. Should we consider analyzing only the top 10 or 15?
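One common way to pick a "top 10/15" is to rank features by their mean absolute SHAP value over the dataset. The sketch below assumes the SHAP values are already available as a samples-by-features array; the function name `top_k_features` and the toy numbers are illustrative only and do not come from the xai4chem code.

```python
import numpy as np


def top_k_features(shap_values, feature_names, k=10):
    """Return the k feature names with the largest mean absolute SHAP value."""
    shap_values = np.asarray(shap_values, dtype=float)
    # Global importance: average magnitude of each feature's contribution.
    importance = np.abs(shap_values).mean(axis=0)
    # Sort descending; a stable sort keeps tied features in column order.
    order = np.argsort(-importance, kind="stable")[:k]
    return [feature_names[i] for i in order]


# Toy SHAP matrix: 3 samples x 4 features (made-up numbers).
toy_shap = np.array([
    [0.1, -0.5, 0.0, 0.2],
    [-0.2, 0.4, 0.0, 0.1],
    [0.3, -0.6, 0.1, 0.0],
])
toy_names = ["f0", "f1", "f2", "f3"]
top2 = top_k_features(toy_shap, toy_names, k=2)
print(top2)
```

The same ranking can then decide which features get a dedicated plot, while the remaining ones are summarized together.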

@miquelduranfrigola, your input and expertise would be greatly appreciated.

miquelduranfrigola commented 8 months ago

Thanks @HellenNamulinda.

I am quite happy with the default visualization capabilities of the SHAP library. However, I've done some extra work in this direction (for example, using tiles for a 2D visualization of Shapley values) that we can recycle, but this is just "cosmetics". I think the default Shapley plots are already quite informative. Another question would be whether we can find ways to map this information onto chemicals; this could be useful. Let's discuss in the meeting.

As for the increasing number of features, I completely agree. We need to limit the number of features for interpretation. In my opinion, feature selection to restrict to, say, up to 100 features for Shapley analysis would make sense. I've never tried this package, but it looks good: https://github.com/AutoViML/featurewiz In any case, as always, let's first start with a good-old k-best feature selector (e.g. 100-best) and then we take it from there.
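As a rough illustration of the k-best idea, here is a minimal univariate selector written with NumPy only. It scores each feature by its absolute Pearson correlation with the target and keeps the k highest-scoring columns, in the spirit of scikit-learn's `SelectKBest`. The function name `k_best_by_correlation`, the choice of correlation as the score, and the synthetic data are all assumptions made for this sketch, not the project's actual selector.

```python
import numpy as np


def k_best_by_correlation(X, y, k=100):
    """Keep the k columns of X most correlated (in absolute value) with y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    numerator = (Xc * yc[:, None]).sum(axis=0)
    denominator = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant columns have zero variance; they get a score of 0.
    corr = np.divide(numerator, denominator,
                     out=np.zeros_like(numerator), where=denominator > 0)
    scores = np.abs(corr)
    k = min(k, X.shape[1])
    # Indices of the k best columns, returned in original column order.
    keep = np.sort(np.argsort(-scores, kind="stable")[:k])
    return keep, scores


# Synthetic data: only columns 1 and 4 actually drive the target.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 6))
y_toy = 3 * X_toy[:, 1] - 3 * X_toy[:, 4] + 0.1 * rng.normal(size=200)

keep, scores = k_best_by_correlation(X_toy, y_toy, k=2)
X_reduced = X_toy[:, keep]
print(keep, X_reduced.shape)
```

Shapley analysis would then run on `X_reduced` only, so the plots cover at most k features.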

HellenNamulinda commented 5 months ago

Hello @Miquel, initially we were saving three plots: the bar plot and beeswarm plot for all features, and a waterfall plot for one sample.

But now several interpretability plots are generated: waterfall plots for 5 samples (at the 0th, 25th, 50th, 75th and 100th percentiles) and scatter plots for the 5 most important features (https://github.com/ersilia-os/xai4chem/pull/13/commits/c6ee6187c09ccf3fbc753e14654fda57356f5ae7).
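The percentile-based choice of samples can be sketched as follows. The thread does not say which quantity the percentiles refer to; this example assumes they are percentiles of the model's predicted values, and the helper name `samples_at_percentiles` is hypothetical.

```python
import numpy as np


def samples_at_percentiles(values, percentiles=(0, 25, 50, 75, 100)):
    """Return one sample index per percentile of `values` (nearest rank)."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")  # sample indices, lowest value first
    n = len(values)
    # Map each percentile to a position in the sorted order.
    positions = [int(round(p / 100 * (n - 1))) for p in percentiles]
    return [int(order[pos]) for pos in positions]


# Assumed input: predicted values for five molecules (made-up numbers).
toy_predictions = np.array([0.9, 0.1, 0.5, 0.3, 0.7])
picked = samples_at_percentiles(toy_predictions)
print(picked)
```

Each returned index could then be passed to a per-sample plot, for example a SHAP waterfall plot for that row of the explanation.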

The last piece, which I am still working on, is mapping the SHAP values of fingerprints back to chemical structures.

miquelduranfrigola commented 4 months ago

Thanks @HellenNamulinda, this is useful and looks good to me.

HellenNamulinda commented 4 months ago

Hello @miquelduranfrigola, the scatter plots show the distribution of feature values and SHAP values across the entire dataset. So drawing the bits for those top features, as you asked, will require identifying molecules that actually have these features present (on bits) and then drawing the bits.

What I was doing initially (this PR) was to draw the top features (bits) as well as highlight features, for only the 5 samples/molecules (at the 5 percentiles).

In that case, if you still want us to draw the top features across the entire dataset, I will adjust.

You can find sample visualizations in the slides.

miquelduranfrigola commented 4 months ago

Thanks @HellenNamulinda

I think your approach (i.e. highlighting features for the 5 molecules at the percentiles) makes sense.

I feel that, in addition it would be good to draw the top 5 or top 10 features across the entire dataset. I'd like to see if the most discriminative features are indeed informative from a chemical perspective.

Finally, about the visualization in the slides (slide number 7), I see that all atoms map to a top feature (either blue or red). Is this the case for all molecules, or is this just coincidental?

HellenNamulinda commented 4 months ago

Hello @miquelduranfrigola, Thanks for the observations.

> I feel that, in addition it would be good to draw the top 5 or top 10 features across the entire dataset. I'd like to see if the most discriminative features are indeed informative from a chemical perspective.

I will work on this.

> Finally, about the visualization in the slides (slide number 7), I see that all atoms map to a top feature (either blue or red). Is this the case for all molecules, or is this just coincidental?

It is not the case for all molecules. I have added another molecule; see slides 8 to 10.

Actually, what might cause most atoms to be highlighted is that, when selecting features based on SHAP values, we currently select the top 5 valid features for that particular molecule (those that are nonzero).

For example, if the top 5 features (based on SHAP values) map to off bits, they won't be among those selected to be mapped to the compounds. Maybe this is wrong, because the absence of those substructures (off bits) is what contributes to the high SHAP values.
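The difference between the two selection rules can be seen on a toy molecule. In this sketch, features are ranked by absolute SHAP value (an assumption; the thread only says "based on shap values"), and the `on_bits_only` flag toggles whether off bits are filtered out before taking the top k. The function name and the numbers are illustrative, not taken from the PR.

```python
import numpy as np


def top_bits_for_molecule(shap_row, fingerprint_row, k=5, on_bits_only=True):
    """Rank one molecule's bits by |SHAP| and return the indices of the top k."""
    shap_row = np.asarray(shap_row, dtype=float)
    fingerprint_row = np.asarray(fingerprint_row)
    order = np.argsort(-np.abs(shap_row), kind="stable")
    if on_bits_only:
        # Only on bits can be drawn on the structure, so drop off bits here.
        order = [i for i in order if fingerprint_row[i] != 0]
    return [int(i) for i in order[:k]]


# One molecule, six bits: bits 0 and 3 are off but carry large SHAP values.
toy_shap_row = np.array([0.40, -0.35, 0.05, 0.30, -0.02, 0.10])
toy_fp_row = np.array([0, 1, 1, 0, 1, 1])

on_only = top_bits_for_molecule(toy_shap_row, toy_fp_row, k=3, on_bits_only=True)
any_bit = top_bits_for_molecule(toy_shap_row, toy_fp_row, k=3, on_bits_only=False)
print(on_only, any_bit)
```

With the on-bit filter, weakly contributing bits are promoted into the top k, which is one way nearly every atom can end up highlighted. Without the filter, the influential off bits are kept, but they have no atoms to draw and would need to be reported some other way, for instance in a table.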

miquelduranfrigola commented 4 months ago

Thanks, this is very clear.

HellenNamulinda commented 4 months ago

Hello @miquelduranfrigola,

For drawing the top bits across the entire dataset, choosing the one sample with the highest count might not give a good overview of the fragments available. So I agree with drawing the important fingerprint bits for about 10 samples that have these top features, rather than choosing a single sample, to better understand the fragments at each bit.
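Picking the example molecules for a given bit could look like the sketch below. It assumes the fingerprints are stored as a samples-by-bits array and draws a random subset when more than 10 molecules have the bit on. The helper name `molecules_with_bit`, the random sampling, and the fixed seed are choices made for this illustration only.

```python
import numpy as np


def molecules_with_bit(fingerprints, bit, max_molecules=10, seed=0):
    """Return indices of up to `max_molecules` molecules that have `bit` on."""
    fingerprints = np.asarray(fingerprints)
    candidates = np.flatnonzero(fingerprints[:, bit] != 0)
    if len(candidates) <= max_molecules:
        return [int(i) for i in candidates]
    # More candidates than needed: take a reproducible random subset.
    rng = np.random.default_rng(seed)
    chosen = rng.choice(candidates, size=max_molecules, replace=False)
    return sorted(int(i) for i in chosen)


# Toy fingerprint matrix: 4 molecules x 3 bits.
toy_fps = np.array([
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
])
examples = molecules_with_bit(toy_fps, bit=1)
subset = molecules_with_bit(toy_fps, bit=1, max_molecules=2)
print(examples, subset)
```

Drawing the fragment for each selected molecule is left out here because it depends on the fingerprint type, which the thread does not specify.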

ersilia-os / xai4chem #7