NorskRegnesentral / shapr

Explaining the output of machine learning models with more accurately estimated Shapley values
https://norskregnesentral.github.io/shapr/

Computational Issues - Shapr #415

Closed hanneleer closed 1 week ago

hanneleer commented 1 week ago

Dear all,

I was wondering whether someone has already tried to run the shapr package on more than 10 features when working with a random forest model. I would like to apply it to a dataset with at least 20 features, but I am running into computational limitations.

Any insights or suggestions would be greatly appreciated! Thanks!

martinju commented 1 week ago

Hi. Can you provide some details on what specifically you have tried and what did not turn out as expected? What kind of model do you have, how many features, how many training observations and observations to explain, and which approach? A complete example with runnable code would be preferable.

The AI suggestion of parallelizing the prediction part is pointless; shapr already supports parallelization of its computations in batches.
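
As a minimal sketch (not from the thread) of how shapr's batched computations are typically parallelized via the future framework; the worker count is arbitrary and the actual explain() call is omitted:

```r
library(future)

# Set up a parallel backend before calling explain(); shapr then
# distributes its batched computations over the workers.
plan(multisession, workers = 4)

# ... call shapr::explain(...) as usual here ...

# Revert to sequential processing afterwards.
plan(sequential)
```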

hanneleer commented 1 week ago

Hi @martinju. Thanks for coming back to me!

I have 25 features with a training sample of 1,735 observations and a testing sample of 744 observations.

The main issue I am encountering is that the running time seems never-ending, and I suspect I will eventually hit memory limits in R as well. I tried using fewer features (around 6) and fewer combinations, and in those cases I do see results, but even then I already hit the memory limit in R. Is this normal? I already have access to a server with more memory than my laptop. For my analysis, I would like to run all features in shapr.

Let's say I fit a simple random forest:

```r
rf.fit <- ranger::ranger(
  Y_train ~ .,
  data = Train1,
  mtry = 12,  # number of features / 3
  max.depth = 3,
  replace = FALSE,
  min.node.size = 5,
  sample.fraction = 0.4,
  respect.unordered.factors = "order",
  importance = "permutation"
)

explainer <- shapr(Test1, rf.fit, n_combinations = 1000)

p <- mean(Y_train)
explanation <- explain(
  Test1,
  approach = "copula",
  explainer = explainer,
  prediction_zero = p,
  approximation = TRUE
)
```

Thanks a lot for your time, I really appreciate it!

martinju commented 1 week ago

Hi. Please use the GitHub version of shapr: `remotes::install_github("NorskRegnesentral/shapr")`. It has a somewhat different interface; check the examples in the readme here on GitHub or the vignettes: https://norskregnesentral.github.io/shapr/

It allows for iterative estimation, with parallelization and batch computation for reduced memory usage, and should make it possible to get some results, even though 25 features is quite a bit and will take some time for 744 observations. approach = "gaussian" is typically the fastest.
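
A rough sketch of what a call with the new interface might look like, assuming the current GitHub version and that Train1 and Test1 contain only the feature columns; argument names such as phi0, iterative, and max_n_coalitions are taken from the development documentation and may differ between versions:

```r
library(shapr)

# Sketch only: assumes rf.fit, Train1, Test1 and Y_train from the
# earlier example. The new interface takes the model, the data to
# explain and the training data directly.
explanation <- explain(
  model = rf.fit,
  x_explain = Test1,
  x_train = Train1,
  approach = "gaussian",      # typically the fastest approach
  phi0 = mean(Y_train),       # reference (expected) prediction value
  iterative = TRUE,           # iterative estimation of Shapley values
  max_n_coalitions = 1000     # cap the number of coalitions considered
)
```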

We are working on releasing the new version to CRAN. Hoping to get it done within a few weeks.

hanneleer commented 1 week ago

Hi @martinju. Thanks a lot! It works, and it also does not take that much time!

I was also wondering whether there is already functionality in the package to decide how many features to include in the beeswarm plot, something like top_k_features for the waterfall plot?

Thanks again!