bgreenwell / fastshap

Fast approximate Shapley values in R
https://bgreenwell.github.io/fastshap/

fastshap::explain() uses all cores and not those defined by registerDoParallel(cores=X) #75

Open abussalleuc opened 6 months ago

abussalleuc commented 6 months ago

Hi @bgreenwell

I am using fastshap::explain for a large dataset (4 million rows, 23 columns) on a Windows system (512 GB RAM, 48 cores).

Here is what the code looks like:

t <- fastshap::explain(
  model,                        # ranger::ranger() object
  X = train_set[, vars],        # train set used to fit the model (~1 million rows, but a wider range of predictor values)
  pred_wrapper = pfun,          # prediction function: ranger::predict()$predictions
  newdata = new_data[, vars],   # dataset to explain (~4 million rows)
  feature_names = NULL,         # predictor variables of interest (23)
  nsim = 10,
  adjust = TRUE,
  parallel = TRUE,
  .packages = c("ranger")
)
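where pfun is, roughly, the usual ranger prediction wrapper (a simplified sketch; the exact function isn't shown here):

# Sketch of the prediction wrapper: fastshap calls it with the fitted model and a
# data frame of features, and expects a numeric vector of predictions back.
pfun <- function(object, newdata) {
  predict(object, data = newdata)$predictions
}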

Due to the size of my dataset, I don't want to use all cores, to avoid memory issues, so I tried defining the parallel backend as:

cl <- makePSOCKcluster(25)
registerDoParallel(cl)

and as:

registerDoParallel(cores = 25)

In both cases I have noticed (using Task Manager > Performance) that all of the available cores, not just the ones defined above, are being used.

I tried this with data subsets of 4,000 rows, varying the number of simulations and cores, but it still uses more cores than those defined by the registerDoParallel() call (again based on Task Manager > Performance). While explain() runs and produces results with the smaller datasets, with the whole dataset it uses 100% of my RAM and the computer sometimes crashes.

In total the model, train_set, and new_data objects weigh ~5 GB, so I don't think it is a good idea to use as many clusters/cores/logical processors as possible.

Am I defining the parallel backend incorrectly? Should I instead create a foreach loop over each column with parallel = FALSE?

Thank you for your time. best, Alonso

abussalleuc commented 6 months ago

[Screenshot: Capture3] Here I'm using 500k rows and registerDoParallel(cores=10).

brandongreenwell-8451 commented 6 months ago

Thanks @abussalleuc. On a Windows system, multicore functionality in R will not work (e.g., specifying cores=25). In your case, I would try setting up the parallel backend as follows:

cl <- makeCluster(25)
registerDoParallel(cl)

Does this seem to fix the issue on your system?
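Putting it together with your original call, the whole pattern would look something like this (an untested sketch using your object names):

library(doParallel)  # also attaches parallel (makeCluster, stopCluster)

cl <- makeCluster(25)   # explicit cluster of 25 workers
registerDoParallel(cl)  # register it as the foreach backend

shap <- fastshap::explain(
  model,
  X = train_set[, vars],
  pred_wrapper = pfun,
  newdata = new_data[, vars],
  nsim = 10,
  adjust = TRUE,
  parallel = TRUE,
  .packages = c("ranger")  # load ranger on each worker
)

stopCluster(cl)  # shut the workers down when finished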

Further, explaining that many rows, even with nsim=1, is going to be terribly slow I suspect, even with massive parallel processing. I have not tested this on such a large sample, nor do I have access to that many cores, so let me know how it works out!

abussalleuc commented 6 months ago

Hi @brandongreenwell-8451 Thank you for your answer.

I was originally using makePSOCKcluster(), which (to my limited knowledge) should work on a Windows machine. Using cl <- makePSOCKcluster(25) followed by registerDoParallel(cl) would still activate all logical processors.

I tried your suggestion and the issue persists. If I use the same X (background or train set), cut new_data into smaller subsets, and run each subset separately, would this affect how the SHAP values are calculated? I understand that for each column and during each simulation, the predictor values are resampled and new predictions are calculated, but are they resampled from new_data or from the background/train set? My train set, although smaller, probably has much more variability in the predictors than new_data.
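Concretely, the splitting I have in mind would look something like the sketch below (not tested; chunk_size is just a placeholder I would tune to available memory):

# Illustrative sketch: explain new_data in row chunks and bind the results.
chunk_size <- 250000  # placeholder value; tune to available memory
row_chunks <- split(seq_len(nrow(new_data)),
                    ceiling(seq_len(nrow(new_data)) / chunk_size))

shap_list <- lapply(row_chunks, function(idx) {
  fastshap::explain(
    model,
    X = train_set[, vars],
    pred_wrapper = pfun,
    newdata = new_data[idx, vars],
    nsim = 10,
    adjust = TRUE
  )
})
shap_all <- do.call(rbind, shap_list)  # reassemble in the original row order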

My idea is to use SHAP values to explain correlations between modeled variables that share the same predictors, so my dataset is a very small sample from a much larger spatiotemporal extent.

Thank you for your time. Best, Alonso