bgreenwell / pdp

A general framework for constructing partial dependence (i.e., marginal effect) plots from various types of machine learning models in R.
http://bgreenwell.github.io/pdp

Partial dependence plot for rpartScore: probability function doesn't work, plus error with variable names containing a mathematical sign #133

Closed sofalbre closed 1 year ago

sofalbre commented 1 year ago

Hello, I am trying to compute partial dependence plots for my rpartScore model. I have tried several things but can't fix it so far.

After splitting the data into training and testing sets, I fit the tree model below. (Unfortunately, backticks mark code on this site, but my code also uses them to quote variable names that contain a mathematical sign. I hope this doesn't cause confusion throughout the question, since they turn code formatting on and off. I have doubled the backticks in the code shown here; in my actual code they are single.)

    tree <- rpartScore(Nutritional.Status.olr ~ VBT + ``VBT/L`` + ``d/r`` + SMI + ``Residual M/L`` + BMI + ``M/L`` + ``G/L`` + LMD, data = datatrain)

This works well in predicting the testing dataset; I computed a confusion matrix afterwards, etc.

Anyway, I am now generating partial dependence plots with this line of code from the pdp package:

    partial(big.tree, pred.var = "VBT", prob = T, plot = T, type = "regression", smooth = TRUE)

and I get the following image: [dpd image]

However, I would like the probabilities rather than the actual predicted value, that is, how much the variable influences the model at that point, as described e.g. here: "Single variables shows how there value affect the model, on y-axis having a negative value means for that particular value of predictor variable it is less likely to predict the correct class on that observation and having a positive value means it has positive impact on predicting the correct class. Same applies to two variable plots, color represent the intensity of affect on model." https://rpubs.com/vishal1310/QuickIntroductiontoPartialDependencePlots

If I change the line to

    partial(big.tree, pred.var = "VBT", prob = F, plot = T, type = "regression", smooth = TRUE)

nothing changes; I get the same plot.

I have also now tried:

    pred.prob <- function(object, newdata) {
      pred <- predict(object, newdata, probability = TRUE)
      prob.setosa <- attr(pred, which = "probabilities")[, "1"]
      mean(prob.setosa)
    }

    vbt <- partial(big.tree, pred.var = "VBT", plot = TRUE, pred.fun = pred.prob, type = "regression")
    vbt

which resulted in the plot [af8WX]

but not in a plot with the VBT values on the x-axis and the probability on the y-axis. Is there any way I can fix this?

Additionally, I cannot get the other variable names to work. I have tried "" and `` but the function doesn't accept them. Any ideas? I would prefer not to go back to the beginning of the analysis and rename everything, as I won't have the correct tree variable names then.

    result_VBT <- partial(big.tree, pred.var = "VBT", prob = TRUE, plot = TRUE, type = "regression", smooth = TRUE)
    result_VBT_L <- partial(big.tree, pred.var = "VBT/L", prob = TRUE, plot = TRUE, type = "regression", smooth = TRUE)
    result_d_r <- partial(big.tree, pred.var = ``d/r``, prob = TRUE, plot = TRUE, type = "regression", smooth = TRUE)
    result_SMI <- partial(big.tree, pred.var = "SMI", prob = TRUE, plot = TRUE, type = "regression", smooth = TRUE)
    result_Residual_M_L <- partial(big.tree, pred.var = "Residual M/L", prob = TRUE, plot = TRUE, type = "regression", smooth = TRUE)
    result_BMI <- partial(big.tree, pred.var = "BMI", prob = TRUE, plot = TRUE, type = "regression", smooth = TRUE)
    result_M_L <- partial(big.tree, pred.var = "M/L", prob = TRUE, plot = TRUE, type = "regression", smooth = TRUE)
    result_G_L <- partial(big.tree, pred.var = "G/L", prob = TRUE, plot = TRUE, type = "regression", smooth = TRUE)
    result_LMD <- partial(big.tree, pred.var = "LMD", prob = TRUE, plot = TRUE, type = "regression", smooth = TRUE)
    result_VBT_L <- partial(big.tree, pred.var = ``VBT/L``, prob = TRUE, plot = TRUE, type = "regression", smooth = TRUE)

    result_VBT_L <- partial(big.tree, pred.var = c(``VBT/L``), prob = TRUE, plot = TRUE, type = "regression", smooth = TRUE, newdata = df)
    Error: object 'VBT/L' not found
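The "object 'VBT/L' not found" error above is what R reports when a backtick-quoted name is evaluated as a variable rather than passed as a string. A minimal base-R sketch of the difference, using a made-up toy data frame (the values and the `L` column are hypothetical, not the poster's data):

```r
# Toy data frame with a non-syntactic column name (hypothetical values).
df <- data.frame(VBT = c(11, 12, 7), L = c(147, 162, 210))
df[["VBT/L"]] <- df$VBT / df$L

# A character string simply names the column; nothing is evaluated.
by_string <- df[["VBT/L"]]

# Backticks quote the name as a symbol, which works where the data
# frame's columns are in scope (formulas, with(), etc.).
by_backtick <- with(df, `VBT/L`)
stopifnot(identical(by_string, by_backtick))

# Outside such a scope, R looks for a variable called VBT/L in the
# calling environment, fails to find one, and raises an error.
res <- tryCatch(c(`VBT/L`), error = function(e) conditionMessage(e))
print(res)
```

So backticks belong in the model formula, while an argument that takes a column name wants a quoted string. Whether `partial()` then accepts the string form for a given model is a separate question that this sketch does not test.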

Here is an example of my data. The columns are now renamed versions of the variables used in the model above:

    selected_data
       Nutritional.Status.olr VBT vbl dr SMI residuals BMI ML GL LMD
    2 2 11 0.07482993 0.14666667 68.14410 -0.0412701853 0.001527141 0.2244898 0.5102041 2321.6374
    3 2 12 0.07384615 0.15094340 64.96813 -0.0746683103 0.001609467 0.2615385 0.4892308 2346.4617
    4 3 7 0.03333333 0.07821229 51.70707 -0.2663154538 0.001655329 0.3476190 0.4261905 1187.2612
    5 2 11 0.04782609 0.08560311 71.16723 0.0661265660 0.002495274 0.5739130 0.5586957 1452.0101
    6 2 10 0.04739336 0.08547009 55.86345 -0.1883204238 0.001796905 0.3791469 0.5545024 1624.0382
    9 2 9 0.08653846 0.16363636 75.80390 0.0157902364 0.001201923 0.1250000 0.5288462 2545.5844
    11 2 10 0.04950495 0.08849558 77.19702 0.1288989646 0.002377218 0.4801980 0.5594059 1443.0780
    12 3 6 0.05106383 0.11320755 56.60686 -0.2587833845 0.001014033 0.1191489 0.4510638 1738.2257
    13 2 9 0.07377049 0.12857143 88.49998 0.1934610387 0.001646063 0.2008197 0.5737705 2008.3499
    14 3 9 0.03982301 0.09000000 49.44048 -0.3006442848 0.001703344 0.3849558 0.4424779 1450.5647
    16 1 13 0.07142857 0.13000000 73.98986 0.0715626259 0.002052892 0.3736264 0.5494505 2126.7899
    18 1 16 0.07547170 0.14035088 66.78048 -0.0091440723 0.002158241 0.4575472 0.5377358 2365.3861
    19 2 10 0.05714286 0.11764706 58.74927 -0.1646930384 0.001567347 0.2742857 0.4857143 1909.4065
    20 3 5 0.03105590 0.06329114 54.22662 -0.2567188843 0.001330967 0.2142857 0.4906832 1080.1234
    21 3 9 0.03947368 0.09000000 64.47786 -0.0338325271 0.002241074 0.5109649 0.4385965 1259.0616
    22 3 9 0.05921053 0.11920530 67.24187 -0.0498173834 0.001558172 0.2368421 0.4967105 1849.3242
    23 3 8 0.05442177 0.12121212 55.75426 -0.2419408807 0.001249479 0.1836735 0.4489796 1866.6667
    24 2 13 0.05842697 0.12500000 55.38356 -0.1893621604 0.001878551 0.4179775 0.4674157 2010.7908
    27 3 9 0.05263158 0.12676056 NA NA NA NA 0.4152047 NA
    28 1 10 0.04975124 0.09661836 65.43038 -0.0371844924 0.002004901 0.4029851 0.5149254 1575.2719
    29 2 9 0.04627249 0.08866995 65.52536 -0.0404329435 0.001942890 0.3778920 0.5218509 1464.0592
    32 3 8 0.03478261 0.08247423 45.55781 -0.3799238222 0.001597353 0.3673913 0.4217391 1319.8530
    34 2 17 0.08292683 0.15315315 74.99949 0.1021267929 0.002343843 0.4804878 0.5414634 2452.4928
    35 2 16 0.07804878 0.14545455 67.76604 0.0007066144 0.002117787 0.4341463 0.5365854 2428.2976
    36 1 19 0.09004739 0.16521739 65.63956 -0.0270522762 0.002111363 0.4454976 0.5450237 2846.6292
    37 3 8 0.04733728 0.09523810 44.84638 -0.4397158548 0.001155422 0.1952663 0.4970414 1810.4076
    40 2 17 0.08056872 0.14655172 NA NA NA NA 0.5497630 NA
    41 1 20 0.11428571 0.20512821 56.30139 -0.2072526528 0.001502041 0.2628571 0.5571429 3900.9475
    42 3 7 0.04487179 0.08045977 56.15378 -0.2263065100 0.001335470 0.2083333 0.5576923 1533.6232
    43 1 18 0.08780488 0.15254237 76.14161 0.1172404307 0.002379536 0.4878049 0.5756098 2577.2078
    44 1 17 0.08947368 0.14782609 78.89924 0.1419553389 0.002285319 0.4342105 0.6052632 2579.8755
    45 3 7 0.04402516 0.09722222 52.21905 -0.2962301524 0.001265773 0.2012579 0.4528302 1560.3485
    46 3 5 0.02336449 0.04716981 52.20826 -0.2539721024 0.001703206 0.3644860 0.4953271 828.1893
    47 3 3 0.01408451 0.02912621 51.92880 -0.2600087350 0.001686173 0.3591549 0.4835681 500.5879
    48 3 8 0.03921569 0.06956522 65.29044 -0.0372078506 0.002030469 0.4142157 0.5637255 1243.0160
    49 3 5 0.02202643 0.04347826 61.12768 -0.0878179414 0.002115314 0.4801762 0.5066079 721.5554
    50 3 9 0.04205607 0.07964602 64.25631 -0.0463327376 0.002096253 0.4485981 0.5280374 1343.7355
    53 2 16 0.07339450 0.13675214 NA NA NA NA 0.5366972 NA
    54 3 7 0.04458599 0.09210526 61.02015 -0.1422828991 0.001460505 0.2292994 0.4840764 1461.8291
    55 3 9 0.04639175 0.09729730 60.19361 -0.1256719681 0.001780210 0.3453608 0.4768041 1531.4611
    56 3 8 0.05177994 0.10810811 58.69450 -0.1834355860 0.001382474 0.2135922 0.4789644 1731.0008
    57 2 12 0.05797101 0.12500000 52.50873 -0.2529873296 0.001656981 0.3429952 0.4637681 2048.9778
    58 2 14 0.07821229 0.15730337 59.47313 -0.1492165826 0.001622921 0.2905028 0.4972067 2597.4840
    60 1 16 0.08163265 0.14814815 74.48678 0.0888496336 0.002225635 0.4362245 0.5510204 2422.5066
    61 3 7 0.03977273 0.08139535 59.55838 -0.1502002559 0.001598011 0.2812500 0.4886364 1319.9327
    62 2 12 0.08695652 0.15189873 81.11648 0.1239603610 0.001706574 0.2355072 0.5724638 2472.7437
    63 3 9 0.04545455 0.09625668 56.19633 -0.1914694605 0.001696255 0.3358586 0.4722222 1552.9743
    64 2 9 0.06498195 0.13432836 56.78606 -0.2321180194 0.001199025 0.1660650 0.4837545 2208.5309
    65 2 14 0.07142857 0.14000000 64.03250 -0.0623813361 0.001913265 0.3750000 0.5102041 2286.1904
    66 2 12 0.05839416 0.11428571 57.44628 -0.1641560653 0.00179965

Thank you very much for any input!

sofalbre commented 1 year ago

I am now trying the DALEX package. It handles my variable names very well, and I was able to create nice partial dependence plots. I haven't been able to compute probability plots yet, though.

bgreenwell commented 1 year ago

Hi @sofalbre thanks for reaching out. pdp (as well as DALEX and iml) can construct PDPs for ANY model in R, and on any scale (e.g., probabilities). In your case, you likely need to use a suitable prediction wrapper. If you could post a small reproducible example with rpartscore (maybe using some built-in R data) I'd be happy to post a solution for you.
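To illustrate what a prediction wrapper does, here is a minimal sketch that computes partial dependence on the probability scale by hand. It uses an ordinary `rpart` classification tree on the built-in `iris` data as a stand-in, on the assumption that the model's `predict` method can return class probabilities; it is not the rpartScore model from the question. The helper names `pred_prob` and `manual_pd` are made up for this example.

```r
library(rpart)

# Classification tree on built-in data (a stand-in, not rpartScore).
fit <- rpart(Species ~ ., data = iris)

# Prediction wrapper: average predicted probability of one class.
# The (object, newdata) signature mirrors the pred.fun attempt earlier
# in this thread.
pred_prob <- function(object, newdata) {
  mean(predict(object, newdata = newdata, type = "prob")[, "setosa"])
}

# Partial dependence by hand: fix the predictor at each grid value for
# every row, predict, and average the predictions.
manual_pd <- function(object, data, var, grid, pred_fun) {
  vapply(grid, function(v) {
    data[[var]] <- v
    pred_fun(object, data)
  }, numeric(1))
}

grid <- seq(min(iris$Petal.Length), max(iris$Petal.Length), length.out = 20)
pd <- data.frame(
  Petal.Length = grid,
  yhat = manual_pd(fit, iris, "Petal.Length", grid, pred_prob)
)
plot(pd$Petal.Length, pd$yhat, type = "l",
     xlab = "Petal.Length", ylab = "Mean P(setosa)")
```

The x-axis holds the predictor values and the y-axis the averaged class probability, which is the kind of plot asked for above. A wrapper of the same shape can be passed through the `pred.fun` argument of `partial()`, but only when the model can produce probabilities in the first place.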

sofalbre commented 1 year ago

Dear @bgreenwell, thank you very much for getting back to me! I have attached a CSV file with example data and a simplified R script. Good luck and thanks again for your help! Attachments: subsample.csv, simplescript.txt

bgreenwell commented 1 year ago

@sofalbre unfortunately, at least as far as I can tell from the docs, you cannot get predicted probabilities from rpartScore, so this is a limitation of the modeling package. Let me know if I'm wrong here though.

sofalbre commented 1 year ago

Dear @bgreenwell, thanks again for checking! That's fine; at least I can save myself the time of looking for it. ;)