tengku-hanis closed this issue 2 years ago
Hi,
thank you for your issue, I hope to be able to help.
First of all, let's understand what measure is shown in the plot. It is the 1-AUC loss function. It is used instead of plain AUC because a loss function should decrease as the model improves, so lower values indicate better performance.
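To make the measure concrete, here is a minimal base R sketch of AUC and 1-AUC. The helper names `auc` and `one_minus_auc` are made up for illustration and are not DALEX functions; the AUC is computed with the rank-sum (Mann-Whitney) identity.

```r
# Illustrative helpers, not part of DALEX.
# `observed` is a 0/1 vector, `predicted` is a score for class 1.
auc <- function(observed, predicted) {
  r <- rank(predicted)                 # average ranks, so ties are handled
  n_pos <- sum(observed == 1)
  n_neg <- sum(observed == 0)
  (sum(r[observed == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# The loss shown on the plot: lower is better.
one_minus_auc <- function(observed, predicted) 1 - auc(observed, predicted)

observed  <- c(0, 0, 1, 1)
predicted <- c(0.1, 0.4, 0.35, 0.8)

auc(observed, predicted)            # 0.75
one_minus_auc(observed, predicted)  # 0.25
```

A model that ranks every positive above every negative would have AUC 1 and a 1-AUC loss of 0.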
That being said, looking at the first plot, we see that the baseline 1-AUC for this model is roughly 0.675, hence AUC is about 0.325. Because AUC is symmetric around 0.5 (swapping the positive and negative class turns an AUC of a into 1 - a), I suspect you've inverted the labels, and that is why you get such extraordinary results: according to the plot, permuting glucose made the model significantly better. That obviously does not make sense in general, but in this particular case it is understandable given the nature of the 1-AUC loss. Switching the labels (or using the predict_function_target_column parameter) should solve the issue.
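The symmetry argument can be checked with a small simulation. This is only a sketch on synthetic data, with a hand-written `auc` helper that is not part of DALEX. It shows that scoring the same predictions against flipped labels gives exactly one minus the original AUC, which is the situation when the explainer treats the other class as the positive one.

```r
# Illustrative helper, not part of DALEX (rank-sum formula for AUC).
auc <- function(observed, predicted) {
  r <- rank(predicted)
  n_pos <- sum(observed == 1)
  n_neg <- sum(observed == 0)
  (sum(r[observed == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(1)
n <- 200
observed  <- rbinom(n, 1, 0.5)
# Synthetic scores that are informative about the class.
predicted <- plogis(2 * observed - 1 + rnorm(n))

auc_correct <- auc(observed, predicted)      # positive class set correctly
auc_flipped <- auc(1 - observed, predicted)  # positive class inverted

c(correct = auc_correct, flipped = auc_flipped, sum = auc_correct + auc_flipped)
```

The two values always add up to 1, so a baseline AUC near 0.325 is what a model with a true AUC near 0.675 looks like when the positive class is mis-specified.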
In general, small fluctuations, and even bars extending to the left, are possible with properly set labels (like the male_single variable in the second plot). This usually means that the variable is essentially irrelevant to the trained model, since removing its information by permutation leaves the performance unchanged or even slightly improves it.
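The behaviour described above can be reproduced with a hand-rolled permutation importance in base R. This is a rough sketch of the idea only, not DALEX's implementation; the data, the glm model and the helper names (`auc`, `loss`, `permutation_importance`) are all assumptions made for illustration.

```r
# Illustrative helper, not part of DALEX (rank-sum formula for AUC).
auc <- function(observed, predicted) {
  r <- rank(predicted)
  n_pos <- sum(observed == 1)
  n_neg <- sum(observed == 0)
  (sum(r[observed == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(42)
n  <- 500
df <- data.frame(signal = rnorm(n), noise = rnorm(n))
df$y <- rbinom(n, 1, plogis(2 * df$signal))  # only `signal` drives the outcome

model <- glm(y ~ signal + noise, data = df, family = binomial())

# 1-AUC loss of the fitted model on a given data frame.
loss <- function(data) {
  1 - auc(data$y, predict(model, newdata = data, type = "response"))
}
baseline <- loss(df)

# Change in loss after shuffling one column, repeated B times.
permutation_importance <- function(variable, B = 20) {
  losses <- replicate(B, {
    shuffled <- df
    shuffled[[variable]] <- sample(shuffled[[variable]])
    loss(shuffled)
  })
  losses - baseline
}

imp_signal <- permutation_importance("signal")
imp_noise  <- permutation_importance("noise")

summary(imp_signal)  # clearly positive: shuffling hurts the model
summary(imp_noise)   # close to zero; individual values can be negative
```

For the irrelevant variable the differences scatter around zero, so some permutations can land slightly below the baseline, which is what a bar extending to the left represents.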
I hope this clears up some of your concerns.
Thanks, using the predict_function_target_column parameter cleared up my confusion.
Hi, sorry if this is not the appropriate place to ask.
I tried to build a classification model and explain it using the DALEX package. Below is a reprex of what I'm trying to do.
Notice that in the recipe above, I removed the variable pressure and created a new categorical variable (some_new_group).
Next, I try to explain this model using DALEX.
Result of svm_vp.
Here is the plot.
So, based on the plot, the most influential variable is glucose, right? It does not make sense for the some_new_group and pressure variables to be the most important variables, as we do not use these variables in the model fitting.
I have seen this post and this post, and my plot looks a bit different: my most important variable is at the bottom, while in both posts the most important variable is at the top. Even the direction of the bars in my plot is different. I attached one of the plots from the post as a comparison.
Did I miss something in the R code, or miss a certain step?