easystats / performance

:muscle: Models' quality and performance metrics (R2, ICC, LOO, AIC, BF, ...)
https://easystats.github.io/performance/
GNU General Public License v3.0

Check for influential observations of GLM w/o numeric variables #735

Open arodionoff opened 1 week ago

arodionoff commented 1 week ago

performance::check_outliers() cannot check for influential observations in a logistic regression whose response variable is a factor and whose predictors include no numeric variables.

If at least one variable is numeric (even if only the response), everything works:

# install.packages(c("smbinning", "randomForest", "performance"))
# Load library and its dataset
library(smbinning)
# Sampling
pop=smbsimdf1 # Population
train=subset(pop,rnd<=0.7) # Training sample
# Generate binning object to generate variables
smbcbs1=smbinning(train,x="cbs1",y="fgood")
smbcbinq=smbinning.factor(train,x="cbinq",y="fgood")
pop=smbinning.gen(pop,smbcbs1,"g1cbs1")
pop=smbinning.factor.gen(pop,smbcbinq,"g1cbinq")
# Resample
train=subset(pop,rnd<=0.7) # Training sample
test=subset(pop,rnd>0.7) # Testing sample
# Run logistic regression with factors
modlogisticsmb=glm(fgood ~ cbinq + cbterm + inc, data = train, family = binomial())
summary(modlogisticsmb)

library(performance)
plot( performance::check_outliers(modlogisticsmb) )

[Image: outlier plot produced by check_outliers()]

But once no numeric variables remain, for example after converting the response to a factor, we get an error:

train$fgood <- as.factor(train$fgood)
# Run logistic regression with factors
modlogisticsmb=glm(fgood ~ cbinq + cbterm + inc, data = train, family = binomial())
summary(modlogisticsmb)

# Error in performance::check_outliers()
plot( performance::check_outliers(modlogisticsmb) )

Error: No numeric variables found. No data to check for outliers.

However, the same analysis does work when it is run through performance::check_model():

performance::check_model(modlogisticsmb, check = c('outliers'), residual_type = 'normal')

[Image: outlier plot produced by check_model()]

The only annoyance is that in this case the plot is drawn only on the left side of the screen instead of filling it.
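For a factor-only model like the one above, influence measures can also be computed directly with base R, which may serve as a stopgap. The following is a minimal sketch, not a description of what check_outliers() does internally: it uses simulated data (the names toy, grp_a, grp_b and outcome are hypothetical), stats::cooks.distance(), and a rule-of-thumb cutoff of 4/n that is my own assumption rather than the threshold used by performance.

```r
# Sketch: Cook's distance for a binomial GLM with a factor response
# and only factor predictors, using base R only.
set.seed(42)
n <- 200

# Simulated data (hypothetical; stands in for the smbinning sample)
toy <- data.frame(
  grp_a = factor(sample(c("low", "mid", "high"), n, replace = TRUE)),
  grp_b = factor(sample(c("yes", "no"), n, replace = TRUE))
)

# Factor response, as in the failing case
toy$outcome <- factor(rbinom(n, 1, ifelse(toy$grp_b == "yes", 0.7, 0.4)))

fit <- glm(outcome ~ grp_a + grp_b, data = toy, family = binomial())

# One Cook's distance value per observation
cooks_d <- cooks.distance(fit)

# Rule-of-thumb cutoff (assumption, not the threshold performance uses)
cutoff <- 4 / n
influential <- which(cooks_d > cutoff)

# Full-width plot of the distances with the cutoff as a dashed line
plot(cooks_d, type = "h", ylab = "Cook's distance")
abline(h = cutoff, lty = 2)
```

Here influential holds the row indices whose Cook's distance exceeds the chosen cutoff; a different cutoff may well be more appropriate for real data.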