BavoDC / val.prob.ci.2


Error in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x), : 'data' must be of a vector type, was 'NULL' #4

Open · afenton85 opened this issue 3 years ago

afenton85 commented 3 years ago

I am trying to plot a calibration plot. There are 329443 observations, and I enter the following:

val.prob.ci.2(Bare$KFRE_Pred, Bare$RRT_in_2YRS, smooth="rcs")

which leads to the following error:

singular information matrix in lrm.fit (rank= 3 ). Offending variable(s):

Warning in value[3L] : The number of knots led to estimation problems, nk will be set to 4.
singular information matrix in lrm.fit (rank= 3 ). Offending variable(s):

Warning in value[3L] : Nk 4 also led to estimation problems, nk will be set to 3.
singular information matrix in lrm.fit (rank= 2 ). Offending variable(s):

Error in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x), : 'data' must be of a vector type, was 'NULL'
In addition: Warning message:
In val.prob.ci.2(Bare$KFRE_Pred, Bare$RRT_in_2YRS, smooth = "rcs") :
27 observations deleted from logistic calibration due to probs. of 0 or 1

Any suggestions?

Best, Anthony

BavoDC commented 3 years ago

Hi Anthony,

Based on your output, I suspect that you have quasi-complete separation ("27 observations deleted from logistic calibration due to probs. of 0 or 1") and that this is what causes problems with the function. Do some of your coefficients (and their standard errors) tend to infinity? If so, I would suggest using Firth's bias reduction method to estimate your model (see, for example, the brglm2 package, which also lets you check for quasi-complete separation) and using those predicted values. This is important, since your current results are not valid due to the separation issue.
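A minimal sketch of that approach (assuming the brglm2 package is installed; y, x1, x2 and Df are placeholder names standing in for your own outcome, predictors and data frame):

library(brglm2)
# Firth-type bias reduction keeps the coefficient estimates finite
# even under (quasi-)complete separation
firthFit  = glm(y ~ x1 + x2, data = Df, family = binomial, method = "brglmFit")
summary(firthFit)                                # finite estimates and SEs
predFirth = predict(firthFit, type = "response") # bias-reduced predicted probabilities
# these predictions can then be passed to val.prob.ci.2 instead of the ML ones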

Best,

Bavo.

afenton85 commented 3 years ago

Thank you for your reply. This is an external validation of someone else's published model. I have tried deleting the 27 cases where the predicted risk is 0, but I still get the following:

singular information matrix in lrm.fit (rank= 3 ). Offending variable(s):

Warning in value[3L] : The number of knots led to estimation problems, nk will be set to 4.
singular information matrix in lrm.fit (rank= 3 ). Offending variable(s):

Warning in value[3L] : Nk 4 also led to estimation problems, nk will be set to 3.
singular information matrix in lrm.fit (rank= 2 ). Offending variable(s):

Error in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x), : 'data' must be of a vector type, was 'NULL'

BW, Anthony

BavoDC commented 3 years ago

Did you inspect the distribution of your predicted probabilities? If the majority of them are situated near 0 or 1, I think I know what causes the problem. Also, are you using a machine learning algorithm? When you have 329443 observations and the majority of your predicted probabilities are very close to 0 or 1, you should always be wary of the results and check for possible data leakage. If it's a logistic regression model, I would check the range of the data that you are using (assuming that the original model does not contain extremely large coefficients). In any case, errors like these indicate that something is wrong with either the data or the model. We only get predicted probabilities like these when the coefficients, the covariate values or both are very large:

binomial()$linkinv(10)  # very close to 1
binomial()$linkinv(-10) # very close to 0
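As a quick check on that distribution (using Bare$KFRE_Pred, the vector of predicted probabilities from your original call; the 0.01/0.99 cut-offs are arbitrary choices, not fixed thresholds):

summary(Bare$KFRE_Pred)
hist(Bare$KFRE_Pred, breaks = 50)
mean(Bare$KFRE_Pred < 0.01 | Bare$KFRE_Pred > 0.99) # share of near-boundary predictions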

I was able to reproduce the errors, but I explicitly had to mimic quasi-complete separation when generating the data. The function throws an error because it is not able to (reliably) estimate the calibration curve using restricted cubic splines, which in turn is because the response is almost perfectly separated by the predicted values. So, even though the model almost perfectly separates both classes, its calibration is poor to say the least. Since we only have either very low or very high predicted probabilities, there is a whole range of values where the model is not calibrated at all.

Below is the code that I used to reproduce the errors that you had. As you can see, we get the same error message and nonsensical results. I hope this answers your question and that you are able to solve it.

library(CalibrationCurves)
library(detectseparation)

# Real-data example with quasi-complete separation
data("MurderRates", package = "AER")
glmFit = glm(I(executions > 0) ~ time + income + noncauc + lfp + southern,
             data = MurderRates, family = binomial)
coef(glmFit) # coefficient estimates tend to infinity
glm(I(executions > 0) ~ time + income + noncauc + lfp + southern,
    data = MurderRates, family = binomial,
    method = "detect_separation") # confirms the separation

val.prob.ci.2(fitted(glmFit), as.numeric(I(MurderRates$executions > 0)), smooth = "rcs")
plot(fitted(glmFit), as.numeric(I(MurderRates$executions > 0)))

# Simulated example: the huge true coefficients (-50, 50) push almost
# all probabilities to 0 or 1, mimicking quasi-complete separation
set.seed(1) # for reproducibility
n = 1e4
p = 2
X      = replicate(p, rnorm(n))
p0true = binomial()$linkinv(cbind(1, X) %*% c(0.1, -50, 50))
y      = rbinom(n, 1, p0true)
Df     = data.frame(y, X)
glmFit = glm(y ~ ., data = Df, family = binomial) # warns about fitted probs of 0 or 1

# Independent validation sample from the same data-generating mechanism
n = 1e3
Xval   = replicate(p, rnorm(n))
p0true = binomial()$linkinv(cbind(1, Xval) %*% c(0.1, -50, 50))
yval   = rbinom(n, 1, p0true)
Pred   = binomial()$linkinv(cbind(1, Xval) %*% coef(glmFit))
plot(Pred, yval) # nearly perfect separation of the two classes
hist(Pred)       # mass piles up at 0 and 1
val.prob.ci.2(Pred, yval, smooth = "rcs") # reproduces the error
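To see how extreme the simulated predictions are, you can run the same checks on Pred (continuing from the code above; again, the 0.01/0.99 cut-offs are arbitrary):

summary(Pred)
mean(Pred < 0.01 | Pred > 0.99) # nearly all predictions sit at the boundary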