BavoDC / CalibrationCurves

The package with the val.prob.ci.2 code, formerly called the val.prob.ci.2 package.
https://bavodc.github.io/CalibrationCurves/

CalibrationCurve with large dataset #7

Open reiniervlinschoten opened 8 months ago

reiniervlinschoten commented 8 months ago

I have a dataset with 610,000 observations, containing the observed outcome Y and a predicted probability p of this outcome. If I try to run a calibration curve on this development dataset with the following code, I get an error.

calibration_development <- valProbggplot(
    development_df$predicted_probability,
    development_df$outcome,
    col.ideal = 'gray',
    dostats = FALSE,
    allowPerfectPredictions = TRUE)

Output

Warning in valProbggplot(development_df$predicted_probability, development_df$outcome -   :
    Number of observations > 5000, RCS is recommended.
Error in if (n> 0) c(NA_integer_, -n) else integer() :
    missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In sqrt(diag(vcov(object)))   : NaNs produced
2: In rep.fac * nx: NAs produced by integer overflow
3: In .set_row_names(as.integer(prod(d)))   : 
    NAs introduced by coercion to integer range

If I subset my data to a maximum of around 20,000 rows, only the RCS recommendation appears and the other errors disappear. However, the outcome is very rare, so the confidence interval of the calibration curve is not correct because it does not use all my data.

I have a feeling it has to do with the size of my dataset; would you have a solution?

BavoDC commented 8 months ago

Hi Reinier,

Thank you for submitting the issue and sorry for the late reply. It's currently a busy period, so unfortunately I do not have a lot of time in the coming month.

Are you allowed to share the data? This would help in debugging. Have you tried plotting the flexible calibration curve?

I also see that you allow for perfect predictions. Please be aware that this, too, will only give you an approximation of the performance, since the original values are replaced by either 1e-8 or 1 - 1e-8. How many of your predicted probabilities are equal to 0 or 1? What's the prevalence in your data set?
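For reference, the replacement Bavo describes amounts to clipping the probabilities away from the boundary. A minimal sketch in base R (the function and variable names are illustrative, not the package's internals):

```r
# Clip predicted probabilities away from 0 and 1 so that the logit
# qlogis(p) stays finite; this mirrors what allowPerfectPredictions
# does conceptually (the exact package internals may differ).
clipProbs <- function(p, eps = 1e-8) {
  pmin(pmax(p, eps), 1 - eps)
}

p  <- c(0, 0.2, 0.97, 1)
pc <- clipProbs(p)
all(is.finite(qlogis(pc)))  # TRUE: no more infinite logits
```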

I do agree that you should use all of your data. With the subset, you barely use 3% of your data so you cannot base any conclusions on this.

Please contact me again in the beginning of February if you haven't solved the problem yet. Then I will have time to look at it in detail.

reiniervlinschoten commented 8 months ago

Are you allowed to share the data? This would help in debugging.

No, unfortunately the data is very sensitive and we are under no circumstance allowed to share it.

I also see that you allow for perfect predictions. Please be aware that also this will only give you an approximation of the performance, since the original values are replaced by either 1e-8 or 1 - 1e-8. How many of your predicted probabilities are equal to 0 or 1?

Yes, I know that this is an issue. This is due to perfect separation due to some variables in the base model (17 predicted probabilities of 1). After recoding that variable and running the model again only 1 record has a predicted probability of 1. I will use that model in the rest of the steps.

What's the prevalence in your data set?

The dataset has 606,918 records, with 3,874 outcomes, for a prevalence of 0.6%.
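As a quick sanity check on those numbers (pure arithmetic):

```r
n_records <- 606918
n_events  <- 3874
prevalence <- n_events / n_records
round(100 * prevalence, 1)  # 0.6 (percent)
```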

Have you tried plotting the flexible calibration curve?

This plots the flexible calibration curve right?

calibration_development <- valProbggplot(
    development_df$predicted_probability,
    development_df$outcome,
    col.ideal = 'gray',
    dostats = FALSE,
    allowPerfectPredictions=T)

I also tried without plotting (pl = FALSE), with restricted cubic splines (smooth = "rcs"), with no flexible curve (smooth = "none"), and with a bootstrapped confidence interval (CL.BT = TRUE). These all give the same error.

Error in if (n> 0) c(NA_integer_, -n) else integer() :
    missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In rep.fac * nx: NAs produced by integer overflow
2: In .set_row_names(as.integer(prod(d)))   : 
    NAs introduced by coercion to integer range

The line about a problem with the vcov matrix (1: In sqrt(diag(vcov(object))) : NaNs produced) seems to have disappeared (both for the base model and for the updated model I have used in these steps). I am not certain why.

RRiviere15 commented 5 months ago

Hi Reinier, Did you manage to come to a solution for this? I have almost exactly the same problem with my dataset of around 60,000 observations. I used val.prob.ci.2 on a subset of my dataset (<30,000 rows) and it works perfectly, but on the full dataset I get:

Error: Warning in CalibrationCurves::val.prob.ci.2(p = predicted_probabilities, :

Number of observations > 5000, RCS is recommended.

Error in double(N * M1) : vector size cannot be NA

In addition: Warning message:

In N * M1 : NAs produced by integer overflow

I have rechecked many aspects of the data. There are no missing data, and all data frames and matrices are of appropriate size, but there seems to be no way around it.

It's a shame, because I wanted to use this package since the graphs are nice. But if it is not possible with larger datasets, I will have to fall back on the rms package, which works just fine with the entire dataset but doesn't generate the confidence intervals.

BavoDC commented 5 months ago

Hi,

One of the reasons could be that the loess function results in an error due to the size of the data set. Can you try setting the smooth argument to "rcs"? For example, val.prob.ci.2(predicted, outcome, smooth = "rcs"). This has worked for other people in the past.

Best,

Bavo.

reiniervlinschoten commented 5 months ago

Hi all,

No, unfortunately I was not able to fix it. I calculated the calibration intercept and slope, with corresponding CIs, myself. It's actually quite easy.

# First calculate logits
dataset$prob <- predict(model, newdata=dataset, type="response")
dataset$logit <- qlogis(dataset$prob)

# Calculate slope
cal_slope <- glm(outcome ~ logit, data=dataset, family=binomial)
# Get slope
slope <- coef(cal_slope)["logit"]
# Get conf interval
slope_ci <- confint(cal_slope)["logit", ]

# Do the same for the intercept; fix the slope at 1 via an offset
# Can't find the citation anymore for why, maybe @David-van-Klaveren or @BavoDC can find it
cal_intercept <- glm(outcome ~ offset(logit), data=dataset, family=binomial)
# Get intercept
intercept <- coef(cal_intercept)["(Intercept)"]
# Get conf interval
intercept_ci <- confint(cal_intercept)["(Intercept)", ]

RRiviere15 commented 5 months ago

Hi Bavo,

I tried to modify it with smooth = rcs as you recommended and got this error message:

CalibrationCurves::val.prob.ci.2(p = predicted_probabilities, y = PUF21CCA$delirium2, smooth = "rcs")

singular information matrix in lrm.fit (rank= 4 ). Offending variable(s):
Warning in value[3L] : The number of knots led to estimation problems, nk will be set to nr.knots
Error in nk - 1 : non-numeric argument to binary operator

Thanks for your recommendation Reinier - will have a shot at it.

RRiviere15 commented 5 months ago

Hi Reinier,

That's brilliant. What an elegant solution. I trust these numbers because they are similar to what I am able to obtain with val.prob.ci.2 with half the dataset, but even smaller confidence intervals since now I am using the entire data. Very good!

reiniervlinschoten commented 5 months ago

Great! The C-statistic (also called AUC) with a CI can be done with the pROC package
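A minimal sketch of that with pROC (the vector names outcome and prob are illustrative; pROC's ci.auc computes DeLong confidence intervals by default):

```r
library(pROC)

# outcome: 0/1 vector of observed events; prob: predicted probabilities
roc_obj <- roc(response = outcome, predictor = prob)
auc(roc_obj)     # C-statistic
ci.auc(roc_obj)  # 95% CI (DeLong by default)
```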

On Tue, 12 Mar 2024 at 21:17, RRiviere15 @.***> wrote:

Hi Reinier,

That's brilliant. What an elegant solution. Do you have any code suggestions to calculate the CI for the C-statistic? It would be immensely helpful. I trust these numbers because they are similar to what I am able to obtain with val.prob.ci.2 with half the dataset, but even smaller confidence intervals since now I am using the entire data. Very good!


BavoDC commented 5 months ago

Hi Raphaël,

The error you get was due to an error in the previous version of the package. If you update the package, you should no longer see this error and it should estimate the calibration curve using RCS.

Reinier's code to calculate the calibration intercept and slope is correct. See also https://github.com/BavoDC/CalibrationCurves/issues/3.

You can find the necessary references using the code citation("CalibrationCurves").

BavoDC commented 5 months ago

Hi Reinier, Raphaël,

Could you help me with debugging by running the following block of code on your data sets? y is the vector with the values of the outcome and p is the vector with the predicted probabilities. Could you paste the errors that you get and where they occur? This would be extremely helpful and, with this, you would also help future users.

y                                          # Outcome variable
p                                          # Predicted probabilities
Eta = binomial()$linkfun(p)   # Linear predictor

## Logistic calibration curve ##
glmFit = glm(y ~ Eta, family = binomial)

## Flexible calibration curve ##

### Loess ###
argzLoess = alist(degree = 2, formula = y ~ p)
SmFit = do.call("loess", argzLoess)
predict(SmFit, type = "fitted", se = TRUE)

### RCS ###
argzRCS = alist(x = p,
                y = y,
                model = "logistic",
                nk = 5,
                show = "prob",
                statloc = "none",
                plot = FALSE,
                showknots = FALSE,
                xrange = c(min(na.omit(p)), max(na.omit(p))),
                lty = 1)

.rcspline.plot <<- CalibrationCurves:::.rcspline.plot

nkDecrease <- function(Argz) {
  tryCatch(
    do.call(".rcspline.plot", Argz),
    error = function(e) {
      nk = eval(Argz$nk)
      warning(paste0("The number of knots led to estimation problems, nk will be set to ", nk - 1), immediate. = TRUE)
      if(nk < 3)
        stop("Nk = 3 led to estimation problems.")
      Argz$nk = nk - 1
      nkDecrease(Argz)
    }
  )
}
rcsFit = nkDecrease(argzRCS)

Thank you also for reporting the issue. This helps me to further improve the package!

Have a nice day,

Bavo.

reiniervlinschoten commented 5 months ago

I will try, hopefully next weekend!

RRiviere15 commented 5 months ago

Hi Raphaël,

The error you get was due to an error in the previous version of the package. If you update the package, you should no longer see this error and it should estimate the calibration curve using RCS.

Reinier's code to calculate the calibration intercept and slope is correct. See also #3.

You can find the necessary references using the code citation("CalibrationCurves").

Hi Bavo, I made sure the package was updated like you suggested, and tried RCS instead of LOESS and got this:

Error in CalibrationCurves::val.prob.ci.2(p = predicted_probabilities, : lengths of p or logit and y do not agree In addition: Warning message: In log(p/(1 - p)) : NaNs produced

I checked and there were no predicted_probabilities values equal to 0 or 1 that could make the log() call fail. I will go ahead and try the new code you've suggested. Cheers, RR

RRiviere15 commented 5 months ago

Hello, so I will be making separate posts. The first addresses the 'loess' part of the code. I made the linear predictor into a separate variable Eta like you suggested; normally I just put in all the predictors instead of creating a linear predictor and then passing it to glm():

Trying out Bavo's Code

LRModel_CCA1_CV <- glm(delirium2 ~ agenum + asa2 + cogimpair2 + infection2 + optime + casetype + anestype3, data = PUF21CCA, family = "binomial", x = TRUE, y = TRUE)

coefficients <- coef(LRModel_CCA1_CV)
predictor_matrix <- model.matrix(~ agenum + asa2 + cogimpair2 + infection2 + optime + casetype + anestype3, data = PUF21CCA)
Eta <- predictor_matrix %*% coefficients

Data pre-processed so can now execute

glmFit <- glm(delirium2 ~ Eta, data = PUF21CCA)

predicted_probabilities <- predict(glmFit, newdata = PUF21CCA, type = "response") #Trying on the same dataset; not 'new'

argzLoess = alist(degree = 2, formula = PUF21CCA$delirium2 ~ predicted_probabilities)
SmFit = do.call("loess", argzLoess)
predict(SmFit, type = "fitted", se = TRUE)

And our old friend returns. I did it the usual way as well, without creating Eta and just directly listing the predictors, and it still gave me the same message; I also tried changing type = "fitted" to type = "response", still the same message:

Error in double(N * M1) : vector size cannot be NA
In addition: Warning message:
In N * M1 : NAs produced by integer overflow

RRiviere15 commented 5 months ago

I think I figured out the problem. In the last line of the code, if I change the se setting to:

predict(SmFit, type = "fitted", se = FALSE)

it executes and displays all the predicted values. So the problem is with calculating the standard errors.

RRiviere15 commented 5 months ago

With regards to the RCS component of the code, it processed the code all the way to the end and finally gave this error:

Warning in value[3L] : The number of knots led to estimation problems, nk will be set to 4
Warning in value[3L] : The number of knots led to estimation problems, nk will be set to 3
Warning in value[3L] : The number of knots led to estimation problems, nk will be set to 2
Warning in value[3L] : The number of knots led to estimation problems, nk will be set to 1
Error in value[3L] : Nk = 3 led to estimation problems.
Called from: value[3L]

BavoDC commented 5 months ago

Many thanks Raphaël!! This helps me to pinpoint the problematic parts in the code.

As you noticed, the main problem with loess is the calculation of the standard errors, which are necessary for calculating the confidence intervals. You can either adjust the arguments for loess (span or degree) or try using the lowess function, but I am afraid this will also result in errors.
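To illustrate those two workarounds (a sketch with generic y and p; the span and degree values are arbitrary starting points, and lowess gives fitted values only, no standard errors):

```r
# 1) loess with a lower degree and larger span: a smoother, cheaper fit
#    that is less likely to break when computing standard errors
smLoess  <- loess(y ~ p, degree = 1, span = 0.9)
fitLoess <- predict(smLoess, se = TRUE)  # may still fail for very large n

# 2) lowess: avoids the standard-error machinery entirely
#    (so no confidence intervals); fitted curve in smLowess$x, smLowess$y
smLowess <- lowess(p, y, f = 2/3)
```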

Similar to loess, estimating the calibration curve with RCS also results in problems.

Estimating the calibration curve with a non-parametric smoother (e.g., loess or RCS) seems particularly problematic with large data sets. Hence, we are left with the logistic calibration framework to estimate the calibration curve, for example with the following code:

val.prob.ci.2(p, y, smooth = "none", logistic.cal = TRUE)
valProbggplot(p, y, smooth = "none", logistic.cal = TRUE)

Given the size of the data set and the aforementioned problems, this is the most appropriate (and only) way to estimate the calibration curve and to assess the calibration of your prediction model. When writing this down in a paper, do mention that you were not able to fit a non-parametric smoother.

I will, of course, further investigate the problems that you experience and try to tackle these issues. I can also reproduce the error, which will help me to further debug the code. Thanks again and don't hesitate to reach out in case of any questions!

BavoDC commented 5 months ago

One additional remark. In this situation, we cannot estimate the standard errors using loess, but we are able to estimate the flexible calibration curve itself. I would compare the flexible calibration curve with the logistic calibration curve. If these are in line with each other, this gives you additional certainty about the validity of the logistic calibration curve. If not, I would rely more on the flexible calibration curve. With this large data set, the confidence intervals will tend to be very narrow anyway (do check the distribution of the predicted values though, to see where the uncertainty will be larger) so you would not gain a lot of extra information from plotting them.
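A sketch of that comparison (generic y and p; the loess curve is fitted without standard errors and overlaid on the logistic calibration curve; qlogis(p) assumes all probabilities lie strictly between 0 and 1):

```r
# Flexible (loess) calibration curve, point estimates only
smFit <- loess(y ~ p, degree = 2)
ord   <- order(p)

plot(p[ord], fitted(smFit)[ord], type = "l", lwd = 2,
     xlab = "Predicted probability", ylab = "Observed proportion")

# Logistic calibration curve: regress y on the logit of p
logitFit <- glm(y ~ qlogis(p), family = binomial)
lines(p[ord], fitted(logitFit)[ord], lty = 2, lwd = 2)

# Ideal calibration for reference
abline(0, 1, col = "gray")
legend("topleft", c("loess", "logistic"), lty = 1:2, lwd = 2)
```

If the two curves roughly coincide, that supports relying on the logistic calibration curve, as suggested above.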

RRiviere15 commented 5 months ago

Hi Bavo, thank you for your very insightful comments. I read your 2023 paper about calibration, where you specifically mention the 'logistic calibration framework', which has been helpful in understanding all this. Harrell's val.prob uses the same logistic calibration framework to generate the calibration slope and intercept, would you say?

You are correct; I was able to generate loess curves, but I believe the calibration statistics themselves were generated by the logistic calibration framework with rms::val.prob(), which allows one to plot both the logistic calibration and loess curves on the same plot. This has actually been extremely useful when comparing one of my LASSO logistic regression models with an unpenalized logistic regression model, where there is a big discrepancy between the loess and logistic curves for the LASSO model.

I will check the distribution of the predicted values as you recommend.

Do you have any recommendations on how to perform a calibration for a random forest algorithm derived model? My RF model generated a lot of perfect predictions (probability 0 or 1) and val.prob simply deleted these when plotting the LOESS and associated calibration statistics so the calibration parameters are quite poor.

I hope your de-bugging journey goes well! Please do keep me updated. Even if I can't use it for my dissertation by May/June it would be cool to use for the publication.

BavoDC commented 5 months ago

Hi Raphaël,

Happy to hear that the paper has been helpful in understanding!

Harrell's val.prob does indeed use the same logistic calibration framework, but there is a difference in how the calibration intercept is computed (see https://bavodc.github.io/websiteCalibrationCurves/articles/CalibrationCurves.html#why-is-the-calibration-intercept-different-in-the-rms-package ). My paper is based on previous papers of Harrell, so definitely do check the references if you need some more information.

If needed, you can also plot both the loess and logistic calibration curves using the CalibrationCurves package. For example, val.prob.ci.2(p, y, logistic.cal = TRUE, col.log = "purple").

What's the proportion of perfect predictions (this shouldn't be too high)? Do any of the leaf nodes contain observations with only 0 or 1? If you are certain that there are no model problems (no overfitting, model converged, ...), you can replace 0 and 1 by 1e-8 and 1 - 1e-8, respectively. This, however, is far from ideal and will only give you an approximation of the model performance. Please keep in mind that this specific situation is indicative of the model fitting random noise and that perfect predictions imply that the values are deterministic.

Many thanks and will keep you updated. Good luck with finishing your dissertation!

RRiviere15 commented 5 months ago

Hi Bavo, it was about 10% of the data that had perfect predictions. I suppose I will simply have to accept it as a limitation, as I have no other easy way to get a good idea of the calibration. The model definitely is overfit, which is probably a good point when discussing the strengths and weaknesses. Thank you for the suggestion re 1e-8; that did the trick. Looking forward to the standard error fix so I can use it on the entire dataset!

fredho-42 commented 2 weeks ago

Sorry for resurrecting this, but I encountered a similar problem with n = 367,225.

> calPerf1 <- with(ds1 |> filter(MH==1), val.prob.ci.2(UKrisk, event_10y, smooth = "none", logistic.cal = TRUE))
> calPerf0 <- with(ds1 |> filter(MH==0), val.prob.ci.2(UKrisk, event_10y, smooth = "none", logistic.cal = TRUE))
Error in if (n > 0) c(NA_integer_, -n) else integer() : 
  missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In rep.fac * nx : NAs produced by integer overflow
2: In .set_row_names(as.integer(prod(d))) :
  NAs introduced by coercion to integer range

ds1 is a data frame with about 400,000 rows, and MH is a binary variable. About n=4000 had MH==1, where the function executed without a problem.

Summary for the event_10y and UKrisk variables are below:

> table(ds1$MH, ds1$event_10y, useNA = 'ifany')
         0      1
  0 351425  15800
  1  32413   1815

> summary(ds1$UKrisk)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.002192 0.024064 0.041309 0.045954 0.062822 0.263547 

I've already specified no smoothing, so I wondered if there are other ways I could try to make this work? Thanks.