OHDSI / PatientLevelPrediction

An R package for performing patient level prediction in an observational database in the OMOP Common Data Model.
https://ohdsi.github.io/PatientLevelPrediction

XGBoost Error #447

Open · chanjunkai11 opened 5 months ago

chanjunkai11 commented 5 months ago

Describe the bug
When I try running the XGBoost patient-level prediction R package generated by OHDSI Atlas, it gives this error:

[screenshot of the XGBoost error message]

The dataset I'm using was generated from Synthea and contains a total of 40k patients.

My system info is shown below: [screenshot of system info]

To Reproduce

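# Connection details for the local PostgreSQL database holding the Synthea CDM (password elided)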
cd <- DatabaseConnector::createConnectionDetails(
     dbms     = "postgresql", 
     server   = "localhost/synthea10", 
     user     = "postgres", 
     password = "", 
     port     = 5432, 
     pathToDriver = "C:/Users/rider/Downloads"
)

cdmDatabaseSchema <- 'clonecdm'
cdmDatabaseName <- 'test'
cohortDatabaseSchema <- 'clonecdm'

tempEmulationSchema <- NULL
cohortTable <- 'cohort'

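# Point PLP at the CDM schema plus the cohort and outcome tables (here cohorts and outcomes share one table)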
databaseDetails <- PatientLevelPrediction::createDatabaseDetails(
        connectionDetails = cd, 
        cdmDatabaseSchema = cdmDatabaseSchema, 
        cdmDatabaseName = cdmDatabaseName, 
        tempEmulationSchema = tempEmulationSchema, 
        cohortDatabaseSchema = cohortDatabaseSchema, 
        cohortTable = cohortTable, 
        outcomeDatabaseSchema = cohortDatabaseSchema,  
        outcomeTable = cohortTable, 
        cdmVersion = 5
)

logSettings <- PatientLevelPrediction::createLogSettings(
        verbosity = 'INFO', 
        logName = 'prediction'
)

createProtocol <- FALSE
createCohorts <- TRUE
runDiagnostic <- FALSE
viewDiagnostic <- FALSE
runAnalyses <- TRUE
sampleSize <- NULL 
createValidationPackage <- FALSE
analysesToValidate <- NULL
packageResults <- FALSE
minCellCount <- 5
createShiny <- FALSE

outputFolder <- 'C:/predictionResults'
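# Direct reticulate to the Python environment used by PLP's python-based models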
library(reticulate)
use_python("C:/Users/rider/OneDrive/Documents/.virtualenvs/r-reticulate/Scripts/python.exe")

Daib::execute(
        databaseDetails = databaseDetails,
        outputFolder = outputFolder,
        createProtocol = createProtocol,
        createCohorts = createCohorts,
        runDiagnostic = runDiagnostic,
        viewDiagnostic = viewDiagnostic,
        runAnalyses = runAnalyses,
        createValidationPackage = createValidationPackage,
        analysesToValidate = analysesToValidate,
        packageResults = packageResults,
        minCellCount = minCellCount,
        logSettings = logSettings,
        sampleSize = sampleSize
)

The code shown above is from the OHDSI skeleton package used to execute the prediction model generated by OHDSI Atlas.

egillax commented 5 months ago

Hi, @chanjunkai11

Thank you for the report. This is an issue I've seen before but haven't been able to reproduce. I think it has something to do with having a low number of outcomes. Could you by any chance share the log generated by PLP from a run where this failed? That could help me reproduce it.

chanjunkai11 commented 5 months ago

Hello, here is the log file you requested; please reach out to me if you need more information to reproduce the issue.

plpLog.txt

egillax commented 5 months ago

Hi @chanjunkai11,

This happens when you have a really low number of outcomes. In your case I believe you have 24 outcomes. After splitting, this means that during cross-validation you are using 12 outcomes for fitting and 6 for validating. Using early stopping in xgboost additionally requires an early stopping set, which is 10% of the training data, giving ~1 outcome, which breaks xgboost.
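
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch; the 75/25 train/test split, 3-fold cross-validation, and 10% early stopping fraction are assumed defaults, not taken from the log:

# Back-of-the-envelope for the failure above (assumed PLP defaults)
totalOutcomes     <- 24
trainOutcomes     <- totalOutcomes * 0.75    # 18 outcomes in the training set
foldFitOutcomes   <- trainOutcomes * 2 / 3   # 12 outcomes to fit on per fold
foldValOutcomes   <- trainOutcomes * 1 / 3   # 6 outcomes to validate on per fold
earlyStopOutcomes <- foldFitOutcomes * 0.10  # outcomes left to early stop on
earlyStopOutcomes
# [1] 1.2   -> only ~1 outcome, too few for xgboost to compute a metric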

The question is what we should do in the case of so few outcomes. Currently PLP stops if there are fewer than 10 outcomes in total or fewer than 5 outcomes per fold. I personally believe we should raise that threshold by at least an order of magnitude (to at least 100 outcomes), but that needs wider discussion.

In your case you are barely above the current threshold. If you still want a model, you can turn off early stopping in your model settings and then it should finish.
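
For example, a minimal sketch of model settings with early stopping turned off; this assumes the earlyStopRound argument of PLP's setGradientBoostingMachine() controls early stopping and that NULL disables it, so check against your installed version:

# Sketch: XGBoost model settings without an early stopping set
modelSettings <- PatientLevelPrediction::setGradientBoostingMachine(
  ntrees = 300,           # fixed number of boosting rounds instead of early stopping
  maxDepth = 6,
  learnRate = 0.1,
  earlyStopRound = NULL   # assumed: NULL means no early stopping set is carved out
)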

chanjunkai11 commented 5 months ago

If that is the case, then why does the decision tree model not have this issue using the same dataset?

egillax commented 5 months ago

Using early stopping in xgboost additionally requires an early stopping set, which is 10% of the training data, giving ~1 outcome.

Because of this.

Even if the model fits with so few outcomes, the resulting model will be so bad it's basically useless.

jreps commented 5 months ago

@egillax - good detective work :). Shall we add a check in xgboost that warns if the number of outcomes is <100 and/or the number of outcomes in the early stopping set is < N (where N is the minimum xgboost supports)? That way the user will know what the error is. In terms of the minimum outcomes for PLP, it used to be 100, but that caused issues when people wanted to fit models with 98 outcomes, so I made it the smallest value possible for the code not to fail (most of the time); the user then needs to decide whether there are truly adequate outcomes. Happy to have a wider discussion and change that, though, if people think it is best.

egillax commented 5 months ago

@jreps Yes, we can add such a warning. I think this whole discussion comes down to whether we want to be opinionated about the minimum required outcomes, or try to make everything run without error and leave the rest up to the user.

In this case we could print a warning that the number of outcomes is too small for early stopping and turn it off so the model can still be fit. I think the error comes from the fact that if you only have 1 outcome you can't calculate any metric to early stop on.

Alternatively, we could catch the error and stop, print a more informative error message (number of outcomes too low for early stopping), and include suggestions on what the user can change to still fit a model.
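
For reference, a hypothetical sketch of the warn-and-disable option; the function and argument names are illustrative, not actual PLP internals:

# Hypothetical helper: warn and disable early stopping when too few
# outcomes would land in the early stopping set (names are illustrative)
checkEarlyStopOutcomes <- function(fitOutcomeCount,
                                   earlyStopFraction = 0.10,
                                   minOutcomes = 2) {
  # Expected number of outcomes in the early stopping set
  earlyStopOutcomes <- floor(fitOutcomeCount * earlyStopFraction)
  if (earlyStopOutcomes < minOutcomes) {
    warning("Only ~", earlyStopOutcomes, " outcome(s) available for early ",
            "stopping; disabling early stopping so the model can still fit.")
    return(FALSE)  # caller should fit without an early stopping set
  }
  TRUE  # enough outcomes; keep early stopping on
}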