OHDSI / PatientLevelPrediction

An R package for performing patient level prediction in an observational database in the OMOP Common Data Model.
https://ohdsi.github.io/PatientLevelPrediction

Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered, : need vmode or initdata #40

Closed JamesSWiggins closed 6 years ago

JamesSWiggins commented 6 years ago

I receive the below error,

Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered, : need vmode or initdata

when running the following code:

modelSettings <- PatientLevelPrediction::setNaiveBayes()

results <- PatientLevelPrediction::runPlp(population = population,
                                         plpData = plpData, 
                                         modelSettings = modelSettings, 
                                         testSplit = 'time',
                                         testFraction = 0.25, 
                                         nfold = 3)

I receive the same error even if I use different statistical models. I've tried to trace the error, and it seems to occur when the following code is run:

From predictPlp:

....
if (class(plpModel) == "plpModel") {
        prediction <- plpModel$predict(plpData = plpData, population = population[ind, 
            ])

Could this be the result of an issue with my input data? Or perhaps my environment?

JamesSWiggins commented 6 years ago

Digging a little further. I believe this error is occurring because my "deletedInfrequentCovariateIds <- preprocessSettings$deletedInfrequentCovariateIds" is NULL.

That value gets passed as 'table' and runs in the below function: ffmatch(x = x, table = as.ff(table), nomatch = 0L) > 0L

And ffmatch contains this code: stopifnot(inherits(x, "ff_vector") & inherits(table, "ff_vector"))

Which results in the original message I posted: Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered, : need vmode or initdata

If it's relevant, here are my covariate settings:


                                                                useDemographicsAge = TRUE,
                                                                useDemographicsAgeGroup = TRUE,
                                                                useDemographicsRace = TRUE,
                                                                useDemographicsEthnicity = TRUE,
                                                                useDemographicsIndexYear = TRUE,
                                                                useDemographicsIndexMonth = TRUE,
                                                                useDemographicsPriorObservationTime = FALSE,
                                                                useDemographicsPostObservationTime = FALSE,
                                                                useDemographicsTimeInCohort = FALSE,
                                                                useDemographicsIndexYearMonth = FALSE,
                                                                useConditionOccurrenceAnyTimePrior = FALSE,
                                                                useConditionOccurrenceLongTerm = FALSE,
                                                                useConditionOccurrenceMediumTerm = FALSE,
                                                                useConditionOccurrenceShortTerm = FALSE,
                                                                useConditionOccurrenceInpatientAnyTimePrior = FALSE,
                                                                useConditionOccurrenceInpatientLongTerm = FALSE,
                                                                useConditionOccurrenceInpatientMediumTerm = FALSE,
                                                                useConditionOccurrenceInpatientShortTerm = FALSE,
                                                                useConditionEraAnyTimePrior = FALSE,
                                                                useConditionEraLongTerm = FALSE,
                                                                useConditionEraMediumTerm = FALSE,
                                                                useConditionEraShortTerm = FALSE,
                                                                useConditionEraOverlapping = FALSE,
                                                                useConditionEraStartLongTerm = FALSE,
                                                                useConditionEraStartMediumTerm = FALSE,
                                                                useConditionEraStartShortTerm = FALSE,
                                                                useConditionGroupEraAnyTimePrior = FALSE,
                                                                useConditionGroupEraLongTerm = TRUE,
                                                                useConditionGroupEraMediumTerm = TRUE,
                                                                useConditionGroupEraShortTerm = TRUE,
                                                                useConditionGroupEraOverlapping = FALSE,
                                                                useConditionGroupEraStartLongTerm = FALSE,
                                                                useConditionGroupEraStartMediumTerm = FALSE,
                                                                useConditionGroupEraStartShortTerm = FALSE,
                                                                useDrugExposureAnyTimePrior = TRUE,
                                                                useDrugExposureLongTerm = TRUE,
                                                                useDrugExposureMediumTerm = TRUE,
                                                                useDrugExposureShortTerm = TRUE,
                                                                useDrugEraAnyTimePrior = TRUE,
                                                                useDrugEraLongTerm = TRUE,
                                                                useDrugEraMediumTerm = TRUE,
                                                                useDrugEraShortTerm = TRUE,
                                                                useDrugEraOverlapping = TRUE,
                                                                useDrugEraStartLongTerm = TRUE,
                                                                useDrugEraStartMediumTerm = TRUE,
                                                                useDrugEraStartShortTerm = TRUE,
                                                                useDrugGroupEraAnyTimePrior = TRUE,
                                                                useDrugGroupEraLongTerm = TRUE,
                                                                useDrugGroupEraMediumTerm = TRUE,
                                                                useDrugGroupEraShortTerm = TRUE,
                                                                useDrugGroupEraOverlapping = TRUE,
                                                                useDrugGroupEraStartLongTerm = TRUE,
                                                                useDrugGroupEraStartMediumTerm = TRUE,
                                                                useDrugGroupEraStartShortTerm = TRUE,
                                                                useProcedureOccurrenceAnyTimePrior = FALSE,
                                                                useProcedureOccurrenceLongTerm = TRUE,
                                                                useProcedureOccurrenceMediumTerm = TRUE,
                                                                useProcedureOccurrenceShortTerm = TRUE,
                                                                useDeviceExposureAnyTimePrior = FALSE,
                                                                useDeviceExposureLongTerm = TRUE,
                                                                useDeviceExposureMediumTerm = TRUE,
                                                                useDeviceExposureShortTerm = TRUE,
                                                                useMeasurementAnyTimePrior = FALSE,
                                                                useMeasurementLongTerm = TRUE,
                                                                useMeasurementMediumTerm = TRUE,
                                                                useMeasurementShortTerm = TRUE,
                                                                useMeasurementValueAnyTimePrior = FALSE,
                                                                useMeasurementValueLongTerm = FALSE,
                                                                useMeasurementValueMediumTerm = FALSE,
                                                                useMeasurementValueShortTerm = FALSE,
                                                                useMeasurementRangeGroupAnyTimePrior = FALSE,
                                                                useMeasurementRangeGroupLongTerm = TRUE,
                                                                useMeasurementRangeGroupMediumTerm = FALSE,
                                                                useMeasurementRangeGroupShortTerm = FALSE,
                                                                useObservationAnyTimePrior = FALSE,
                                                                useObservationLongTerm = TRUE,
                                                                useObservationMediumTerm = TRUE,
                                                                useObservationShortTerm = TRUE,
                                                                useCharlsonIndex = TRUE,
                                                                useDcsi = TRUE,
                                                                useChads2 = TRUE,
                                                                useChads2Vasc = TRUE,
                                                                useDistinctConditionCountLongTerm = FALSE,
                                                                useDistinctConditionCountMediumTerm = FALSE,
                                                                useDistinctConditionCountShortTerm = FALSE,
                                                                useDistinctIngredientCountLongTerm = FALSE,
                                                                useDistinctIngredientCountMediumTerm = FALSE,
                                                                useDistinctIngredientCountShortTerm = FALSE,
                                                                useDistinctProcedureCountLongTerm = FALSE,
                                                                useDistinctProcedureCountMediumTerm = FALSE,
                                                                useDistinctProcedureCountShortTerm = FALSE,
                                                                useDistinctMeasurementCountLongTerm = FALSE,
                                                                useDistinctMeasurementCountMediumTerm = FALSE,
                                                                useDistinctMeasurementCountShortTerm = FALSE,
                                                                useDistinctObservationCountLongTerm = FALSE,
                                                                useDistinctObservationCountMediumTerm = FALSE,
                                                                useDistinctObservationCountShortTerm = FALSE,
                                                                useVisitCountLongTerm = FALSE,
                                                                useVisitCountMediumTerm = FALSE,
                                                                useVisitCountShortTerm = FALSE,
                                                                useVisitConceptCountLongTerm = FALSE,
                                                                useVisitConceptCountMediumTerm = FALSE,
                                                                useVisitConceptCountShortTerm = FALSE,
                                                                longTermStartDays = 365,
                                                                mediumTermStartDays = 180,
                                                                shortTermStartDays = 30,
                                                                endDays = 0,
                                                                includedCovariateConceptIds = c(),
                                                                addDescendantsToInclude = FALSE,
                                                                excludedCovariateConceptIds = c(),
                                                                addDescendantsToExclude = FALSE,
                                                                includedCovariateIds = c())
jreps commented 6 years ago

Hi James, Thanks for letting us know about this issue. I had a look at the code and there seem to be checks for NULL using length(), so I think it might be a different issue. Did you run getPlpData() to get the data, and how much space do you have on the drive you're using to run the code? Also, what version of PatientLevelPrediction are you using?
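
For reference, here is a minimal sketch of the kind of length() guard mentioned above. It uses a plain data.frame as a stand-in for the ffdf covariates, and `filterCovariates` and the toy data are hypothetical names for illustration only, not part of PatientLevelPrediction.

```r
# Hypothetical helper: drops covariates whose ids appear in deletedIds.
# A plain data.frame stands in for the package's ffdf object (assumption).
filterCovariates <- function(covariates, deletedIds) {
  # length(NULL) is 0, so a NULL id list skips the filter entirely
  if (length(deletedIds) != 0) {
    keep <- !(covariates$covariateId %in% deletedIds)
    covariates <- covariates[keep, , drop = FALSE]
  }
  covariates
}

# Toy data, invented for this example
toy <- data.frame(rowId = c(1, 1, 2),
                  covariateId = c(1001, 1002, 1001),
                  covariateValue = c(1, 1, 1))

unfiltered <- filterCovariates(toy, NULL)      # guard skips the filter
filtered <- filterCovariates(toy, c(1002))     # one row removed
```

If the guard is present, a NULL list of deleted ids never reaches the matching code, which is why an older install without the check would behave differently.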

JamesSWiggins commented 6 years ago

Hi, Thanks for your reply. Yes, I run the getPlpData() statement below:

plpData <- PatientLevelPrediction::getPlpData(connectionDetails = connectionDetails,
                                              cdmDatabaseSchema = cdmDatabaseSchema,
                                              cohortId = targetCohortId,
                                              outcomeIds = outcomeList,
                                              studyStartDate = "",
                                              studyEndDate = "",
                                              cohortDatabaseSchema = cohortsDatabaseSchema,
                                              cohortTable = cohortTable,
                                              outcomeDatabaseSchema = cohortsDatabaseSchema,
                                              outcomeTable = outcomeTable,
                                              cdmVersion = cdmVersion,
                                              firstExposureOnly = FALSE,
                                              washoutPeriod = 0,
                                              sampleSize = 100000,
                                              covariateSettings = covariateSettings)

There is about 93GB of free space on the drive that contains my home directory:

Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         16G   72K   16G   1% /dev
tmpfs            16G     0   16G   0% /dev/shm
/dev/xvda1       99G  5.8G   93G   6% /
/dev/xvdb1      5.0G   52M  5.0G   2% /emr
/dev/xvdb2       95G  589M   95G   1% /mnt

Below is my sessionInfo(). Looks like I'm using PatientLevelPrediction 2.0.0:

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Amazon Linux AMI 2017.09

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                  LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8       
 [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8       LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8          
 [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8      LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] shiny_1.0.5                  PatientLevelPrediction_2.0.0 Cyclops_1.3.1                FeatureExtraction_2.1.1     
[5] DatabaseConnector_2.0.5     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.16         bindr_0.1.1          compiler_3.4.1       pillar_1.2.1         futile.logger_1.4.3  plyr_1.8.4          
 [7] futile.options_1.0.0 tools_3.4.1          digest_0.6.15        bit_1.1-12           viridisLite_0.3.0    jsonlite_1.5        
[13] tibble_1.4.2         gtable_0.2.0         lattice_0.20-35      ff_2.2-13            pkgconfig_2.0.1      rlang_0.2.0         
[19] Matrix_1.2-10        fastmatch_1.1-0      bindrcpp_0.2         rJava_0.9-9          httr_1.3.1           dplyr_0.7.4         
[25] htmlwidgets_1.0      grid_3.4.1           glue_1.2.0           data.table_1.10.4-3  ffbase_0.12.3        R6_2.2.2            
[31] plotly_4.7.1         survival_2.41-3      tidyr_0.8.0          purrr_0.2.4          magrittr_1.5         ggplot2_2.2.1       
[37] SqlRender_1.4.8      lambda.r_1.2         htmltools_0.3.6      scales_0.5.0         splines_3.4.1        assertthat_0.2.0    
[43] xtable_1.8-2         mime_0.5             colorspace_1.3-2     httpuv_1.3.6.2       RcppParallel_4.4.0   lazyeval_0.2.1      
[49] munsell_0.4.3       
jreps commented 6 years ago

Thanks for the extra info. If you installed PatientLevelPrediction 2 months ago or longer can you try installing the latest master version and see whether that fixes things (plus it has a lot more functions to make validation easier now)? I added the length() check about 3 months ago I think, so hopefully you just have the code that wasn't checking for NULL. If that doesn't solve things (or you only installed PatientLevelPrediction a few days ago), then is it possible to send me the log file created when you do runPlp()?

JamesSWiggins commented 6 years ago

Hmm... Should I be installing it from a source other than DRAT? Below is a log on my installation. I just installed it a few days ago and it looks like I'm pulling 2.0.0:

> drat::addRepo("OHDSI")
> install.packages("PatientLevelPrediction")
Installing package into ‘/home/wigginjs/R/x86_64-redhat-linux-gnu-library/3.4’
(as ‘lib’ is unspecified)
trying URL 'https://OHDSI.github.io/drat/src/contrib/PatientLevelPrediction_2.0.0.tar.gz'
jreps commented 6 years ago

Sorry, I mean the log that is created when you do runPlp(). It is called plplog.txt. That way I can see where it is failing. If you type length(NULL), does it return 0 for you on Linux?

JamesSWiggins commented 6 years ago

Hi, Thank you for your help.

Yes, length(NULL) does return 0. See below:

> length(NULL)
[1] 0

Also, here is the content of plplog.txt:

************************************************************************************************
Patient-Level Prediction Package version 2.0.0
************************************************************************************************
AnalysisID:         20180324150554
CohortID:           9
OutcomeID:          7
Cohort size:        401
Covariates:         4065
Population size:    401
Cases:              57
************************************************************************************************
Creating 25% test and 75% train (into 3 folds) stratified split at 2008-06-10
Data split into 100 test cases and 301 train samples (101, 101, 99)
************************************************************************************************
Training Lasso Logistic Regression model
Removing infrequent covariates
Removing infrequent covariates took 0.00568 secs
Normalizing covariates
Normalizing covariates took 0.0544 secs
Removing redundant covariates
Removing redundant covariates took 0.0967 secs
Model saved to ..\20180324150554\savedModel
************************************************************************************************
Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered, : need vmode or initdata
JamesSWiggins commented 6 years ago

Something that may be related is the actual Python binary that is being called by Plp. I have Anaconda 2 installed and reference it first in my PATH environment variable, as shown below:

> Sys.setenv(PATH=":/usr/anaconda2/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin")
> 
> system("python --version")
Python 2.7.14 :: Anaconda, Inc.

But when I run PatientLevelPrediction::checkPlpInstallation() I get the following results:

.......
************************************************************************************************
************************************************************************************************
Calculating covariate summary @ 2018-03-24 16:54:14
This can take a while...
Finished covariate summary @ 2018-03-24 16:54:21
Log saved to /home/wigginjs/output/output/output/output/output/plpmodels/20180324165359/plplog.txt
Run finished successfully.
- Ok

Initialize Python Version 2.7.13 (default, Jan 31 2018, 00:17:36) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)]
......

Does Plp use the OS to find the Python binary? Or is the path to it determined in another way?

jreps commented 6 years ago

Hi James,

I think the issue is that the covariates are becoming empty. I've now added checks for this in these two functions:

applyTidyCovariateData <- function(plpData,preprocessSettings){

  covariates <- plpData$covariates

  maxs <- preprocessSettings$normFactors
  deleteCovariateIds <- preprocessSettings$deletedRedundantCovariateIds
  deletedInfrequentCovariateIds <- preprocessSettings$deletedInfrequentCovariateIds

  writeLines("Removing infrequent covariates")
  start <- Sys.time()
  if (length(deletedInfrequentCovariateIds) != 0) {
    idx <- !ffbase::`%in%`(covariates$covariateId, deletedInfrequentCovariateIds)
    if(sum(idx)>0){
      covariates <- covariates[idx, ]
    } else{
      stop('No covariates left')
    }
  }
  delta <- Sys.time() - start
  writeLines(paste("Removing infrequent covariates took", signif(delta, 3), attr(delta, "units")))

  writeLines("Normalizing covariates")
  start <- Sys.time()
  ffdfMaxs <- ff::as.ffdf(maxs)
  names(ffdfMaxs)[names(ffdfMaxs) == "bins"] <- "covariateId"
  covariates <- ffbase::merge.ffdf(covariates, ffdfMaxs)
  for (i in bit::chunk(covariates)) {
    covariates$covariateValue[i] <- covariates$covariateValue[i]/covariates$maxs[i]
  }
  covariates$maxs <- NULL
  delta <- Sys.time() - start
  writeLines(paste("Normalizing covariates took", signif(delta, 3), attr(delta, "units")))

  writeLines("Removing redundant covariates")
  start <- Sys.time()
  if (length(deleteCovariateIds) != 0) {
    idx <- !ffbase::`%in%`(covariates$covariateId, deleteCovariateIds)
    if(sum(idx)>0){
      covariates <- covariates[idx, ]
    } else{
      stop('No covariates left')
    }
  }
  delta <- Sys.time() - start
  writeLines(paste("Removing redundant covariates took", signif(delta, 3), attr(delta, "units")))

  plpData$covariates <- covariates

  return(plpData)
}

and

limitCovariatesToPopulation <- function(covariates, rowIds) {
  idx <- !is.na(ffbase::ffmatch(covariates$rowId, rowIds))
  if(sum(idx)!=0){
    covariates <- covariates[ffbase::ffwhich(idx, idx == TRUE), ]
  }else{
    stop('No covariates')
  }
  return(covariates)
}

Try adding these functions to your session and then running runPlp. It should then use the updated versions and tell you whether the covariates are becoming empty.

The Python issue is a problem, but it isn't causing the current error. On Windows, the R package we use to connect to Python can detect the Python version we want, but on Linux it will use the default Python you have set up (i.e., the one you get when typing python in the terminal). You need to configure Anaconda to be the default Python (having it first in the path should work) and then restart R for the Python code to run.
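
As a rough way to check the search order from inside R, the sketch below splits a PATH string into its entries and reports the first non-empty one. `pathEntries` is a hypothetical helper written for this example, and the `sep` default assumes a Unix-style `:` separator. `Sys.which()` is base R and reports the binary that would actually be resolved.

```r
# Hypothetical helper: split a PATH string into its non-empty entries, in search order.
# sep = ":" assumes a Unix-like system.
pathEntries <- function(path = Sys.getenv("PATH"), sep = ":") {
  entries <- strsplit(path, sep, fixed = TRUE)[[1]]
  entries[nzchar(entries)]  # a leading ":" produces an empty entry; drop it for display
}

# PATH value taken from the Sys.setenv() call earlier in this thread
examplePath <- ":/usr/anaconda2/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin"
firstEntry <- pathEntries(examplePath)[1]

# Which python binary the current R session would launch ("" if none is found)
resolvedPython <- Sys.which("python")
```

If `resolvedPython` does not point into the Anaconda directory after restarting R, the PATH seen by R is probably not the one you expect.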

JamesSWiggins commented 6 years ago

Thanks. I hope I did this right. I pasted in your new function code using the following R commands:

fixInNamespace("applyTidyCovariateData", ns = "PatientLevelPrediction")
fixInNamespace("limitCovariatesToPopulation", ns = "PatientLevelPrediction")

And below is the result from plplog.txt:

************************************************************************************************
Patient-Level Prediction Package version 2.0.0
************************************************************************************************
AnalysisID:         20180327125619
CohortID:           9
OutcomeID:          7
Cohort size:        401
Covariates:         4065
Population size:    401
Cases:              57
************************************************************************************************
Creating 25% test and 75% train (into 3 folds) stratified split at 2008-06-10
Data split into 100 test cases and 301 train samples (101, 101, 99)
************************************************************************************************
Training Lasso Logistic Regression model
Removing infrequent covariates
Removing infrequent covariates took 0.0056 secs
Normalizing covariates
Normalizing covariates took 0.0275 secs
Removing redundant covariates
Removing redundant covariates took 0.035 secs
No non-zero coefficients
No non-zero coefficients
Model saved to ..\20180327125619\savedModel
************************************************************************************************
simpleWarning in predictFfdf(predictiveModel$coefficients, population, covariates, : Model had no non-zero coefficients so predicted same for all population...

Also, to the console: Error: $ operator is invalid for atomic vectors

Does this indicate a problem with my input data?

jreps commented 6 years ago

Hi James,

Thanks for running the updated code. It looks like the code fixed the issue, but now that it has run, the logistic regression didn't select any variables for the model (hence the "No non-zero coefficients" output). In my experience this normally means the data are not able to predict the outcome well. If you try a gradient boosting machine instead of the logistic regression, you should get a model, but the performance will probably be poor.
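
To illustrate what "no non-zero coefficients" implies, here is a small sketch with a hypothetical coefficient vector. The covariate ids are invented, and the intercept is set from the overall counts in the log (57 cases out of 401) purely for illustration; the fitted intercept in a real run would come from the training data.

```r
# Hypothetical coefficient vector: every covariate shrunk to zero by the lasso
coefficients <- c("(Intercept)" = qlogis(57 / 401), "1001" = 0, "1002" = 0)

# Look only at the covariate coefficients, not the intercept
covariateCoefs <- coefficients[names(coefficients) != "(Intercept)"]
hasNonZero <- any(covariateCoefs != 0)

# With only an intercept left, a logistic model predicts the same risk for everyone
constantRisk <- plogis(coefficients[["(Intercept)"]])
```

A model like this cannot rank patients at all, which matches the "predicted same for all population" warning in the log.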

Also, once Anaconda is your default Python, you can try the random forest or neural network models, which use a Python back end.

Please let me know if you have any other issues :)

Best wishes, Jenna

JamesSWiggins commented 6 years ago

Hi Jenna,

Hah, yes, you are correct. GradientBoostingMachine did complete successfully. Are you going to commit your code changes? Will I be able to pull code with them through DRAT?

Thank you so much for your help!!

jreps commented 6 years ago

Hi James,

Great, I added the edits and the new version with the fix should be 2.0.1.

Best wishes, Jenna