Closed JamesSWiggins closed 6 years ago
Digging a little further. I believe this error is occurring because my "deletedInfrequentCovariateIds <- preprocessSettings$deletedInfrequentCovariateIds" is NULL.
That value gets passed as 'table' and is used in the call below:
ffmatch(x = x, table = as.ff(table), nomatch = 0L) > 0L
And ffmatch contains this code:
stopifnot(inherits(x, "ff_vector") & inherits(table, "ff_vector"))
Which results in the original message I posted:
Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered, : need vmode or initdata
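A toy sketch of the failure mode described above, in base R only. The names toyAsFf and toyMatch are hypothetical stand-ins for the ff/ffbase conversion and matching calls, not the real functions. It assumes that a NULL table fails during conversion, which is what the "need vmode or initdata" text suggests, and that the type check merely forces that conversion to be evaluated:

```r
# Toy stand-in for the ff conversion (hypothetical, not ff::as.ff itself):
# it refuses NULL input with the message seen in the thread.
toyAsFf <- function(x) {
  if (is.null(x)) {
    stop("need vmode or initdata")
  }
  x
}

# Toy stand-in for the matching helper: the type check is the first place
# the lazily evaluated `table` argument gets forced.
toyMatch <- function(x, table) {
  stopifnot(is.numeric(x) & is.numeric(table))
  match(x, table, nomatch = 0L) > 0L
}

covariateIds <- c(101, 102, 103)
deletedIds <- NULL

# Unguarded call: the error message comes from the conversion step.
unguarded <- tryCatch(
  toyMatch(covariateIds, toyAsFf(deletedIds)),
  error = function(e) conditionMessage(e)
)

# Guarded call: skip the match entirely when there is nothing to delete.
keep <- if (length(deletedIds) == 0) {
  rep(TRUE, length(covariateIds))
} else {
  !toyMatch(covariateIds, toyAsFf(deletedIds))
}
```

If this reading is right, a length() check before the match is enough to avoid the error whenever the list of ids to delete is NULL or empty.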
If it's relevant, here are my covariate settings:
useDemographicsAge = TRUE,
useDemographicsAgeGroup = TRUE,
useDemographicsRace = TRUE,
useDemographicsEthnicity = TRUE,
useDemographicsIndexYear = TRUE,
useDemographicsIndexMonth = TRUE,
useDemographicsPriorObservationTime = FALSE,
useDemographicsPostObservationTime = FALSE,
useDemographicsTimeInCohort = FALSE,
useDemographicsIndexYearMonth = FALSE,
useConditionOccurrenceAnyTimePrior = FALSE,
useConditionOccurrenceLongTerm = FALSE,
useConditionOccurrenceMediumTerm = FALSE,
useConditionOccurrenceShortTerm = FALSE,
useConditionOccurrenceInpatientAnyTimePrior = FALSE,
useConditionOccurrenceInpatientLongTerm = FALSE,
useConditionOccurrenceInpatientMediumTerm = FALSE,
useConditionOccurrenceInpatientShortTerm = FALSE,
useConditionEraAnyTimePrior = FALSE,
useConditionEraLongTerm = FALSE,
useConditionEraMediumTerm = FALSE,
useConditionEraShortTerm = FALSE,
useConditionEraOverlapping = FALSE,
useConditionEraStartLongTerm = FALSE,
useConditionEraStartMediumTerm = FALSE,
useConditionEraStartShortTerm = FALSE,
useConditionGroupEraAnyTimePrior = FALSE,
useConditionGroupEraLongTerm = TRUE,
useConditionGroupEraMediumTerm = TRUE,
useConditionGroupEraShortTerm = TRUE,
useConditionGroupEraOverlapping = FALSE,
useConditionGroupEraStartLongTerm = FALSE,
useConditionGroupEraStartMediumTerm = FALSE,
useConditionGroupEraStartShortTerm = FALSE,
useDrugExposureAnyTimePrior = TRUE,
useDrugExposureLongTerm = TRUE,
useDrugExposureMediumTerm = TRUE,
useDrugExposureShortTerm = TRUE,
useDrugEraAnyTimePrior = TRUE,
useDrugEraLongTerm = TRUE,
useDrugEraMediumTerm = TRUE,
useDrugEraShortTerm = TRUE,
useDrugEraOverlapping = TRUE,
useDrugEraStartLongTerm = TRUE,
useDrugEraStartMediumTerm = TRUE,
useDrugEraStartShortTerm = TRUE,
useDrugGroupEraAnyTimePrior = TRUE,
useDrugGroupEraLongTerm = TRUE,
useDrugGroupEraMediumTerm = TRUE,
useDrugGroupEraShortTerm = TRUE,
useDrugGroupEraOverlapping = TRUE,
useDrugGroupEraStartLongTerm = TRUE,
useDrugGroupEraStartMediumTerm = TRUE,
useDrugGroupEraStartShortTerm = TRUE,
useProcedureOccurrenceAnyTimePrior = FALSE,
useProcedureOccurrenceLongTerm = TRUE,
useProcedureOccurrenceMediumTerm = TRUE,
useProcedureOccurrenceShortTerm = TRUE,
useDeviceExposureAnyTimePrior = FALSE,
useDeviceExposureLongTerm = TRUE,
useDeviceExposureMediumTerm = TRUE,
useDeviceExposureShortTerm = TRUE,
useMeasurementAnyTimePrior = FALSE,
useMeasurementLongTerm = TRUE,
useMeasurementMediumTerm = TRUE,
useMeasurementShortTerm = TRUE,
useMeasurementValueAnyTimePrior = FALSE,
useMeasurementValueLongTerm = FALSE,
useMeasurementValueMediumTerm = FALSE,
useMeasurementValueShortTerm = FALSE,
useMeasurementRangeGroupAnyTimePrior = FALSE,
useMeasurementRangeGroupLongTerm = TRUE,
useMeasurementRangeGroupMediumTerm = FALSE,
useMeasurementRangeGroupShortTerm = FALSE,
useObservationAnyTimePrior = FALSE,
useObservationLongTerm = TRUE,
useObservationMediumTerm = TRUE,
useObservationShortTerm = TRUE,
useCharlsonIndex = TRUE,
useDcsi = TRUE,
useChads2 = TRUE,
useChads2Vasc = TRUE,
useDistinctConditionCountLongTerm = FALSE,
useDistinctConditionCountMediumTerm = FALSE,
useDistinctConditionCountShortTerm = FALSE,
useDistinctIngredientCountLongTerm = FALSE,
useDistinctIngredientCountMediumTerm = FALSE,
useDistinctIngredientCountShortTerm = FALSE,
useDistinctProcedureCountLongTerm = FALSE,
useDistinctProcedureCountMediumTerm = FALSE,
useDistinctProcedureCountShortTerm = FALSE,
useDistinctMeasurementCountLongTerm = FALSE,
useDistinctMeasurementCountMediumTerm = FALSE,
useDistinctMeasurementCountShortTerm = FALSE,
useDistinctObservationCountLongTerm = FALSE,
useDistinctObservationCountMediumTerm = FALSE,
useDistinctObservationCountShortTerm = FALSE,
useVisitCountLongTerm = FALSE,
useVisitCountMediumTerm = FALSE,
useVisitCountShortTerm = FALSE,
useVisitConceptCountLongTerm = FALSE,
useVisitConceptCountMediumTerm = FALSE,
useVisitConceptCountShortTerm = FALSE,
longTermStartDays = 365,
mediumTermStartDays = 180,
shortTermStartDays = 30,
endDays = 0,
includedCovariateConceptIds = c(),
addDescendantsToInclude = FALSE,
excludedCovariateConceptIds = c(),
addDescendantsToExclude = FALSE,
includedCovariateIds = c())
Hi James, Thanks for letting us know about this issue. I had a look at the code and there seem to be checks for NULL using length(), so I think it might be a different issue. Did you run getPlpData() to get the data, and how much space do you have on the drive you're using to run the code? Also, what version of PatientLevelPrediction are you using?
Hi, Thanks for your reply. Yes, I run the getPlpData() statement below:
plpData <- PatientLevelPrediction::getPlpData(connectionDetails = connectionDetails,
                                              cdmDatabaseSchema = cdmDatabaseSchema,
                                              cohortId = targetCohortId,
                                              outcomeIds = outcomeList,
                                              studyStartDate = "",
                                              studyEndDate = "",
                                              cohortDatabaseSchema = cohortsDatabaseSchema,
                                              cohortTable = cohortTable,
                                              outcomeDatabaseSchema = cohortsDatabaseSchema,
                                              outcomeTable = outcomeTable,
                                              cdmVersion = cdmVersion,
                                              firstExposureOnly = FALSE,
                                              washoutPeriod = 0,
                                              sampleSize = 100000,
                                              covariateSettings = covariateSettings)
There is about 93GB of free space on the drive that contains my home directory:
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         16G   72K   16G   1% /dev
tmpfs            16G     0   16G   0% /dev/shm
/dev/xvda1       99G  5.8G   93G   6% /
/dev/xvdb1      5.0G   52M  5.0G   2% /emr
/dev/xvdb2       95G  589M   95G   1% /mnt
Below is my sessionInfo(). Looks like I'm using PatientLevelPrediction 2.0.0:
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Amazon Linux AMI 2017.09
Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8
[9] LC_ADDRESS=en_US.UTF-8 LC_TELEPHONE=en_US.UTF-8 LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] shiny_1.0.5 PatientLevelPrediction_2.0.0 Cyclops_1.3.1 FeatureExtraction_2.1.1
[5] DatabaseConnector_2.0.5
loaded via a namespace (and not attached):
[1] Rcpp_0.12.16 bindr_0.1.1 compiler_3.4.1 pillar_1.2.1 futile.logger_1.4.3 plyr_1.8.4
[7] futile.options_1.0.0 tools_3.4.1 digest_0.6.15 bit_1.1-12 viridisLite_0.3.0 jsonlite_1.5
[13] tibble_1.4.2 gtable_0.2.0 lattice_0.20-35 ff_2.2-13 pkgconfig_2.0.1 rlang_0.2.0
[19] Matrix_1.2-10 fastmatch_1.1-0 bindrcpp_0.2 rJava_0.9-9 httr_1.3.1 dplyr_0.7.4
[25] htmlwidgets_1.0 grid_3.4.1 glue_1.2.0 data.table_1.10.4-3 ffbase_0.12.3 R6_2.2.2
[31] plotly_4.7.1 survival_2.41-3 tidyr_0.8.0 purrr_0.2.4 magrittr_1.5 ggplot2_2.2.1
[37] SqlRender_1.4.8 lambda.r_1.2 htmltools_0.3.6 scales_0.5.0 splines_3.4.1 assertthat_0.2.0
[43] xtable_1.8-2 mime_0.5 colorspace_1.3-2 httpuv_1.3.6.2 RcppParallel_4.4.0 lazyeval_0.2.1
[49] munsell_0.4.3
Thanks for the extra info. If you installed PatientLevelPrediction two or more months ago, can you try installing the latest master version and see whether that fixes things (it also has a lot more functions to make validation easier now)? I added the length() check about 3 months ago, I think, so hopefully you just have the code that wasn't checking for NULL. If that doesn't solve things (or you only installed PatientLevelPrediction a few days ago), is it possible to send me the log file created when you do runPlp()?
Hmm... Should I be installing it from a source other than DRAT? Below is a log of my installation. I just installed it a few days ago and it looks like I'm pulling 2.0.0:
> drat::addRepo("OHDSI")
> install.packages("PatientLevelPrediction")
Installing package into ‘/home/wigginjs/R/x86_64-redhat-linux-gnu-library/3.4’
(as ‘lib’ is unspecified)
trying URL 'https://OHDSI.github.io/drat/src/contrib/PatientLevelPrediction_2.0.0.tar.gz'
Sorry, I mean the log that is created when you do runPlp() - it is called plplog.txt. That way I can see where it is failing. If you type length(NULL) does it return 0 for you on linux?
Hi, Thank you for your help.
Yes, length(NULL) does return 0. See following:
> length(NULL)
[1] 0
Also, here is the content of plplog.txt:
************************************************************************************************
Patient-Level Prediction Package version 2.0.0
************************************************************************************************
AnalysisID: 20180324150554
CohortID: 9
OutcomeID: 7
Cohort size: 401
Covariates: 4065
Population size: 401
Cases: 57
************************************************************************************************
Creating 25% test and 75% train (into 3 folds) stratified split at 2008-06-10
Data split into 100 test cases and 301 train samples (101, 101, 99)
************************************************************************************************
Training Lasso Logistic Regression model
Removing infrequent covariates
Removing infrequent covariates took 0.00568 secs
Normalizing covariates
Normalizing covariates took 0.0544 secs
Removing redundant covariates
Removing redundant covariates took 0.0967 secs
Model saved to ..\20180324150554\savedModel
************************************************************************************************
Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered, : need vmode or initdata
Something that may be related is the actual Python binary that is being called by Plp. I have Anaconda 2 installed and reference it first in my PATH environment variable, as shown below:
> Sys.setenv(PATH=":/usr/anaconda2/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin")
>
> system("python --version")
Python 2.7.14 :: Anaconda, Inc.
But when I run PatientLevelPrediction::checkPlpInstallation() I get the following results:
.......
************************************************************************************************
************************************************************************************************
Calculating covariate summary @ 2018-03-24 16:54:14
This can take a while...
Finished covariate summary @ 2018-03-24 16:54:21
Log saved to /home/wigginjs/output/output/output/output/output/plpmodels/20180324165359/plplog.txt
Run finished successfully.
- Ok
Initialize Python Version 2.7.13 (default, Jan 31 2018, 00:17:36)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)]
......
Does Plp use the OS to find the Python binary? Or is the path to it determined in another way?
Hi James,
I think the issue is that the covariates are becoming empty. I've now added checks for this in these two functions:
applyTidyCovariateData <- function(plpData, preprocessSettings) {
  covariates <- plpData$covariates
  maxs <- preprocessSettings$normFactors
  deleteCovariateIds <- preprocessSettings$deletedRedundantCovariateIds
  deletedInfrequentCovariateIds <- preprocessSettings$deletedInfrequentCovariateIds
  writeLines("Removing infrequent covariates")
  start <- Sys.time()
  if (length(deletedInfrequentCovariateIds) != 0) {
    idx <- !ffbase::`%in%`(covariates$covariateId, deletedInfrequentCovariateIds)
    if (sum(idx) > 0) {
      covariates <- covariates[idx, ]
    } else {
      stop('No covariates left')
    }
  }
  delta <- Sys.time() - start
  writeLines(paste("Removing infrequent covariates took", signif(delta, 3), attr(delta, "units")))
  writeLines("Normalizing covariates")
  start <- Sys.time()
  ffdfMaxs <- ff::as.ffdf(maxs)
  names(ffdfMaxs)[names(ffdfMaxs) == "bins"] <- "covariateId"
  covariates <- ffbase::merge.ffdf(covariates, ffdfMaxs)
  for (i in bit::chunk(covariates)) {
    covariates$covariateValue[i] <- covariates$covariateValue[i]/covariates$maxs[i]
  }
  covariates$maxs <- NULL
  delta <- Sys.time() - start
  writeLines(paste("Normalizing covariates took", signif(delta, 3), attr(delta, "units")))
  writeLines("Removing redundant covariates")
  start <- Sys.time()
  if (length(deleteCovariateIds) != 0) {
    idx <- !ffbase::`%in%`(covariates$covariateId, deleteCovariateIds)
    if (sum(idx) > 0) {
      covariates <- covariates[idx, ]
    } else {
      stop('No covariates left')
    }
  }
  delta <- Sys.time() - start
  writeLines(paste("Removing redundant covariates took", signif(delta, 3), attr(delta, "units")))
  plpData$covariates <- covariates
  return(plpData)
}
and
limitCovariatesToPopulation <- function(covariates, rowIds) {
  idx <- !is.na(ffbase::ffmatch(covariates$rowId, rowIds))
  if (sum(idx) != 0) {
    covariates <- covariates[ffbase::ffwhich(idx, idx == TRUE), ]
  } else {
    stop('No covariates')
  }
  return(covariates)
}
Try adding these functions to your session and then running runPlp(); it should then use the updated versions and will tell you whether the covariates are ending up empty.
The Python issue is a problem, but it isn't causing the current issue. On Windows, the R package we use to connect to Python can detect the Python version we want, but on Linux it will use the default Python you have set up (e.g., the python you get when typing python in the terminal). You need to configure Anaconda to be the default Python (having it first in the path should work) and then restart R for the Python code to run.
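One way to check which python binary a fresh R session would pick up from the PATH is sketched below, using only base R. This only shows what the operating system lookup resolves to; whether the Python bridge used by PatientLevelPrediction follows the same lookup is an assumption here, not something confirmed in this thread:

```r
# Full path of the first `python` found on the PATH seen by this R session.
# Sys.which() returns an empty string when nothing is found.
pythonPath <- Sys.which("python")
print(pythonPath)

# PATH entries in lookup order; an Anaconda bin directory would need to come
# before the system directories to win the lookup.
pathEntries <- strsplit(Sys.getenv("PATH"), .Platform$path.sep, fixed = TRUE)[[1]]
print(head(pathEntries))
```

Running this right after restarting R, before any Sys.setenv() call, shows the default the session starts with.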
Thanks. I hope I did this right. I pasted in your new function code using the following R commands:
fixInNamespace("applyTidyCovariateData", ns = "PatientLevelPrediction")
fixInNamespace("limitCovariatesToPopulation", ns = "PatientLevelPrediction")
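A non-interactive alternative to fixInNamespace() is sketched below using utils::assignInNamespace(). The helper name patchNamespaceFunction is hypothetical, and the sketch assumes the two updated functions have already been pasted into the global environment. The replacement only lasts for the current R session:

```r
# Hypothetical helper: swap a function inside an installed package's
# namespace for the current session only.
patchNamespaceFunction <- function(name, replacement, ns) {
  if (!requireNamespace(ns, quietly = TRUE)) {
    message("Package '", ns, "' is not installed; nothing patched.")
    return(invisible(FALSE))
  }
  # Let the replacement see the package's internal (unexported) objects.
  environment(replacement) <- asNamespace(ns)
  utils::assignInNamespace(name, replacement, ns = ns)
  invisible(TRUE)
}

# Usage (commented out because it needs PatientLevelPrediction installed and
# the updated functions defined in the global environment):
# patchNamespaceFunction("applyTidyCovariateData",
#                        applyTidyCovariateData, "PatientLevelPrediction")
# patchNamespaceFunction("limitCovariatesToPopulation",
#                        limitCovariatesToPopulation, "PatientLevelPrediction")
```

The helper returns FALSE without raising an error when the package is missing, so it can sit at the top of a script.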
And below is the result from plplog.txt:
************************************************************************************************
Patient-Level Prediction Package version 2.0.0
************************************************************************************************
AnalysisID: 20180327125619
CohortID: 9
OutcomeID: 7
Cohort size: 401
Covariates: 4065
Population size: 401
Cases: 57
************************************************************************************************
Creating 25% test and 75% train (into 3 folds) stratified split at 2008-06-10
Data split into 100 test cases and 301 train samples (101, 101, 99)
************************************************************************************************
Training Lasso Logistic Regression model
Removing infrequent covariates
Removing infrequent covariates took 0.0056 secs
Normalizing covariates
Normalizing covariates took 0.0275 secs
Removing redundant covariates
Removing redundant covariates took 0.035 secs
No non-zero coefficients
No non-zero coefficients
Model saved to ..\20180327125619\savedModel
************************************************************************************************
simpleWarning in predictFfdf(predictiveModel$coefficients, population, covariates, : Model had no non-zero coefficients so predicted same for all population...
Also, to the console:
Error: $ operator is invalid for atomic vectors
Does this indicate a problem with my input data?
Hi James,
Thanks for running the updated code. It looks like the code fixed the issue, but now that it has run, the logistic regression didn't select any variables for the model (the "No non-zero coefficients" output). In my experience this normally means the data are not able to predict the outcome well. If you try a gradient boosting machine instead of the logistic regression, you should get a model, but the performance will probably be poor.
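To illustrate what a model with no non-zero coefficients does, here is a minimal base R sketch. It uses an intercept-only logistic regression fitted with glm(), not the Cyclops lasso fit that PatientLevelPrediction runs, and it borrows the counts from the log above (57 cases in a population of 401) purely for illustration, ignoring the train/test split:

```r
# Outcome vector built from the counts in the log: 57 cases out of 401.
outcome <- c(rep(1, 57), rep(0, 401 - 57))

# Intercept-only logistic regression: no covariate carries a coefficient.
fit <- glm(outcome ~ 1, family = binomial())
predicted <- unname(predict(fit, type = "response"))

# Every person gets the same predicted risk, equal to the outcome rate.
baseRate <- mean(outcome)
print(range(predicted))
print(baseRate)
```

Because every prediction is the same, such a model cannot rank patients by risk, which matches the "predicted same for all population" warning in the log.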
Also, if you make Anaconda the default Python, you can also try the random forest or neural network models that use a Python back end.
Please let me know if you have any other issues :)
Best wishes, Jenna
Hi Jenna,
Hah, yes, you are correct. GradientBoostingMachine did complete successfully. Are you going to commit your code changes? Will I be able to pull code with them through DRAT?
Thank you so much for your help!!
Hi James,
Great, I added the edits and the new version with the fix should be 2.0.1.
Best wishes, Jenna
I receive the below error,
Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered, : need vmode or initdata
when running the following code:
I receive the same error even if I use different statistical models. I've tried to trace the error and it seems to occur when the following code is run:
From predictPlp:
Could this be the result of an issue with my input data? Or perhaps my environment?