issue creating a random forest model

adamstruck commented 5 years ago

Describe the bug

Can't run a random forest model.

checkPlpInstallation always fails, and only on the random forest test.

To Reproduce

library(PatientLevelPrediction)

set.seed(1234)
data(plpDataSimulationProfile)
sampleSize <- 2000
plpData <- simulatePlpData(plpDataSimulationProfile, n = sampleSize)

population <- createStudyPopulation(plpData,
                                    outcomeId = 2,
                                    firstExposureOnly = FALSE,
                                    washoutPeriod = 0,
                                    removeSubjectsWithPriorOutcome = FALSE,
                                    priorOutcomeLookback = 99999,
                                    requireTimeAtRisk = FALSE,
                                    minTimeAtRisk=0,
                                    riskWindowStart = 0,
                                    addExposureDaysToStart = FALSE,
                                    riskWindowEnd = 365,
                                    addExposureDaysToEnd = FALSE,
                                    verbosity="DEBUG")

modset <- PatientLevelPrediction::setRandomForest()
model <- tryCatch({
    PatientLevelPrediction::runPlp(population, plpData, modelSettings = modset,
                                   testFraction = 0.5, nfold = 3,
                                   minCovariateFraction = 0,
                                   saveEvaluation = F,
                                   savePlpData = F,
                                   savePlpResult = F,
                                   savePlpPlots = F)
}, error = function(e) {
    message(e)
})

sessionInfo()

Log File

# R --vanilla < /scratch/crash.R

R version 3.3.3 (2017-03-06) -- "Another Canoe"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(PatientLevelPrediction)
Loading required package: DatabaseConnector
Loading required package: FeatureExtraction
Loading required package: Cyclops

Attaching package: ‘PatientLevelPrediction’

The following object is masked from ‘package:FeatureExtraction’:

    bySumFf

>
> set.seed(1234)
> data(plpDataSimulationProfile)
> sampleSize <- 2000
> plpData <- simulatePlpData(plpDataSimulationProfile, n = sampleSize)
Generating covariates
Generating cohorts
Generating outcomes
Generating exclusion
>
> population <- createStudyPopulation(plpData,
+                                     outcomeId = 2,
+                                     firstExposureOnly = FALSE,
+                                     washoutPeriod = 0,
+                                     removeSubjectsWithPriorOutcome = FALSE,
+                                     priorOutcomeLookback = 99999,
+                                     requireTimeAtRisk = FALSE,
+                                     minTimeAtRisk=0,
+                                     riskWindowStart = 0,
+                                     addExposureDaysToStart = FALSE,
+                                     riskWindowEnd = 365,
+                                     addExposureDaysToEnd = FALSE,
+                                     verbosity="DEBUG")
Outcome is 0 or 1
>
> modset <- PatientLevelPrediction::setRandomForest()
> model <- tryCatch({
+     PatientLevelPrediction::runPlp(population, plpData, modelSettings = modset,
+                                    testFraction = 0.5, nfold = 3,
+                                    minCovariateFraction = 0,
+                                    saveEvaluation = F,
+                                    savePlpData = F,
+                                    savePlpResult = F,
+                                    savePlpPlots = F)
+ }, error = function(e) {
+     message(e)
+ })
Patient-Level Prediction Package version 3.0.6
AnalysisID:         20191029203010
CohortID:           0
OutcomeID:          2
Cohort size:        2000
Covariates:         33801
Population size:    2000
Cases:              149
Creating 50% test and 50% train (into 3 folds) stratified split at 2009-12-25
Data split into 999 test cases and 1001 train samples (335, 335, 331)
Training Random forest model
Removing redundant covariates
Removing redundant covariates took 0.433 secs
Normalizing covariates
Normalizing covariates took 1.15 secs
/usr/local/lib/python3.7/site-packages/sklearn/externals/joblib/__init__.py:15: DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
  warnings.warn(msg, category=DeprecationWarning)
Using Random Forest to select features
population loaded- 2000 rows and 3 columns
Error in py_call_impl(callable, dots$args, dots$keywords): IndexError: arrays used as indices must be of integer (or boolean) type

Detailed traceback:
  File "<string>", line 29, in train_rf
  File "/usr/local/lib/python3.7/site-packages/scipy/sparse/_index.py", line 53, in __getitem__
    return self._get_sliceXarray(row, col)
  File "/usr/local/lib/python3.7/site-packages/scipy/sparse/csc.py", line 222, in _get_sliceXarray
    return self._major_index_fancy(col)._minor_slice(row)
  File "/usr/local/lib/python3.7/site-packages/scipy/sparse/compressed.py", line 690, in _major_index_fancy
    np.cumsum(row_nnz[idx], out=res_indptr[1:])

>
> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

locale:
 [1] LC_CTYPE=en_US.utf8          LC_NUMERIC=C
 [3] LC_TIME=en_US.utf8           LC_COLLATE=en_US.utf8
 [5] LC_MONETARY=en_US.utf8       LC_MESSAGES=en_US.utf8
 [7] LC_PAPER=en_US.utf8          LC_NAME=en_US.utf8
 [9] LC_ADDRESS=en_US.utf8        LC_TELEPHONE=en_US.utf8
[11] LC_MEASUREMENT=en_US.utf8    LC_IDENTIFICATION=en_US.utf8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] PatientLevelPrediction_3.0.6 Cyclops_2.0.2
[3] FeatureExtraction_2.2.5      DatabaseConnector_2.4.1

loaded via a namespace (and not attached):
 [1] reticulate_1.13      tidyselect_0.2.5     ffbase_0.12.7
 [4] purrr_0.3.3          splines_3.3.3        rJava_0.9-11
 [7] lattice_0.20-34      colorspace_1.4-1     vctrs_0.2.0
[10] htmltools_0.4.0      viridisLite_0.3.0    survival_2.40-1
[13] plotly_4.9.0         rlang_0.4.1          pillar_1.4.2
[16] glue_1.3.1           DBI_1.0.0            ParallelLogger_1.1.0
[19] foreach_1.4.7        lifecycle_0.1.0      plyr_1.8.4
[22] munsell_0.5.0        gtable_0.3.0         htmlwidgets_1.5.1
[25] codetools_0.2-15     ff_2.2-14            Rcpp_1.0.2
[28] scales_1.0.0         backports_1.1.5      jsonlite_1.6
[31] bit_1.1-14           fastmatch_1.1-0      ggplot2_3.2.1
[34] digest_0.6.22        dplyr_0.8.3          SqlRender_1.6.3
[37] grid_3.3.3           tools_3.3.3          magrittr_1.5
[40] lazyeval_0.2.2       tibble_2.1.3         crayon_1.3.4
[43] tidyr_1.0.0          pkgconfig_2.0.3      zeallot_0.1.0
[46] MASS_7.3-45          Matrix_1.2-7.1       data.table_1.12.6
[49] assertthat_0.2.1     httr_1.4.1           iterators_1.0.12
[52] R6_2.4.0

Additional context

Here are the installed python libraries.

# pip freeze
absl-py==0.8.1
astor==0.8.0
gast==0.2.2
google-pasta==0.1.7
grpcio==1.24.3
h5py==2.10.0
joblib==0.14.0
Keras==2.3.1
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
Markdown==3.1.1
numpy==1.17.3
opt-einsum==3.1.0
pandas==0.25.2
Pillow==6.2.1
protobuf==3.10.0
pydotplus==2.0.2
pyparsing==2.4.2
python-dateutil==2.8.0
pytz==2019.3
PyYAML==5.1.2
scikit-learn==0.21.3
scipy==1.3.1
six==1.12.0
tensorboard==2.0.0
tensorflow==2.0.0
tensorflow-estimator==2.0.1
termcolor==1.1.0
torch==1.3.0
torchvision==0.4.1
Werkzeug==0.16.0
wrapt==1.11.2

jreps commented 5 years ago

Hi Adam, thanks for letting me know about this. My guess is that I need to cast some index array to integer (it may be a double that was getting automatically cast in older Python code). It should hopefully be a quick fix. I'll let you know once I've made the edit.
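A minimal sketch of the kind of cast described above, using a dense NumPy array instead of the SciPy sparse matrix from the traceback. The helper name `select_columns` and the sample data are hypothetical and are not taken from the PatientLevelPrediction code. It assumes the index values arrive in Python as doubles.

```python
import numpy as np


def select_columns(matrix, column_ids):
    """Select columns, casting float-typed ids to integers first.

    Hypothetical helper for illustration only.
    """
    ids = np.asarray(column_ids)
    if ids.dtype.kind == "f":
        # Refuse to truncate silently: 2.7 is not a valid column id.
        if not np.all(ids == np.floor(ids)):
            raise ValueError("column ids must be whole numbers")
        ids = ids.astype(np.int64)
    return matrix[:, ids]


X = np.arange(12).reshape(3, 4)
float_ids = np.array([0.0, 2.0])  # whole numbers, but stored as doubles

# Indexing with a float array raises the IndexError seen in the log.
try:
    X[:, float_ids]
    raised = False
except IndexError as err:
    raised = True
    print("IndexError:", err)

# After the cast, the same ids select columns 0 and 2.
selected = select_columns(X, float_ids)
print(selected)
```

The whole-number check is a design choice in this sketch: a plain `astype(int)` would also remove the error, but it would quietly truncate any fractional value.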

adamstruck commented 5 years ago

Thanks. Is there a version of Python or PLP that you would recommend in the meantime?

jreps commented 5 years ago

Hi Adam, I've cast all indexes to integer in Python now. Does the latest PLP work for you?

adamstruck commented 5 years ago

I am still getting the same error using version: 30cbef47

jreps commented 5 years ago

How about now?

adamstruck commented 5 years ago

Seems to be working now, thanks!

OHDSI / PatientLevelPrediction

issue creating a random forest model #152