OHDSI / FeatureExtraction

An R package for generating features (covariates) for a cohort using data in the Common Data Model.
http://ohdsi.github.io/FeatureExtraction/
61 stars 60 forks source link

agregating data in temporal-covarate-settings produces NA timeId in covaraitesContinuous #114

Open javier-gracia-tabuenca-tuni opened 3 years ago

javier-gracia-tabuenca-tuni commented 3 years ago
tcovset <- createTemporalCovariateSettings(
   useVisitCconceptCount = TRUE, 
   temporalStartDays = c(1, 365), 
   temporalEndDays = c(365, 365*2),
)

tempCov <- getDbCovariateData(
   ...
   ...
   covariateSettings = tcovset, 
   aggregated = TRUE
)

tempCov$covariateContinuous %>% view()

Column timeId is all NA

I can't find why this happens. I came to find this error when using CohortDiagnostics and I trying to get the temporal characterization of visit counts.

anthonysena commented 3 years ago

Hi @javier-gracia-tabuenca-tuni - can you link me to where in CohortDiagnostics where you found this code? Upon first glance, the parameter useVisitCconceptCount appears to be malformed (but might just be a typo here).

javier-gracia-tabuenca-tuni commented 3 years ago

Sorry, this code is not in CohortDiagnostics.

I'm trying to run a custom cohort diagnosis which shows in Temporal Characterisation the count of visits in each time period.

Here is some code to reproduce the error:


title: "R Notebook" output: html_notebook

library(tidyverse)
Sys.setenv(JAVA_HOME='C:\\Program Files\\AdoptOpenJDK\\jdk-8.0.232.09-hotspot\\')
library(SqlRender)
library(Eunomia)
library(CohortDiagnostics)

working_folder <- getwd()
connectionDetails <- getEunomiaConnectionDetails()
connection <- connect(connectionDetails)

cdmDatabaseSchema <- "main"
oracleTempSchema <- NULL
cohortDatabaseSchema <- "main"
cohortTable <- "tmp_cohort_table"
cohortSetReference <- tibble(
  atlasId = 1776012  , 
  atlasName = "Asthma",
  cohortId = 1776012  ,
  name = "Asthma"
)
createCohortTable(connectionDetails = connectionDetails,
                  cohortDatabaseSchema = cohortDatabaseSchema,
                  cohortTable = cohortTable)
baseUrl <- "http://api.ohdsi.org:80/WebAPI"
inclusionStatisticsFolder <- file.path(working_folder, "incStats")

instantiateCohortSet(connectionDetails = connectionDetails,
                     cdmDatabaseSchema = cdmDatabaseSchema,
                     oracleTempSchema = oracleTempSchema,
                     cohortDatabaseSchema = cohortDatabaseSchema,
                     cohortTable = cohortTable,
                     baseUrl = baseUrl,
                     cohortSetReference = cohortSetReference,
                     generateInclusionStats = TRUE,
                     inclusionStatisticsFolder = inclusionStatisticsFolder)

my_tempCovaraitesSettings <- createTemporalCovariateSettings(
   useVisitConceptCount = TRUE, 
   temporalStartDays = c(-365, 0   , 30*2+1, 365*1+1), 
   temporalEndDays =   c(  -1, 30*2,  365*1, 365*2+1)
)
databaseId <- "testDB"
exportFolder <- file.path(working_folder, "export")
runCohortDiagnostics(baseUrl = baseUrl,
                     cohortSetReference = cohortSetReference,
                     #
                     connectionDetails = connectionDetails,
                     cdmDatabaseSchema = cdmDatabaseSchema,
                     oracleTempSchema = oracleTempSchema,
                     cohortDatabaseSchema = cohortDatabaseSchema, 
                     cohortTable = cohortTable,
                     #
                     databaseId = databaseId,
                     databaseName = databaseId,
                     databaseDescription = databaseId, 
                     #
                     inclusionStatisticsFolder = inclusionStatisticsFolder,
                     exportFolder = exportFolder,
                     #
                     runInclusionStatistics = TRUE,
                     runIncludedSourceConcepts = TRUE,
                     runOrphanConcepts = TRUE,
                     runTimeDistributions = TRUE,
                     runBreakdownIndexEvents = TRUE,
                     runIncidenceRate = TRUE,
                     runCohortOverlap = TRUE,
                     runCohortCharacterization = TRUE,
                     minCellCount = 5, 
                     #
                     runTemporalCohortCharacterization = TRUE,
                     temporalCovariateSettings = my_tempCovaraitesSettings
                     )
read_csv("./export/temporal_time_ref.csv")
read_csv("./export/temporal_covariate_value.csv")
preMergeDiagnosticsFiles(exportFolder)
launchDiagnosticsExplorer(exportFolder)

image

javier-gracia-tabuenca-tuni commented 3 years ago

However, I traced the error and it seems to originate in FeatureExtraction. For this reason I posted it here.

Here is a simpler code to reproduce the error in FeatureExtraction:


title: "R Notebook" output: html_notebook

library(tidyverse)
Sys.setenv(JAVA_HOME='C:\\Program Files\\AdoptOpenJDK\\jdk-8.0.232.09-hotspot\\')
library(SqlRender)
library(Eunomia)
library(FeatureExtraction)

working_folder <- getwd()
connectionDetails <- getEunomiaConnectionDetails()
connection <- connect(connectionDetails)

cdmDatabaseSchema <- "main"
oracleTempSchema <- NULL
cohortDatabaseSchema <- "main"
cohortTable <- "tmp_cohort_table"
cohortSetReference <- tibble(
  atlasId = 1776012  , 
  atlasName = "Asthma",
  cohortId = 1776012  ,
  name = "Asthma"
)
createCohortTable(connectionDetails = connectionDetails,
                  cohortDatabaseSchema = cohortDatabaseSchema,
                  cohortTable = cohortTable)
baseUrl <- "http://api.ohdsi.org:80/WebAPI"
inclusionStatisticsFolder <- "~/incStats"

instantiateCohortSet(connectionDetails = connectionDetails,
                     cdmDatabaseSchema = cdmDatabaseSchema,
                     oracleTempSchema = oracleTempSchema,
                     cohortDatabaseSchema = cohortDatabaseSchema,
                     cohortTable = cohortTable,
                     baseUrl = baseUrl,
                     cohortSetReference = cohortSetReference,
                     generateInclusionStats = TRUE,
                     inclusionStatisticsFolder = inclusionStatisticsFolder)
my_tempCovaraitesSettings <- createTemporalCovariateSettings(
   useVisitConceptCount = TRUE, 
   temporalStartDays = c(-365, 0   , 30*2+1, 365*1+1), 
   temporalEndDays =   c(  -1, 30*2,  365*1, 365*2+1)
)

tempCov <- getDbCovariateData(
  connectionDetails = connectionDetails,
  oracleTempSchema = oracleTempSchema,
  cdmDatabaseSchema = cdmDatabaseSchema,
  cohortTable = cohortTable,
  cohortDatabaseSchema = cdmDatabaseSchema,
  cohortTableIsTemp = FALSE,
  cohortId = 1776012,
  rowIdField = "subject_id",
  covariateSettings = my_tempCovaraitesSettings,
  aggregated = TRUE
)
tempCov$covariatesContinuous

image

Should be one value for each time_id not only one NA

anthonysena commented 3 years ago

Hi @javier-gracia-tabuenca-tuni - thanks for the code to help reproduce this problem! I've taken your code and made some changes changed it since there were some functions you referenced that are missing and I don't have the Asthma cohort referenced. So, I constructed a very basic cohort to use anyone in the Eunomia data set that has an observation_period entry to keep things simple. Below is the code:

library(SqlRender)
library(Eunomia)
#> Loading required package: DatabaseConnector
library(FeatureExtraction)
#> Loading required package: Andromeda
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

connectionDetails <- getEunomiaConnectionDetails()
connection <- connect(connectionDetails)
#> Connecting using SQLite driver

cdmDatabaseSchema <- "main"
oracleTempSchema <- NULL
cohortDatabaseSchema <- "main"
cohortTable <- "cohort"

# Construct a cohort of people
cohortMembers <- querySql(connection, "SELECT 1 cohort_definition_id, 
                                              person_id subject_id, 
                                              observation_period_start_date cohort_start_date, 
                                              observation_period_end_date cohort_end_date 
                                       FROM observation_period")

# Insert into the cohort table
DatabaseConnector::insertTable(connection = connection,
                               tableName = cohortTable,
                               data = cohortMembers,
                               dropTableIfExists = FALSE,
                               createTable = FALSE)

# Characterize visit counts over the following time windows
tempCovaraitesSettings <- createTemporalCovariateSettings(
  useVisitConceptCount = TRUE, 
  temporalStartDays = c(0,31,61,91), 
  temporalEndDays =   c(30,60,90,365*2+1)
)

tempCov <- getDbCovariateData(
  connectionDetails = connectionDetails,
  oracleTempSchema = oracleTempSchema,
  cdmDatabaseSchema = cdmDatabaseSchema,
  cohortTable = cohortTable,
  cohortDatabaseSchema = cdmDatabaseSchema,
  cohortTableIsTemp = FALSE,
  cohortId = 1,
  rowIdField = "subject_id",
  covariateSettings = tempCovaraitesSettings,
  aggregated = TRUE
)
#> Connecting using SQLite driver
#> Sending temp tables to server
#> Constructing features on server
#>   |                                                                              |                                                                      |   0%  |                                                                              |==                                                                    |   4%  |                                                                              |=====                                                                 |   7%  |                                                                              |========                                                              |  11%  |                                                                              |==========                                                            |  14%  |                                                                              |============                                                          |  18%  |                                                                              |===============                                                       |  21%  |                                                                              |==================                                                    |  25%  |                                                                              |====================                                                  |  29%  |                                                                              |======================                                                |  32%  |                                                                              |=========================                                             |  36%  |                                                                              |============================                                          |  39%  |                                                                              |==============================                                        |  43%  |                                                                              |================================                                      |  46%  |                                                                              |===================================                                   |  50%  |                                                                              |======================================                                |  54%  |                                                                              |========================================                              |  57%  |                                                                              |==========================================                            |  61%  |                                                                              |=============================================                         |  64%  |                                                                              |================================================                      |  68%  |                                                                              |==================================================                    |  71%  |                                                                              |====================================================                  |  75%  |                                                                              |=======================================================               |  79%  |                                                                              |==========================================================            |  82%  |                                                                              |============================================================          |  86%  |                                                                              |==============================================================        |  89%  |                                                                              |=================================================================     |  93%  |                                                                              |====================================================================  |  96%  |                                                                              |======================================================================| 100%
#> Executing SQL took 0.0273 secs
#> Fetching data from server
#> Fetching data took 0.125 secs

data.frame(tempCov$timeRef)
#>   timeId startDay endDay
#> 1      1        0     30
#> 2      2       31     60
#> 3      3       61     90
#> 4      4       91    731
data.frame(tempCov$covariatesContinuous)
#>   cohortDefinitionId covariateId countValue minValue maxValue averageValue
#> 1                  1     9201911         15        0        1  0.002807412
#>   standardDeviation medianValue p10Value p25Value p75Value p90Value timeId
#> 1                 0           0        0        0        0        0     NA

Created on 2021-01-12 by the reprex package (v0.3.0)

As shown, the temporal analysis attempts to use 4 different time periods but the resulting continuous covariate result does not use the timeId as you originally reported. So, I dug a bit deeper to try and find the issue and tracked it down to the ConceptCount.sql file. What I found is that some of the SQL operations in this file do not use the time_id column throughout the analysis for temporal analysis. Additionally, I found the following code in that script that seems to imply that we do not want to consider the time_id as part of the results:

https://github.com/OHDSI/FeatureExtraction/blob/da48fda9f27cb3c8fa34cc3f91ee741071dc51bb/inst/sql/sql_server/ConceptCounts.sql#L177-L179

@schuemie - tagging you here to get your input on this particular code to ask: is this a bug or a decision to not include the temporal time identifiers as part of the results when constructing these types of covariates? Or does this perspective change at all when doing this at a patient level vs aggregating the results?

javier-gracia-tabuenca-tuni commented 3 years ago

Thanks Anthon. The asthma cohort was in atlas-demo.ohdsi.org, but yeap, using query to build cohort much better for reproducing. I'll do it like that next time.

I see you dug deeper than me. Let's wait what @schuemie has to say.