OHDSI / FeatureExtraction

An R package for generating features (covariates) for a cohort using data in the Common Data Model.
http://ohdsi.github.io/FeatureExtraction/
60 stars 58 forks source link

Different covariates (corresponding to analysisId=410 and analysisId=414) are assigned the same covariateName by FeatureExtraction #219

Closed conovermitch closed 8 months ago

conovermitch commented 10 months ago

See screen grab below for a bug where FeatureExtraction seems to be producing two distinct covariates that have two different results but have the same value for covariateName.

image

analysisId=410: https://github.com/OHDSI/FeatureExtraction/blob/main/inst/csv/PrespecAnalyses.csv#L51

analysisId=414: https://github.com/OHDSI/FeatureExtraction/blob/main/inst/csv/PrespecAnalyses.csv#L55

The SQL that constructs the covariate and creates the covariateName is here: https://github.com/OHDSI/FeatureExtraction/blob/main/inst/sql/sql_server/DomainConceptGroup.sql

@jamieweaver and I were reviewing a Shiny app that showed the same covariate name with different values and after hunting for the problem we found the issue above.

Version of FeaturExtraction being run: FeatureExtraction v3.3.1

anthonysena commented 10 months ago

Thanks for reporting this @conovermitch and @jamieweaver. I can confirm this is confusing and put together a small reproducible example to further illustrate the problem:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
connectionDetails <- Eunomia::getEunomiaConnectionDetails()
Eunomia::createCohorts(connectionDetails)
#> Connecting using SQLite driver
#> Creating cohort: Celecoxib
#>   |                                                                              |                                                                      |   0%  |                                                                              |======================================================================| 100%
#> Executing SQL took 0.0152 secs
#> Creating cohort: Diclofenac
#>   |                                                                              |                                                                      |   0%  |                                                                              |======================================================================| 100%
#> Executing SQL took 0.0182 secs
#> Creating cohort: GiBleed
#>   |                                                                              |                                                                      |   0%  |                                                                              |======================================================================| 100%
#> Executing SQL took 0.0307 secs
#> Creating cohort: NSAIDs
#>   |                                                                              |                                                                      |   0%  |                                                                              |======================================================================| 100%
#> Executing SQL took 0.0835 secs
#> Cohorts created in table main.cohort
#>   cohortId       name
#> 1        1  Celecoxib
#> 2        2 Diclofenac
#> 3        3    GiBleed
#> 4        4     NSAIDs
#>                                                                                        description
#> 1    A simplified cohort definition for new users of celecoxib, designed specifically for Eunomia.
#> 2    A simplified cohort definition for new users ofdiclofenac, designed specifically for Eunomia.
#> 3 A simplified cohort definition for gastrointestinal bleeding, designed specifically for Eunomia.
#> 4       A simplified cohort definition for new users of NSAIDs, designed specifically for Eunomia.
#>   count
#> 1  1844
#> 2   850
#> 3   479
#> 4  2694

covSettings <- FeatureExtraction::createCovariateSettings(
  useDrugGroupEraLongTerm = TRUE,
  useDrugGroupEraStartLongTerm = TRUE
)

covariateData <- FeatureExtraction::getDbCovariateData(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = "main",
  covariateSettings = covSettings,
  aggregated = TRUE
)
#> Connecting using SQLite driver
#> Currently in a tryCatch or withCallingHandlers block, so unable to add global calling handlers. ParallelLogger will not capture R messages, errors, and warnings, only explicit calls to ParallelLogger. (This message will not be shown again this R session)
#> Constructing features on server
#>   |                                                                              |                                                                      |   0%  |                                                                              |===                                                                   |   5%  |                                                                              |======                                                                |   9%  |                                                                              |==========                                                            |  14%  |                                                                              |=============                                                         |  18%  |                                                                              |================                                                      |  23%  |                                                                              |===================                                                   |  27%  |                                                                              |======================                                                |  32%  |                                                                              |=========================                                             |  36%  |                                                                              |=============================                                         |  41%  |                                                                              |================================                                      |  45%  |                                                                              |===================================                                   |  50%  |                                                                              |======================================                                |  55%  |                                                                              |=========================================                             |  59%  |                                                                              |=============================================                         |  64%  |                                                                              |================================================                      |  68%  |                                                                              |===================================================                   |  73%  |                                                                              |======================================================                |  77%  |                                                                              |=========================================================             |  82%  |                                                                              |============================================================          |  86%  |                                                                              |================================================================      |  91%  |                                                                              |===================================================================   |  95%  |                                                                              |======================================================================| 100%
#> Executing SQL took 0.543 secs
#> Fetching data from server
#> Fetching data took 0.229 secs
<sup>Created on 2023-12-14 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>

Note the names for analysis 410 (DrugGroupEraLongTerm) are the same as 414 (DrugGroupEraStartLongTerm)

> covariateData$covariateRef %>%
+   filter(analysisId == 410) %>%
+   select(covariateId, covariateName)
# Source:   SQL [?? x 2]
# Database: sqlite 3.41.2 [C:\Users\asena5\AppData\Local\Temp\1\RtmpeexOFI\file93483a666678.sqlite]
   covariateId covariateName                                                                   
         <dbl> <chr>                                                                           
 1   708298410 drug_era group during day -365 through 0 days relative to index: Midazolam      
 2   723013410 drug_era group during day -365 through 0 days relative to index: Diazepam       
 3   738818410 drug_era group during day -365 through 0 days relative to index: Doxylamine     
 4   740275410 drug_era group during day -365 through 0 days relative to index: Carbamazepine  
 5   753626410 drug_era group during day -365 through 0 days relative to index: Propofol       
 6   782043410 drug_era group during day -365 through 0 days relative to index: Isoflurane     
 7   914335410 drug_era group during day -365 through 0 days relative to index: Atropine       
 8   920293410 drug_era group during day -365 through 0 days relative to index: Nitrofurantoin 
 9   933724410 drug_era group during day -365 through 0 days relative to index: Phenazopyridine
10   967823410 drug_era group during day -365 through 0 days relative to index: Sodium Chloride
# ℹ more rows
# ℹ Use `print(n = ...)` to see more rows
> covariateData$covariateRef %>%
+   filter(analysisId == 414) %>%
+   select(covariateId, covariateName)
# Source:   SQL [?? x 2]
# Database: sqlite 3.41.2 [C:\Users\asena5\AppData\Local\Temp\1\RtmpeexOFI\file93483a666678.sqlite]
   covariateId covariateName                                                                   
         <dbl> <chr>                                                                           
 1   708298414 drug_era group during day -365 through 0 days relative to index: Midazolam      
 2   723013414 drug_era group during day -365 through 0 days relative to index: Diazepam       
 3   738818414 drug_era group during day -365 through 0 days relative to index: Doxylamine     
 4   753626414 drug_era group during day -365 through 0 days relative to index: Propofol       
 5   782043414 drug_era group during day -365 through 0 days relative to index: Isoflurane     
 6   914335414 drug_era group during day -365 through 0 days relative to index: Atropine       
 7   920293414 drug_era group during day -365 through 0 days relative to index: Nitrofurantoin 
 8   933724414 drug_era group during day -365 through 0 days relative to index: Phenazopyridine
 9   967823414 drug_era group during day -365 through 0 days relative to index: Sodium Chloride
10   975125414 drug_era group during day -365 through 0 days relative to index: Hydrocortisone 
# ℹ more rows
# ℹ Use `print(n = ...)` to see more rows
ginberg commented 10 months ago

Hi @conovermitch, I have discussed a solution with @anthonysena We propose to add the (unique) analysisName to the covariate name in brackets. For 410 the covariate name will be: drug_era group (DrugGroupEraLongTerm) during day -365 through 0 days relative to index: Midazolam

and for 414 it will be: drug_era group (DrugGroupEraStartLongTerm) during day -365 through 0 days relative to index: Midazolam how does that sound to you?