OHDSI / CohortDiagnostics

An R package for performing various cohort diagnostics.
https://ohdsi.github.io/CohortDiagnostics
40 stars 45 forks source link

export createConceptCountsTable() #1067

Open cebarboza opened 11 months ago

cebarboza commented 11 months ago

At Darwin EU we are trying to run CohortDiagnostics in a more efficient way when running it for multiple studies. We believe it would be a good enhancement to export the function createConceptCountsTable() to generate the concept_counts before executing the diagnostics.

By doing this, we can perform this calculation just once for a specific vocabulary_version, instead of repeating this process for each study.

In our fork darwin-eu-dev/CohortDiagnostics, the user can createConceptCountsTable(). This table is saved in the cohortDatabaseSchema.

CohortDiagnostics::createConceptCountsTable(connectionDetails = connectionDetails,
                                            cdmDatabaseSchema = cdmDatabaseSchema,
                                            conceptCountsDatabaseSchema = cohortDatabaseSchema,
                                            conceptCountsTable = "concept_counts",
                                            removeCurrentTable = TRUE)

Then we use the parameter useExternalConceptCountsTable in executeDiagnostics(). If TRUE, executeDiagnostics() uses the concept_counts created previously in the cohortDatabaseSchema. The user should specify the name of the external concept counts table, generally concept_counts

CohortDiagnostics::executeDiagnostics(cohortDefinitionSet =  cohortDefinitionSet,
                                      connectionDetails = connectionDetails,
                                      cohortTable = cohortTable,
                                      cohortDatabaseSchema = cohortDatabaseSchema,
                                      cdmDatabaseSchema = cdmDatabaseSchema,
                                      conceptCountsTable = "concept_counts",
                                      exportFolder = exportFolder,
                                      databaseId = "Eunomia",
                                      minCellCount = 5,
                                      useExternalConceptCountsTable = FALSE)

We also modified the CreateConceptCountTable.sql file, to add a new column with the vocabulary_version.

https://github.com/darwin-eu-dev/CohortDiagnostics/blob/ca6d9074bb097b9ce60b7bc6bf72e68a84f650fe/inst/sql/sql_server/CreateConceptCountTable.sql#L102C1-L106C2

{@table_is_temp} ? {} : { 
ALTER TABLE @work_database_schema.@concept_counts_table
ADD vocabulary_version VARCHAR(20) NULL;
UPDATE @work_database_schema.@concept_counts_table SET vocabulary_version = (SELECT vocabulary_version FROM @cdm_database_schema.vocabulary WHERE vocabulary_id = 'None');
}

Then, there are checks in place that evaluate if the vocabulary_version in the concept_counts table is equal to the version of the database the user is running the diagnostics.

https://github.com/darwin-eu-dev/CohortDiagnostics/blob/ca6d9074bb097b9ce60b7bc6bf72e68a84f650fe/R/RunDiagnostics.R#L679C1-L708C4

 # Defines variables and checks version of external concept counts table -----
  if (useExternalConceptCountsTable == FALSE) {
    conceptCountsTableIsTemp <- TRUE
    if (conceptCountsTable != "#concept_counts") {
      conceptCountsTable <- "#concept_counts"
    }
  } else {
    if (conceptCountsTable == "#concept_counts") {
      stop("Temporary conceptCountsTable name. Please provide a valid external ConceptCountsTable name")
    }
    conceptCountsTableIsTemp <- FALSE
    conceptCountsTable <- conceptCountsTable
    dataSourceInfo <- getCdmDataSourceInformation(connection = connection, cdmDatabaseSchema = cdmDatabaseSchema)
    vocabVersion <- dataSourceInfo$vocabularyVersion
    vocabVersionExternalConceptCountsTable <- DatabaseConnector::renderTranslateQuerySql(
      connection = connection,
      sql = "SELECT DISTINCT vocabulary_version FROM @work_database_schema.@concept_counts_table;",
      work_database_schema = cohortDatabaseSchema,
      concept_counts_table = conceptCountsTable,
      snakeCaseToCamelCase = TRUE,
      tempEmulationSchema = getOption("sqlRenderTempEmulationSchena")
    )
    if (!identical(vocabVersion, vocabVersionExternalConceptCountsTable[1,1])) {
      stop(paste0("External concept counts table (", 
                 vocabVersionExternalConceptCountsTable, 
                 ") does not match database (", 
                 vocabVersion, 
                 "). Update concept_counts with createConceptCountsTable()"))
    }
  }

There's also a vignette to explain how to run this functions UseExternalConceptTable.Rmd. We have been testing this approach but we wanted to discuss this before sending a pull request.

azimov commented 9 months ago

Sorry for the lateness in reply I missed this issue when you first raised it.

This approach seems reasonable, but I wonder if the solution would be better included inside CohortGenerator as an extra step following cohort creation? It seems like this solution might be generally more useful than just for cohort diagnostics - tagging @anthonysena for visibility.

However, I'm happy to look at adopting this change into the main code. Just submit a PR and I will gladly test it out.

cebarboza commented 9 months ago

No worries and thanks for the response. I'll send you the PR as soon as possible.

I would argue that the function createConceptCountsTable() is already part of CohortDiagnostics' workflow.

We also used the variable useExternalConceptCountsTable inside CD, so it was more natural for us to develop this, avoiding any major modification.

katy-sadowski commented 2 months ago

Did anything ever come out of this? If not, I'd be willing to help out.

I'm curious why useExternalConceptCountsTable is no longer supported? Also wondering if we ever considered providing the option to leverage ACHILLES concept counts table here?

azimov commented 2 months ago

Did anything ever come out of this? If not, I'd be willing to help out.

I'm curious why useExternalConceptCountsTable is no longer supported? Also wondering if we ever considered providing the option to leverage ACHILLES concept counts table here?

This was a fork that I never got a PR for. I expect that this could be useful, especially in contexts where the Achilles counts are there. If you're interested in following through I will gladly work to get the PR included.

cebarboza commented 3 weeks ago

Hi @azimov I sent the PR for exporting the function to useExternalConceptCountsTable.

ablack3 commented 3 weeks ago

Also wondering if we ever considered providing the option to leverage ACHILLES concept counts table here?

Good idea @katy-sadowski!

@cebarboza and I started working on the idea to speed up CohortDiagnstics by using the pre-computed concept counts in Achilles.

So instead of building the concepts counts table CohortDiagnostics would have the option to use the data that is already available in Achilles.

Here is a SQL query that would pull the data for

-- note this is postgresql sql

-- Achilles analysis ids used for the concept counts table
-- condition: 400 (persons), 401 (standard concepts), 425 (source concepts)
-- drug: 700 (persons), 701 (standard concepts), 725 (source concepts)
-- procedure: 600 (persons), 601 (standard concepts), 625 (source concepts)
-- measurement: 1800 (persons), 1801 (standard concept), 1825 (source concepts)
-- observation: 800 (persons), 801 (standard concepts), 825 (source concepts)

SELECT
    q1.concept_id, 
    concept_count, 
    concept_subjects
INTO @concept_counts_table
FROM (
    SELECT 
        CAST(stratum_1 as int) as concept_id,
        count_value as concept_count
    FROM results.achilles_results
    WHERE analysis_id IN (401,601,701,801,1801,425,625,725,825,1825) AND stratum_1 != '0'
) q1
LEFT JOIN 
(
    SELECT 
        CAST(stratum_1 as INT) as concept_id,
        count_value as concept_subjects
    FROM results.achilles_results
    WHERE analysis_id IN (400,600,700,800,1800) AND stratum_1 != '0'
) q2
ON q1.concept_id = q2.concept_id

I ran this on IPCI and the CohortDiagnostics sql script on IPCI and compared the results.

image

df is created by the sql script in cohort diagnostics and df2 is coming from the sql pasted above.

Note there is a huge difference in the number of rows. This is likely because a lot of data in IPCI falls outside the observation period. Achilles does not count concepts outside observation period. Cohort diagnostics does include concepts outside observation period.

Another difference is the inclusion of concept id 0. this concept id occurs in multiple cdm tables and so is repeated in the concept table created by cohort diagnostics. I removed these rows. In the table created by cohort diagnostics there is no way to tell which domain the rows with concept id 0 came from.

image

@azimov, @katy-sadowski what do you think about adding achilles as an option for getting concept counts in cohort diagnostics?

Also what do you think about making the counts consistent by excluding concepts outside observation period and removing concept id 0?

ablack3 commented 3 weeks ago

I created a pr for use of the achilles tables. Not tested yet but wanted to get some feedback on the approach first. https://github.com/OHDSI/CohortDiagnostics/pull/1120