OHDSI / OhdsiShinyModules

An R package containing Shiny modules used by various OHDSI Shiny apps
https://ohdsi.github.io/OhdsiShinyModules/

improve mapped concept set query #299

Open ablack3 opened 3 months ago

ablack3 commented 3 months ago

The mapped concept set query is very slow on sqlite when the concept set is large, sometimes taking minutes to complete.

https://github.com/OHDSI/OhdsiShinyModules/blob/42f5bb42c731fd9ff1a9c96bb68b624fd357ba78/R/cohort-diagnostics-shared.R#L727

It can be improved by adding indexes and performing the final join in R instead of in the database.

@cebarboza and I were working on it.

Adding indexes drastically sped up this part of the query:

```sql
WITH resolved_concepts_mapped
    AS (
        SELECT concept_sets.concept_id AS resolved_concept_id,
            c1.concept_id,
            c1.concept_name,
            c1.domain_id,
            c1.vocabulary_id,
            c1.concept_class_id,
            c1.standard_concept,
            c1.concept_code
        FROM (
            SELECT DISTINCT concept_id
            FROM @schema.@resolved_concepts
            WHERE database_id IN (@databaseIds)
                AND cohort_id = @cohort_id
            ) concept_sets
        INNER JOIN @schema.@concept_relationship cr ON concept_sets.concept_id = cr.concept_id_2
        INNER JOIN @schema.@concept c1 ON cr.concept_id_1 = c1.concept_id
        WHERE relationship_id = 'Maps to'
            AND standard_concept IS NULL
        )
```

However, the final join was still slow. It is a many-to-many join that multiplies the rows of the concept table (the result from above) by every combination of database, concept set, and cohort.

```sql
SELECT
    c.database_id,
    c.cohort_id,
    c.concept_set_id,
    mapped.*
FROM (SELECT DISTINCT concept_id, database_id, cohort_id, concept_set_id FROM @schema.@resolved_concepts) c
INNER JOIN resolved_concepts_mapped mapped ON c.concept_id = mapped.resolved_concept_id
{@cohort_id != ''} ? { WHERE c.cohort_id = @cohort_id};
```

However, when we downloaded the result of `SELECT DISTINCT concept_id, database_id, cohort_id, concept_set_id FROM @schema.@resolved_concepts` into R and then did the join there, it was very fast.

Perhaps there is a way to speed up the whole query in SQL. For us, adding indexes, splitting the query into two, and doing the final join in R fixed the slowness in the shiny app.
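A minimal sketch of the split-and-join approach using base R `merge()`. The data frames below are illustrative stand-ins for the two query results; the column names follow the SQL above, but the values are made up:

```r
# Illustrative stand-in for the resolved_concepts_mapped CTE result:
# non-standard concepts mapped to each resolved (standard) concept.
mapped <- data.frame(
  resolved_concept_id = c(1L, 1L, 2L),
  mapped_concept_id   = c(10L, 11L, 20L),
  mapped_concept_name = c("source A", "source B", "source C")
)

# Illustrative stand-in for:
# SELECT DISTINCT concept_id, database_id, cohort_id, concept_set_id
# FROM @schema.@resolved_concepts
resolved <- data.frame(
  concept_id     = c(1L, 1L, 2L),
  database_id    = c("db1", "db2", "db1"),
  cohort_id      = 101L,
  concept_set_id = c(1L, 1L, 2L)
)

# The final many-to-many join, done in R instead of in the database
result <- merge(resolved, mapped,
                by.x = "concept_id", by.y = "resolved_concept_id")
nrow(result)  # 5: each mapped row repeats per (database, cohort, set) combo
```

This also makes the row blow-up visible: concept 1 has two mapped rows and appears in two database/cohort/set combinations, so it alone contributes four rows to the result.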

I was also wondering whether the app really needs to do this final join at all, since it duplicates a lot of data. In any case, we are recording our findings in this issue and can make a PR if the approach we came up with seems reasonable.

The proposal would be to optimize the mappedConceptSet function by adding the indexes if they do not already exist (for sqlite only, since I'm not sure index creation works across all dbms), splitting the query into the two parts shown above, and doing the final join in R instead of in SQL.

ablack3 commented 1 month ago

Adding these indexes to the sqlite database improves the speed and user experience of the cohort diagnostics shiny app:

```r
DBI::dbExecute(con, "CREATE INDEX idx_resolved_concepts ON resolved_concepts (concept_id, database_id, cohort_id, concept_set_id);")
DBI::dbExecute(con, "CREATE INDEX idx_concept_relationship ON concept_relationship (concept_id_2, relationship_id);")
DBI::dbExecute(con, "CREATE INDEX idx_concept_main ON concept (concept_id, standard_concept);")
```
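Since the app may open a database that has already been indexed, the DDL could be made idempotent. A small sketch, assuming sqlite (which supports `CREATE INDEX IF NOT EXISTS`); the helper name `makeIndexSql` is hypothetical:

```r
# Hypothetical helper: build idempotent CREATE INDEX statements for the
# sqlite result database. Re-running them on an already-indexed database
# is a no-op because of IF NOT EXISTS.
makeIndexSql <- function(table, columns, indexName = paste0("idx_", table)) {
  sprintf(
    "CREATE INDEX IF NOT EXISTS %s ON %s (%s);",
    indexName, table, paste(columns, collapse = ", ")
  )
}

indexSql <- c(
  makeIndexSql("resolved_concepts",
               c("concept_id", "database_id", "cohort_id", "concept_set_id")),
  makeIndexSql("concept_relationship", c("concept_id_2", "relationship_id")),
  makeIndexSql("concept", c("concept_id", "standard_concept"))
)
# Then, e.g.: for (sql in indexSql) DBI::dbExecute(con, sql)
```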

We could also consider pre-joining this data once when we build the sqlite database, rather than joining in the database while the user is interacting with the app. It would add some columns to the resolved_concepts table and so require more disk space, but it might be better for the app's user experience.
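A sketch of that build-time pre-join, assuming an RSQLite result database. The toy tables stand in for the real ones, and the materialized table name `resolved_concepts_full` is made up for illustration:

```r
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Toy stand-ins for the two tables joined by the slow query
dbWriteTable(con, "resolved_concepts", data.frame(
  concept_id = c(1L, 1L, 2L), database_id = c("db1", "db2", "db1"),
  cohort_id = 101L, concept_set_id = c(1L, 1L, 2L)
))
dbWriteTable(con, "resolved_concepts_mapped", data.frame(
  resolved_concept_id = c(1L, 2L), mapped_concept_id = c(10L, 20L)
))

# Materialize the join once at build time, instead of running it on every
# app interaction
dbExecute(con, "
  CREATE TABLE resolved_concepts_full AS
  SELECT c.database_id, c.cohort_id, c.concept_set_id, m.*
  FROM (SELECT DISTINCT concept_id, database_id, cohort_id, concept_set_id
        FROM resolved_concepts) c
  INNER JOIN resolved_concepts_mapped m
    ON c.concept_id = m.resolved_concept_id;
")
n <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM resolved_concepts_full")$n
dbDisconnect(con)
```

The app would then read `resolved_concepts_full` directly, trading disk space for query time.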