Atlas Estimation: problems with excluding concepts in propensity scoring

OHDSI / Atlas

ATLAS is an open source software tool for researchers to conduct scientific analyses on standardized observational data

http://atlas-demo.ohdsi.org/

Apache License 2.0

265 stars 131 forks source link

Atlas Estimation: problems with excluding concepts in propensity scoring #263

Closed ChristopheLambert closed 7 years ago

ChristopheLambert commented 7 years ago

I have two issues with excluding concepts:

1) The user interface does not seem to allow one to add more than a single excluded concept. The place where you can enter a single concept falls beneath the heading:

What concepts do you want to exclude from baseline covariates in the propensity score model? (Leave blank if you want to include everything)

At minimum we need to add the treatment and comparison drug concepts to exclude from the covariates for the propensity scoring. Is there a way to do this I am missing?

2) When I add one of the drugs using the interface above, the R code generated does not work. It passes a list of codes to a SQL query, that gives a SQL error, because a list of codes is not valid SQL. Here is the code emitted:

    # Get all Drug Lithium Concept IDs for exclusion:
    sql <- paste("19033464,40000557,40136946,751365,19101073,40000565,751247,19110551,19125230,751368,40000558,751369,19107689,19103182,19118197,751362,751364,19045199,40000564,751282,751291,751251,19033465,751370,19125229,19045707,40000560,19033437,19088232,40000524,751363,751367,19107688,751284,19064668,19019111,40000563,751249,19033439,19045709,751283,751246,19033438,40000556,40000562,19000962,40000561,751250,19061260")
    sql <- SqlRender::renderSql(sql, cdm_database_schema = cdmDatabaseSchema)$sql
    sql <- SqlRender::translateSql(sql, targetDialect = connectionDetails$dbms)$sql
    connection <- connect(connectionDetails)
    excludedConcepts <- querySql(connection, sql)
    excludedConcepts <- excludedConcepts$CONCEPT_ID

The code that correctly works should be much simpler -- simply create a list of excluded concepts to pass to the createCovariateSettings code:

excludedConcepts <- c(19033464,40000557,40136946,751365,19101073,40000565,751247,19110551,19125230,751368,40000558,751369,19107689,19103182,19118197,751362,751364,19045199,40000564,751282,751291,751251,19033465,751370,19125229,19045707,40000560,19033437,19088232,40000524,751363,751367,19107688,751284,19064668,19019111,40000563,751249,19033439,19045709,751283,751246,19033438,40000556,40000562,19000962,40000561,751250,19061260)

chrisknoll commented 7 years ago

Hi, yes that's a bug, it supposed to put the concept set sql in that query, but instead it's a comma delimited list of IDs.

anthonysena commented 7 years ago

Hi Christophe: yes, this appears to be a bug related to the following WebAPI service which should return SQL to obtain the concepts that make up the concept set you've selected in the estimation specification:

http:///WebAPI//vocabulary/conceptSetExpressionSQL

I was looking at an estimation project on OHDSI.org to see if I could replicate this bug: http://www.ohdsi.org/web/atlas/#/estimation/3112. It appears that the SQL is being returned properly as shown here:

    # Get all AEDs Concept IDs for inclusion:
    sql <- paste("select distinct I.concept_id FROM
( 
  select concept_id from @cdm_database_schema.CONCEPT where concept_id in (19018520,40239995,795661,797399,19087394,705103,711584,42904177,19023286,40102084,734354,19000921,742267,744798)and invalid_reason is null

) I
")
    sql <- SqlRender::renderSql(sql, cdm_database_schema = cdmDatabaseSchema)$sql
    sql <- SqlRender::translateSql(sql, targetDialect = connectionDetails$dbms)$sql
    connection <- connect(connectionDetails)
    includedConcepts <- querySql(connection, sql)
    includedConcepts <- includedConcepts$CONCEPT_ID

Can you see if the web service above is being called when the estimation project is loading? You can see this in the Chrome Developer tools (under network and filter to XHR requests).

To your other point: yes, constructing a data frame with all of the concept IDs would be a much simpler approach and it is one that I had initially used when developing that module. However, that approach breaks down when you have a concept set that contains a large number of concepts (I believe > 10,000). When I attempted to construct a data frame with that many concepts, R would hang. So, I changed this to execute the SQL for the concept set since the length of the SQL command would be something that should be relatively short and would allow for returning a larger number of concepts when needed.

ChristopheLambert commented 7 years ago

Hi Anthony: conceptSetExpressionSQL is listed as having run.

Due to another issue, I installed a newer version of cohortMethod today, and the R code generated now seems correct:

    # Get all Drug Lithium Concept IDs for exclusion:
    sql <- paste("select distinct I.concept_id FROM
( 
  select concept_id from @cdm_database_schema.CONCEPT where concept_id in (751246)and invalid_reason is null
UNION  select c.concept_id
  from @cdm_database_schema.CONCEPT c
  join @cdm_database_schema.CONCEPT_ANCESTOR ca on c.concept_id = ca.descendant_concept_id
  and ca.ancestor_concept_id in (751246)
  and c.invalid_reason is null

) I
")
    sql <- SqlRender::renderSql(sql, cdm_database_schema = cdmDatabaseSchema)$sql
    sql <- SqlRender::translateSql(sql, targetDialect = connectionDetails$dbms)$sql
    connection <- connect(connectionDetails)
    excludedConcepts <- querySql(connection, sql)
    excludedConcepts <- excludedConcepts$CONCEPT_ID

Atlas doesn't use R, does it? I'm confused how updating some R code could change this, or whether it is due to something else. I did not update WebAPI.

Any response to my question on how to specify multiple variables to exclude from the propensity score model using the Atlas user interface?

chrisknoll commented 7 years ago

Something must have changed, I saw this exact problem when I was doing my own estimation project with an excluded concept set, and instead of getting a SQL for the concept set, it returned the included concepts. I can't really parse out the commit history under atlas tho to really understand what change might have gone in to fix that, so unless someone can claim responsibility for the fix, it's a mystery to me.

chrisknoll commented 7 years ago

So digging into this further, I see some oddly conflicting code. In line 705 of cohort-comparisonmanager.js, i see:

vocabularyAPI.resolveConceptSetExpression(csExpression).then(
function (data) {
    self.targetConceptIds(data);
});

note the self.targetConceptIds() assignment to the results of resolveConceptSetExpression()

Elsewhere, i see this on line 786 of the same file:

self.choosePsExclusion = function () {
    $('cohort-comparison-manager #modalConceptSet').modal('show');
    self.targetId = self.cohortComparison().psExclusionId;
    self.targetCaption = self.cohortComparison().psExclusionCaption;
    self.targetExpression = self.cohortComparison().psExclusionConceptSet;
    self.targetConceptIds = self.cohortComparison().psExclusionConceptSetSQL;
}

Here we have self.targetConceptIds being assigned to cohortComparison's psExclusionConceptSetSQL. So is there a path through the code where resolveConceptSetExpression is invoked later than the getConceptSetSQL? This could just be a race condition of some sort, but it seems odd to me that sometimes targetCocneptIDs is a list of conceptIDs and other times it's the psExclusionConceptSetSQL.

...AH HA! I got it:

Steps to reproduce:

Open up an estimation with a concept set set in the PS exclusion
Remove that concept set from PS Exclusion
Add the concept set back as the PS Exclusion.

Expected result: the concept set expression SQL should be in the generated R code.

Actual Results: the concept set expression sql is rendered as the included concepts.

So this is a bug but only happens when you first assign a concept set. It seems if you save it and reload it, it will resolve the concept set expression to the SQL and not the included concepts.

t-abdul-basser commented 7 years ago

Good catch!

Get Outlook for Androidhttps://aka.ms/ghei36

On Wed, Oct 19, 2016 at 9:11 PM -0400, "Chris Knoll" notifications@github.com<mailto:notifications@github.com> wrote:

So digging into this further, I see some oddly conflicting code. In line 705 of cohort-comparisonmanager.js, i see:

vocabularyAPI.resolveConceptSetExpression(csExpression).then( function (data) { self.targetConceptIds(data); });

note the self.targetConceptIds() assignment to the results of resolveConceptSetExpression()

Elsewhere, i see this on line 786 of the same file:

self.choosePsExclusion = function () { $('cohort-comparison-manager #modalConceptSet').modal('show'); self.targetId = self.cohortComparison().psExclusionId; self.targetCaption = self.cohortComparison().psExclusionCaption; self.targetExpression = self.cohortComparison().psExclusionConceptSet; self.targetConceptIds = self.cohortComparison().psExclusionConceptSetSQL; }

...AH HA! I got it:

Steps to reproduce:

Open up an estimation with a concept set set in the PS exclusion
Remove that concept set from PS Exclusion
Add the concept set back as the PS Exclusion.

Expected result: the concept set expression SQL should be in the generated R code.

Actual Results: the concept set expression sql is rendered as the included concepts.

So this is a bug but only happens when you first assign a concept set. It seems if you save it and reload it, it will resolve the concept set expression to the SQL and not the included concepts.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/OHDSI/Atlas/issues/263#issuecomment-254983874, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AIbcsOeav5dwgYtaEh6scj6hYkNY6Cthks5q1r-8gaJpZM4KaawG.

anthonysena commented 7 years ago

Thanks @chrisknoll for tracking down that bug. I'll work to get that fixed.

@ChristopheLambert - I think the behavior you saw in the R code is directly related to the steps that Chris mentioned earlier. To your question about how to specify multiple variables to exclude from the propensity score model: this is being done when you specify a concept set under the heading:

What concepts do you want to exclude from baseline covariates in the propensity score model? (Leave blank if you want to include everything)

If you look at the resulting code, you'll see the creation of the excludedConcepts vector containing the IDs to exclude. From your example:

 # Get all Drug Lithium Concept IDs for exclusion:
    sql <- paste("select distinct I.concept_id FROM
( 
  select concept_id from @cdm_database_schema.CONCEPT where concept_id in (751246)and invalid_reason is null
UNION  select c.concept_id
  from @cdm_database_schema.CONCEPT c
  join @cdm_database_schema.CONCEPT_ANCESTOR ca on c.concept_id = ca.descendant_concept_id
  and ca.ancestor_concept_id in (751246)
  and c.invalid_reason is null

) I
")
    sql <- SqlRender::renderSql(sql, cdm_database_schema = cdmDatabaseSchema)$sql
    sql <- SqlRender::translateSql(sql, targetDialect = connectionDetails$dbms)$sql
    connection <- connect(connectionDetails)
    excludedConcepts <- querySql(connection, sql)
    excludedConcepts <- excludedConcepts$CONCEPT_ID

Then a little further along in the R code, you will see the following:


# Define which types of covariates must be constructed:
    covariateSettings <- createCovariateSettings(.......<settings removed for brevity>
                                         excludedCovariateConceptIds = excludedConcepts,
                                         includedCovariateConceptIds = includedConcepts,
                                         deleteCovariatesSmallCount = 100)

The covariateSettings are then used to retrieve and calculate the baseline covariates. As shown above, we've told CohortMethod to exclude the covariates that are specified in the excludedConcepts vector. As a result, they will not be included in the propensity score model. Does this help to answer your question?

ChristopheLambert commented 7 years ago

Thanks, @anthonysena -- I realize that we can munge a bunch of terms together under a new concept set to create the excluded concepts for propensity scoring, but this seems to violate the concept of creating modular sets of concepts and reusing them. For each pair of drugs I compare, I would have to create individual concept sets for drug1 and drug2 for use in defining my cohorts, plus a combined set of drug1+drug2 in order to add them to the exclusion list. I propose a feature that you instead allow multi-selection of as many concept sets as you like. Often you also want to exclude the outcome set as well in propensity scoring, requiring a drug1+drug2+outcome concept set. It starts to get messy.

anthonysena commented 7 years ago

That makes good sense @ChristopheLambert - I've opened a new issue #269 to handle this use case. I'll keep this issue open so that I can fix the bug that @chrisknoll was able to reproduce.