OHDSI / Atlas

ATLAS is an open source software tool for researchers to conduct scientific analyses on standardized observational data
http://atlas-demo.ohdsi.org/
Apache License 2.0

Different cohort result with Atlas generation and with manual SQL execution #2921

Closed by frpozzoni 6 months ago

frpozzoni commented 6 months ago

Expected behavior

When running a cohort generation with a simple cohort definition, I expect to get the same results whether I run it in Atlas or export the SQL code and run it manually on the database.

Actual behavior

This doesn't happen in two of our installations:

The number of patients returned by Atlas is 0, while it is 4 when the cohort is generated manually via SQL (this second result is the expected one).

Steps to reproduce behavior

Import the following cohort definition:

{
  "ConceptSets": [
    {
      "id": 0,
      "name": "Body height",
      "expression": {
        "items": [
          {
            "concept": {
              "CONCEPT_CLASS_ID": "Clinical Observation",
              "CONCEPT_CODE": "8302-2",
              "CONCEPT_ID": 3036277,
              "CONCEPT_NAME": "Body height",
              "DOMAIN_ID": "Measurement",
              "INVALID_REASON": "V",
              "INVALID_REASON_CAPTION": "Valid",
              "STANDARD_CONCEPT": "S",
              "STANDARD_CONCEPT_CAPTION": "Standard",
              "VOCABULARY_ID": "LOINC"
            }
          }
        ]
      }
    }
  ],
  "PrimaryCriteria": {
    "CriteriaList": [
      {
        "Measurement": {
          "CodesetId": 0,
          "ValueAsNumber": {
            "Value": 110,
            "Extent": 230,
            "Op": "bt"
          },
          "Unit": [
            {
              "CONCEPT_CODE": "cm",
              "CONCEPT_ID": 8582,
              "CONCEPT_NAME": "centimeter",
              "DOMAIN_ID": "Unit",
              "INVALID_REASON_CAPTION": "Unknown",
              "STANDARD_CONCEPT_CAPTION": "Unknown",
              "VOCABULARY_ID": "UCUM"
            }
          ]
        }
      }
    ],
    "ObservationWindow": {
      "PriorDays": 0,
      "PostDays": 0
    },
    "PrimaryCriteriaLimit": {
      "Type": "First"
    }
  },
  "QualifiedLimit": {
    "Type": "First"
  },
  "ExpressionLimit": {
    "Type": "First"
  },
  "InclusionRules": [],
  "CensoringCriteria": [],
  "CollapseSettings": {
    "CollapseType": "ERA",
    "EraPad": 0
  },
  "CensorWindow": {},
  "cdmVersionRange": ">=5.0.0"
}

Generate the cohort directly in Atlas, then download the corresponding SQL files from Atlas and execute them manually. Compare the results (a sketch of a comparison query follows below).
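
A minimal comparison sketch, assuming Atlas writes its generation into the results schema's cohort table and that the manually executed SQL was pointed at a cohort table of your own; the schema names, table name, and cohort definition id below are placeholders for your setup:

    -- count of subjects produced by the Atlas generation (placeholder names/ids)
    select count(distinct subject_id)
    from my_results_schema.cohort
    where cohort_definition_id = 1234;  -- the id Atlas assigned to this definition

    -- count of subjects produced by running the exported SQL manually
    select count(distinct subject_id)
    from my_scratch_schema.my_cohort_table
    where cohort_definition_id = 1234;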

Bonus info

If we add an additional constraint on the index event date (e.g. events after 2018-01-01), the Atlas and SQL results are aligned.

The following JSON includes the time constraint and produces matching results with both Atlas and the exported SQL files.

{
  "ConceptSets": [
    {
      "id": 0,
      "name": "Body height",
      "expression": {
        "items": [
          {
            "concept": {
              "CONCEPT_CLASS_ID": "Clinical Observation",
              "CONCEPT_CODE": "8302-2",
              "CONCEPT_ID": 3036277,
              "CONCEPT_NAME": "Body height",
              "DOMAIN_ID": "Measurement",
              "INVALID_REASON": "V",
              "INVALID_REASON_CAPTION": "Valid",
              "STANDARD_CONCEPT": "S",
              "STANDARD_CONCEPT_CAPTION": "Standard",
              "VOCABULARY_ID": "LOINC"
            }
          }
        ]
      }
    }
  ],
  "PrimaryCriteria": {
    "CriteriaList": [
      {
        "Measurement": {
          "CodesetId": 0,
          "OccurrenceStartDate": {
            "Value": "2018-01-01",
            "Op": "gte"
          },
          "ValueAsNumber": {
            "Value": 110,
            "Extent": 230,
            "Op": "bt"
          },
          "Unit": [
            {
              "CONCEPT_CODE": "cm",
              "CONCEPT_ID": 8582,
              "CONCEPT_NAME": "centimeter",
              "DOMAIN_ID": "Unit",
              "INVALID_REASON_CAPTION": "Unknown",
              "STANDARD_CONCEPT_CAPTION": "Unknown",
              "VOCABULARY_ID": "UCUM"
            }
          ]
        }
      }
    ],
    "ObservationWindow": {
      "PriorDays": 0,
      "PostDays": 0
    },
    "PrimaryCriteriaLimit": {
      "Type": "First"
    }
  },
  "QualifiedLimit": {
    "Type": "First"
  },
  "ExpressionLimit": {
    "Type": "First"
  },
  "InclusionRules": [],
  "CensoringCriteria": [],
  "CollapseSettings": {
    "CollapseType": "ERA",
    "EraPad": 0
  },
  "CensorWindow": {},
  "cdmVersionRange": ">=5.0.0"
}
chrisknoll commented 6 months ago

Could be something with caching. Without being able to look at patient-level data or step through the SQL that is executed, it's hard to say for sure (though the exported SQL is exactly the same code that would be used when running the Atlas job).

But you may want to clear the cohort cache by deleting records from the generation_cache table where the type is COHORT:

delete from webapi.generation_cache where type = 'COHORT';
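
A quick sanity check before and after clearing, using only the type column referenced above (webapi here stands for whatever schema holds your WebAPI tables):

    -- how many cached generation entries exist, per type
    select type, count(*) as cached_entries
    from webapi.generation_cache
    group by type;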
frpozzoni commented 6 months ago

That did the trick, thank you @chrisknoll!

Is there a way to configure the cache to expire daily or weekly, ideally timed to coincide with our ETL pipeline executions, or do we need to manage cache clearing within the ETL pipeline itself?

chrisknoll commented 6 months ago

You can add these settings to your settings.xml when you build WebAPI. If you use Docker, I don't know how you pass these settings down to the container:

    <cache.generation.invalidAfterDays>30</cache.generation.invalidAfterDays>
    <cache.generation.cleanupInterval>3600000</cache.generation.cleanupInterval>

But I think I'd manually delete from the table as part of your ETL process, just so you know you're synced up; there is a lag between the cleanup job running and your ETL refresh.

Better yet, I'd avoid re-using sources: WebAPI assumes that source data doesn't change over time. Instead, I'd create a new source each time you run a new ETL (make your source keys represent a version, like _v1, _v2, _v3): give it the same settings as the old one, but with a new source key. You'll see two sources in the app once you add the new one; when you are happy that the ETL looks refreshed, remove the prior version's source.

I'm not sure how often you refresh and whether this would be a burden, but if you are going to stick with using one source and refresh the data under it, you will need to clear the achilles_cache and cdm_cache tables as well.
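
Putting that together, a minimal ETL post-refresh sketch, assuming a single re-used source and that all three cache tables live in the same schema as the other WebAPI tables (adjust the schema name to your installation; note that deleting without further filters clears cached results for every configured source):

    -- run after each ETL refresh so Atlas recomputes against the new data
    delete from webapi.generation_cache where type = 'COHORT';  -- remove the filter to clear other cached generation types too, if any
    delete from webapi.achilles_cache;
    delete from webapi.cdm_cache;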