Issues with IR person counts and cases output

rkboyce commented 1 year ago

Expected behavior

For the purposes of explaining our issues, I will be referring to one specific cohort group, which includes the first incidence of each individual over 65 years of age who received a prescription for a sulfonylurea medication.

Our dataset contains EHR data for 646,785 individuals. We have developed a number of cohort definitions intended to capture cases of individuals receiving inappropriate medications based on age and a number of other variable factors.

For our incidence rates, each specific drug cohort represents an outcome cohort, with the target cohort in every case being the first incidence of any EHR prescription per individual. For example, the sulfonylurea cohort definition suggests that we should see no more than 11,717 cases of first sulfonylurea exposure; the target cohort number should be 333,442 individuals.

Actual behavior

When we run the incidence rates, we see person counts approaching, or in some cases, greater than 1 million. As our target cohort only contains 333,442 individuals, that should be the person count for each outcome cohort. However, it varies with each outcome cohort; for sulfonylureas, the person count is 1,198,464, which is impossible considering the total number of individuals in our dataset.

As far as cases are concerned, the total case numbers do not reflect what we see from the cohort definitions, either. For sulfonylureas, the IR output states there are 13,776 cases, while the cohort definition reflects only 11,717.

These issues also appear in the .tsv output as well.

We think this is a possible bug because we are running Atlas and WebAPI 2.13. We think it might be related to attempts to do fixes as mentioned in tickets github.com/https://github.com/OHDSI/WebAPI/pull/2180 and https://github.com/OHDSI/WebAPI/issues/2009

thanks.

chrisknoll commented 1 year ago

Thanks for the report, this is very strange, usually we see numbers less than the cohort count (because prior outcomes exclude people from analysis). I don't think the PR you reference here is the culprit because it limits the Target population (ie: the time at risk) to 1 event per person).

What I'd like to ask is that if you could export your IR design as JSON and attach it here. I can import it into my env and extract the analysis SQL and post it back here so that you can try to execute it directly on your DB. It should reveal how we are finding more people with cases than in your T.

rkboyce commented 1 year ago

Thank you for taking time to look into this. Here is the JSON from the ir_analysis_details table - please let me know if this is what you needed:

ir_analysis_43_202307131013.txt

chrisknoll commented 1 year ago

@rkboyce , i was expecting a simple analysis that demonstrates the issue, but the analysis you sent me is about 7MB in size...could you please narrow it down to the smallest possible design that demonstrates the issue?

In addition, the file you sent me starts with invalid JSON. The file attached starts with:

{ConceptSets":[{"id":0

But, it should be

{"ConceptSets":[{"id":0

is that what came out of the atlas UI?

I did a local test and the export should have attbitues, it looks like you might have tried to copy out the internal expression:

{
  "hasWriteAccess": false,
  "id": 4,
  "name": "Test IR",
  "description": "This is a description!",
  "expression": "{\"ConceptSets\":[],\"targetIds\":[5],\"outcomeIds\":[6],\"timeAtRisk\":{\"start\":{\"DateField\":\"StartDate\",\"Offset\":0},\"end\":{\"DateField\":\"StartDate\",\"Offset\":30}},\"studyWindow\":null,\"strata\":[],\"targetCohorts\":[{\"id\":5,\"name\":\"NASIDS\",\"hasWriteAccess\":false,\"expressionType\":\"SIMPLE_EXPRESSION\",\"expression\":{\"cdmVersionRange\":\">=5.0.0\",\"PrimaryCriteria\":{\"CriteriaList\":[{\"DrugEra\":{\"CodesetId\":0}}],\"ObservationWindow\":{\"PriorDays\":0,\"PostDays\":0},\"PrimaryCriteriaLimit\":{\"Type\":\"All\"}},\"ConceptSets\":[{\"id\":0,\"name\":\"NSAIDs\\n\",\"expression\":{\"items\":[{\"concept\":{\"CONCEPT_ID\":1118084,\"CONCEPT_NAME\":\"celecoxib\",\"STANDARD_CONCEPT\":\"S\",\"STANDARD_CONCEPT_CAPTION\":\"Standard\",\"INVALID_REASON\":\"V\",\"INVALID_REASON_CAPTION\":\"Valid\",\"CONCEPT_CODE\":\"140587\",\"DOMAIN_ID\":\"Drug\",\"VOCABULARY_ID\":\"RxNorm\",\"CONCEPT_CLASS_ID\":\"Ingredient\"},\"isExcluded\":false,\"includeDescendants\":true,\"includeMapped\":false},{\"concept\":{\"CONCEPT_ID\":1124300,\"CONCEPT_NAME\":\"Diclofenac\",\"STANDARD_CONCEPT\":\"S\",\"STANDARD_CONCEPT_CAPTION\":\"Standard\",\"INVALID_REASON\":\"V\",\"INVALID_REASON_CAPTION\":\"Valid\",\"CONCEPT_CODE\":\"3355\",\"DOMAIN_ID\":\"Drug\",\"VOCABULARY_ID\":\"RxNorm\",\"CONCEPT_CLASS_ID\":\"Ingredient\"},\"isExcluded\":false,\"includeDescendants\":true,\"includeMapped\":false}]}}],\"QualifiedLimit\":{\"Type\":\"First\"},\"ExpressionLimit\":{\"Type\":\"All\"},\"InclusionRules\":[],\"EndStrategy\":{\"DateOffset\":{\"DateField\":\"EndDate\",\"Offset\":0}},\"CensoringCriteria\":[],\"CollapseSettings\":{\"CollapseType\":\"ERA\",\"EraPad\":0},\"CensorWindow\":{}}}],\"outcomeCohorts\":[{\"id\":6,\"name\":\"GI Bleed\",\"hasWriteAccess\":false,\"expressionType\":\"SIMPLE_EXPRESSION\",\"expression\":{\"cdmVersionRange\":\">=5.0.0\",\"PrimaryCriteria\":{\"CriteriaList\":[{\"ConditionOccurrence\":{\"CodesetId\":0,\"ConditionTypeExclude\":false}}],\"ObservationWindow\":{\"PriorDays\":0,\"PostDays\":0},\"PrimaryCriteriaLimit\":{\"Type\":\"All\"}},\"ConceptSets\":[{\"id\":0,\"name\":\"Gastrointestinal hemorrhage\",\"expression\":{\"items\":[{\"concept\":{\"CONCEPT_ID\":192671,\"CONCEPT_NAME\":\"Gastrointestinal hemorrhage\",\"STANDARD_CONCEPT\":\"S\",\"STANDARD_CONCEPT_CAPTION\":\"Standard\",\"INVALID_REASON\":\"V\",\"INVALID_REASON_CAPTION\":\"Valid\",\"CONCEPT_CODE\":\"74474003\",\"DOMAIN_ID\":\"Condition\",\"VOCABULARY_ID\":\"SNOMED\",\"CONCEPT_CLASS_ID\":\"Clinical Finding\"},\"isExcluded\":false,\"includeDescendants\":true,\"includeMapped\":false}]}}],\"QualifiedLimit\":{\"Type\":\"First\"},\"ExpressionLimit\":{\"Type\":\"All\"},\"InclusionRules\":[],\"EndStrategy\":{\"DateOffset\":{\"DateField\":\"StartDate\",\"Offset\":1}},\"CensoringCriteria\":[],\"CollapseSettings\":{\"CollapseType\":\"ERA\",\"EraPad\":7},\"CensorWindow\":{}}}]}"
}

chrisknoll commented 1 year ago

I am able to export a 'template' IR analysis query that you can run, but I'd like to give it to you in the correct dialect. What DBMS are you using, and I'll provide you a template IR analysis query that you can substitute your local variables into it to see where the query is returning cases > target pop.

rkboyce commented 1 year ago

@chrisknoll - I was able to generate the SQL using my installation for one target cohort X one outcome cohort and then go through it. I think I might have found an issue that could be underlying this.

In our database, we use two ways to calculate observation periods coded as follows:

44814724    Period covering healthcare encounters
44814725    Period inferred by algorithm

The effect on the IR queries seems to be duplication of events. Here is where I see duplicates across the tables created during IR SQL queries:
- cteRawData e.g.,

target_id   outcome_id  subject_id  cohort_start_date   cohort_end_date time_at_risk    is_case
65  12  101 2015-02-18  2016-10-05  76  1
65  12  101 2015-02-18  2016-10-05  76  1

time_at_risk e.g.,

target_id   outcome_id  subject_id  cohort_start_date   cohort_end_date time_at_risk    is_case
65  12  101 2015-02-18  2016-10-05  76  1
65  12  101 2015-02-18  2016-10-05  76  1

analysis_events e.g.,

event_id    person_id   start_date  end_date    op_start_date   op_end_date target_concept_id   visit_occurrence_id
1   8440000284  2018-05-08  2018-12-31  2018-05-08  2018-05-29      
2   8440000284  2018-05-08  2018-12-31  2018-05-08  2018-12-31

strataCohorts e.g.,

strata_sequence person_id   event_id
0   8440000284  1
0   8440000284  2

This query that creates the results might need a distinct on the subject_id

INSERT INTO webapi_results_cases.ir_analysis_result (analysis_id, target_id, outcome_id, strata_mask, person_count, time_at_risk, cases)
select 48 as analysis_id, T.target_id, T.outcome_id, CAST(E.strata_mask AS bigint),
  COUNT(subject_id) as person_count,              -- ***** should this be distinct? *****
  CAST(ROUND(CAST(sum(1.0 * time_at_risk / 365.25) AS NUMERIC),0) AS BIGINT) as time_at_risk,
  sum(is_case) as cases
from time_at_risk T

Results

analysis_id target_id   outcome_id  strata_mask person_count    time_at_risk    cases
48  65  12  0   146663  370730  82
48  65  12  1   26106   71949   40

I reran all of the queries (shown below with my changes starred) and got different results that look more correct

analysis_id target_id   outcome_id  strata_mask person_count    time_at_risk    cases
48  65  12  0   1010    2621    1
48  65  12  1   465 1316    0

------ ATTEMPTED FIX

CREATE TEMP TABLE cohorts
AS
SELECT
cohort_id, is_outcome
FROM
(
  SELECT 65 as cohort_id, 0 as is_outcome
UNION
SELECT 12 as cohort_id, 1 as is_outcome
) C
;
ANALYZE cohorts
;
CREATE TEMP TABLE cteCohortCombos
AS
SELECT
t.cohort_id as target_id, o.cohort_id as outcome_id
FROM
cohorts t
CROSS JOIN cohorts o
where t.is_outcome = 0 and o.is_outcome = 1;
ANALYZE cteCohortCombos
;
CREATE TEMP TABLE cteCohortData
AS
SELECT
combos.target_id, combos.outcome_id, t.subject_id, t.cohort_start_date, t.cohort_end_date, t.adjusted_start_date, t.adjusted_end_date, op.observation_period_start_date as op_start_date, op.observation_period_end_date as op_end_date
FROM
cteCohortCombos combos
join (
  select cohort_definition_id, subject_id, cohort_start_date, cohort_end_date, (cohort_start_date + 0*INTERVAL'1 day') as adjusted_start_date, (cohort_end_date + 0*INTERVAL'1 day') as adjusted_end_date
  FROM (
    select cohort_definition_id, subject_id, cohort_start_date, cohort_end_date, row_number() over (partition by subject_id, cohort_definition_id order by cohort_start_date ASC) as ordinal
    FROM webapi_results_cases.cohort
    where cohort_definition_id in (select cohort_id from cohorts where is_outcome = 0)
  ) d
  where d.ordinal = 1
) t on t.cohort_definition_id = combos.target_id
join cdm_pace_cases.observation_period op on t.subject_id = op.person_id and t.cohort_start_date between op.observation_period_start_date and op.observation_period_end_date
              and op.period_type_concept_id = 44814724  -- ******* Just picking one type of op 
left join (
  select cohort_definition_id, subject_id, min(cohort_start_date) as cohort_start_date
  from webapi_results_cases.cohort
  GROUP BY cohort_definition_id, subject_id
) O on o.cohort_definition_id = combos.outcome_id
  and t.subject_id = o.subject_id
where (o.cohort_start_date is NULL or o.cohort_start_date > t.adjusted_start_date)
  and t.adjusted_start_date < t.adjusted_end_date
  and t.adjusted_start_date between op.observation_period_start_date and op.observation_period_end_date
  AND t.cohort_start_date >= TO_DATE(TO_CHAR(2015,'0000')||'-'||TO_CHAR(1,'00')||'-'||TO_CHAR(1,'00'), 'YYYY-MM-DD') AND t.cohort_start_date <= TO_DATE(TO_CHAR(2018,'0000')||'-'||TO_CHAR(12,'00')||'-'||TO_CHAR(31,'00'), 'YYYY-MM-DD');
ANALYZE cteCohortData
;

CREATE TEMP TABLE cteEndDates
AS
SELECT
target_id, outcome_id, subject_id, cohort_start_date, followup_end, is_case
FROM
(
  select target_id, outcome_id, subject_id, cohort_start_date, followup_end, is_case, row_number() over (partition by target_id, outcome_id, subject_id, cohort_start_date order by followup_end asc, is_case desc) as RN
  FROM (
    select combos.target_id, combos.outcome_id, t.subject_id, t.cohort_start_date, t.op_end_date as followup_end, 0 as is_case
    from cteCohortCombos combos
    join cteCohortData t on combos.target_id = t.target_id and combos.outcome_id = t.outcome_id
    UNION
    select combos.target_id, combos.outcome_id, t.subject_id, t.cohort_start_date, t.adjusted_end_date as followup_end, 0 as is_case
    from cteCohortCombos combos
    join cteCohortData t on combos.target_id = t.target_id and combos.outcome_id = t.outcome_id
    UNION
    select combos.target_id, combos.outcome_id, t.subject_id, t.cohort_start_date, o.cohort_start_date as followup_end, 1 as is_case
    from cteCohortCombos combos
    join cteCohortData t on combos.target_id = t.target_id and combos.outcome_id = t.outcome_id
    join (
      select cohort_definition_id, subject_id, min(cohort_start_date) as cohort_start_date
      from webapi_results_cases.cohort
      GROUP BY cohort_definition_id, subject_id
    ) O on o.cohort_definition_id = combos.outcome_id and t.subject_id = o.subject_id
    where o.cohort_start_date > t.adjusted_start_date
    UNION
select combos.target_id, combos.outcome_id, t.subject_id, t.cohort_start_date, TO_DATE(TO_CHAR(2018,'0000')||'-'||TO_CHAR(12,'00')||'-'||TO_CHAR(31,'00'), 'YYYY-MM-DD') as followup_end, 0 as is_case
FROM cteCohortCombos combos
JOIN  cteCohortData t on combos.target_id = t.target_id and combos.outcome_id = t.outcome_id
  ) RawData
) Result
WHERE Result.RN = 1;
ANALYZE cteEndDates
;

CREATE TEMP TABLE cteRawData
AS
SELECT
t.target_id, t.outcome_id, t.subject_id, t.cohort_start_date, t.cohort_end_date, (CAST(e.followup_end AS DATE) - CAST(t.adjusted_start_date AS DATE)) as time_at_risk, e.is_case
FROM
cteCohortData t
join cteEndDates e on t.target_id = e.target_id
  and t.outcome_id = e.outcome_id
  and t.subject_id = e.subject_id
  and t.cohort_start_date = e.cohort_start_date;
ANALYZE cteRawData
;

CREATE TEMP TABLE time_at_risk
AS
SELECT
target_id, outcome_id, subject_id, cohort_start_date, cohort_end_date, time_at_risk, is_case
FROM
cteRawData;
ANALYZE time_at_risk
;

-- from here, take all the people's person_id, start_date, end_date, create an 'events table'
CREATE TEMP TABLE analysis_events  (event_id BIGINT,
  person_id BIGINT,
  start_date DATE,
  end_date DATE,
  op_start_date DATE,
  op_end_date DATE,
  TARGET_CONCEPT_ID BIGINT,
  visit_occurrence_id BIGINT
);
INSERT INTO analysis_events (event_id, person_id, start_date, end_date, op_start_date, op_end_date, TARGET_CONCEPT_ID, visit_occurrence_id)
select row_number() over (partition by P.person_id order by P.start_date) as event_id, P.person_id, P.start_date, P.end_date, P.op_start_date, P.op_end_date, CAST(NULL as bigint) as TARGET_CONCEPT_ID, CAST(NULL as bigint) as visit_occurrence_id
FROM
(
  select distinct T.subject_id as person_id, T.cohort_start_date as start_date, T.cohort_end_date as end_date, OP.observation_period_start_date as op_start_date, OP.observation_period_end_date as op_end_date
  from time_at_risk T
  JOIN cdm_pace_cases.observation_period OP on T.subject_id = OP.person_id and T.cohort_start_date between OP.observation_period_start_date and OP.observation_period_end_date
           and OP.period_type_concept_id = 44814724  -- *********** Picking just one type of OP 
) P
;

-- create the stratifiction set
CREATE TEMP TABLE Codesets  (codeset_id int NOT NULL,
  concept_id bigint NOT NULL
)
;
INSERT INTO Codesets (codeset_id, concept_id)
SELECT 0 as codeset_id, c.concept_id FROM (select distinct I.concept_id FROM
( 
  select concept_id from vocabulary.CONCEPT where concept_id in (42538860)
) I
) C
;
CREATE TEMP TABLE strataCohorts 
 (strata_sequence int,
  person_id bigint,
  event_id bigint
)
;
INSERT INTO strataCohorts (strata_sequence, person_id, event_id)
select 0 as strata_id, person_id, event_id
FROM 
(
  select pe.person_id, pe.event_id
  FROM analysis_events pe
JOIN (
-- Begin Criteria Group
select 0 as index_id, person_id, event_id
FROM
(
  select E.person_id, E.event_id 
  FROM analysis_events E
  INNER JOIN
  (
    -- Begin Demographic Criteria
SELECT 0 as index_id, e.person_id, e.event_id
FROM analysis_events E
JOIN cdm_pace_cases.PERSON P ON P.PERSON_ID = E.PERSON_ID
WHERE (EXTRACT(YEAR FROM E.start_date) - P.year_of_birth >= 65 and EXTRACT(YEAR FROM E.start_date) - P.year_of_birth <= 69)
GROUP BY e.person_id, e.event_id
-- End Demographic Criteria
  ) CQ on E.person_id = CQ.person_id and E.event_id = CQ.event_id
  GROUP BY E.person_id, E.event_id
  HAVING COUNT(index_id) = 1
) G
-- End Criteria Group
) AC on AC.person_id = pe.person_id AND AC.event_id = pe.event_id
) Results
;
INSERT INTO strataCohorts (strata_sequence, person_id, event_id)
select 1 as strata_id, person_id, event_id
FROM 
(
  select pe.person_id, pe.event_id
  FROM analysis_events pe
JOIN (
-- Begin Criteria Group
select 0 as index_id, person_id, event_id
FROM
(
  select E.person_id, E.event_id 
  FROM analysis_events E
  INNER JOIN
  (
    -- Begin Demographic Criteria
SELECT 0 as index_id, e.person_id, e.event_id
FROM analysis_events E
JOIN cdm_pace_cases.PERSON P ON P.PERSON_ID = E.PERSON_ID
WHERE (EXTRACT(YEAR FROM E.start_date) - P.year_of_birth >= 70 and EXTRACT(YEAR FROM E.start_date) - P.year_of_birth <= 74)
GROUP BY e.person_id, e.event_id
-- End Demographic Criteria
  ) CQ on E.person_id = CQ.person_id and E.event_id = CQ.event_id
  GROUP BY E.person_id, E.event_id
  HAVING COUNT(index_id) = 1
) G
-- End Criteria Group
) AC on AC.person_id = pe.person_id AND AC.event_id = pe.event_id
) Results
;
INSERT INTO strataCohorts (strata_sequence, person_id, event_id)
select 2 as strata_id, person_id, event_id
FROM 
(
  select pe.person_id, pe.event_id
  FROM analysis_events pe
JOIN (
-- Begin Criteria Group
select 0 as index_id, person_id, event_id
FROM
(
  select E.person_id, E.event_id 
  FROM analysis_events E
  INNER JOIN
  (
    -- Begin Demographic Criteria
SELECT 0 as index_id, e.person_id, e.event_id
FROM analysis_events E
JOIN cdm_pace_cases.PERSON P ON P.PERSON_ID = E.PERSON_ID
WHERE (EXTRACT(YEAR FROM E.start_date) - P.year_of_birth >= 75 and EXTRACT(YEAR FROM E.start_date) - P.year_of_birth <= 79)
GROUP BY e.person_id, e.event_id
-- End Demographic Criteria
  ) CQ on E.person_id = CQ.person_id and E.event_id = CQ.event_id
  GROUP BY E.person_id, E.event_id
  HAVING COUNT(index_id) = 1
) G
-- End Criteria Group
) AC on AC.person_id = pe.person_id AND AC.event_id = pe.event_id
) Results
;
INSERT INTO strataCohorts (strata_sequence, person_id, event_id)
select 3 as strata_id, person_id, event_id
FROM 
(
  select pe.person_id, pe.event_id
  FROM analysis_events pe
JOIN (
-- Begin Criteria Group
select 0 as index_id, person_id, event_id
FROM
(
  select E.person_id, E.event_id 
  FROM analysis_events E
  INNER JOIN
  (
    -- Begin Demographic Criteria
SELECT 0 as index_id, e.person_id, e.event_id
FROM analysis_events E
JOIN cdm_pace_cases.PERSON P ON P.PERSON_ID = E.PERSON_ID
WHERE (EXTRACT(YEAR FROM E.start_date) - P.year_of_birth >= 80 and EXTRACT(YEAR FROM E.start_date) - P.year_of_birth <= 84)
GROUP BY e.person_id, e.event_id
-- End Demographic Criteria
  ) CQ on E.person_id = CQ.person_id and E.event_id = CQ.event_id
  GROUP BY E.person_id, E.event_id
  HAVING COUNT(index_id) = 1
) G
-- End Criteria Group
) AC on AC.person_id = pe.person_id AND AC.event_id = pe.event_id
) Results
;
INSERT INTO strataCohorts (strata_sequence, person_id, event_id)
select 4 as strata_id, person_id, event_id
FROM 
(
  select pe.person_id, pe.event_id
  FROM analysis_events pe
JOIN (
-- Begin Criteria Group
select 0 as index_id, person_id, event_id
FROM
(
  select E.person_id, E.event_id 
  FROM analysis_events E
  INNER JOIN
  (
    -- Begin Demographic Criteria
SELECT 0 as index_id, e.person_id, e.event_id
FROM analysis_events E
JOIN cdm_pace_cases.PERSON P ON P.PERSON_ID = E.PERSON_ID
WHERE EXTRACT(YEAR FROM E.start_date) - P.year_of_birth >= 85
GROUP BY e.person_id, e.event_id
-- End Demographic Criteria
  ) CQ on E.person_id = CQ.person_id and E.event_id = CQ.event_id
  GROUP BY E.person_id, E.event_id
  HAVING COUNT(index_id) = 1
) G
-- End Criteria Group
) AC on AC.person_id = pe.person_id AND AC.event_id = pe.event_id
) Results
;
INSERT INTO strataCohorts (strata_sequence, person_id, event_id)
select 5 as strata_id, person_id, event_id
FROM 
(
  select pe.person_id, pe.event_id
  FROM analysis_events pe
JOIN (
-- Begin Criteria Group
select 0 as index_id, person_id, event_id
FROM
(
  select E.person_id, E.event_id 
  FROM analysis_events E
  INNER JOIN
  (
    -- Begin Demographic Criteria
SELECT 0 as index_id, e.person_id, e.event_id
FROM analysis_events E
JOIN cdm_pace_cases.PERSON P ON P.PERSON_ID = E.PERSON_ID
WHERE P.race_concept_id in (8527) AND P.race_concept_id in (8527)
GROUP BY e.person_id, e.event_id
-- End Demographic Criteria
  ) CQ on E.person_id = CQ.person_id and E.event_id = CQ.event_id
  GROUP BY E.person_id, E.event_id
  HAVING COUNT(index_id) = 1
) G
-- End Criteria Group
) AC on AC.person_id = pe.person_id AND AC.event_id = pe.event_id
) Results
;
INSERT INTO strataCohorts (strata_sequence, person_id, event_id)
select 6 as strata_id, person_id, event_id
FROM 
(
  select pe.person_id, pe.event_id
  FROM analysis_events pe
JOIN (
-- Begin Criteria Group
select 0 as index_id, person_id, event_id
FROM
(
  select E.person_id, E.event_id 
  FROM analysis_events E
  INNER JOIN
  (
    -- Begin Demographic Criteria
SELECT 0 as index_id, e.person_id, e.event_id
FROM analysis_events E
JOIN cdm_pace_cases.PERSON P ON P.PERSON_ID = E.PERSON_ID
WHERE P.gender_concept_id in (8507)
GROUP BY e.person_id, e.event_id
-- End Demographic Criteria
  ) CQ on E.person_id = CQ.person_id and E.event_id = CQ.event_id
  GROUP BY E.person_id, E.event_id
  HAVING COUNT(index_id) = 1
) G
-- End Criteria Group
) AC on AC.person_id = pe.person_id AND AC.event_id = pe.event_id
) Results
;
INSERT INTO strataCohorts (strata_sequence, person_id, event_id)
select 7 as strata_id, person_id, event_id
FROM 
(
  select pe.person_id, pe.event_id
  FROM analysis_events pe
JOIN (
-- Begin Criteria Group
select 0 as index_id, person_id, event_id
FROM
(
  select E.person_id, E.event_id 
  FROM analysis_events E
  INNER JOIN
  (
    -- Begin Correlated Criteria
select 0 as index_id, cc.person_id, cc.event_id
from (SELECT p.person_id, p.event_id 
FROM analysis_events P
JOIN (
  -- Begin Payer Plan Period Criteria
select C.person_id, C.payer_plan_period_id as event_id, C.payer_plan_period_start_date as start_date, C.payer_plan_period_end_date as end_date,
       CAST(NULL as bigint) as visit_occurrence_id, C.payer_plan_period_start_date as sort_date
from
(
  select ppp.*, row_number() over (PARTITION BY ppp.person_id ORDER BY ppp.payer_plan_period_start_date) as ordinal
  FROM cdm_pace_cases.PAYER_PLAN_PERIOD ppp
) C
-- End Payer Plan Period Criteria
) A on A.person_id = P.person_id  AND A.START_DATE >= P.OP_START_DATE AND A.START_DATE <= P.OP_END_DATE AND A.START_DATE >= P.OP_START_DATE AND A.START_DATE <= P.OP_END_DATE ) cc 
GROUP BY cc.person_id, cc.event_id
HAVING COUNT(cc.event_id) >= 1
-- End Correlated Criteria
  ) CQ on E.person_id = CQ.person_id and E.event_id = CQ.event_id
  GROUP BY E.person_id, E.event_id
  HAVING COUNT(index_id) = 1
) G
-- End Criteria Group
) AC on AC.person_id = pe.person_id AND AC.event_id = pe.event_id
) Results
;
INSERT INTO strataCohorts (strata_sequence, person_id, event_id)
select 8 as strata_id, person_id, event_id
FROM 
(
  select pe.person_id, pe.event_id
  FROM analysis_events pe
JOIN (
-- Begin Criteria Group
select 0 as index_id, person_id, event_id
FROM
(
  select E.person_id, E.event_id 
  FROM analysis_events E
  INNER JOIN
  (
    -- Begin Correlated Criteria
select 0 as index_id, cc.person_id, cc.event_id
from (SELECT p.person_id, p.event_id 
FROM analysis_events P
JOIN (
  -- Begin Measurement Criteria
select C.person_id, C.measurement_id as event_id, C.measurement_date as start_date, (C.measurement_date + 1*INTERVAL'1 day') as END_DATE,
       C.visit_occurrence_id, C.measurement_date as sort_date
from 
(
  select m.* 
  FROM cdm_pace_cases.MEASUREMENT m
JOIN Codesets cs on (m.measurement_concept_id = cs.concept_id and cs.codeset_id = 0)
) C
WHERE C.value_as_number >= 3.0000
-- End Measurement Criteria
) A on A.person_id = P.person_id  AND A.START_DATE >= P.OP_START_DATE AND A.START_DATE <= P.OP_END_DATE AND A.START_DATE >= P.OP_START_DATE AND A.START_DATE <= P.OP_END_DATE ) cc 
GROUP BY cc.person_id, cc.event_id
HAVING COUNT(cc.event_id) >= 1
-- End Correlated Criteria
  ) CQ on E.person_id = CQ.person_id and E.event_id = CQ.event_id
  GROUP BY E.person_id, E.event_id
  HAVING COUNT(index_id) = 1
) G
-- End Criteria Group
) AC on AC.person_id = pe.person_id AND AC.event_id = pe.event_id
) Results
;

-- join back the followup to the stratification and write to the results page.
DELETE FROM webapi_results_cases.ir_analysis_result where analysis_id = 48;
INSERT INTO webapi_results_cases.ir_analysis_result (analysis_id, target_id, outcome_id, strata_mask, person_count, time_at_risk, cases)
select 48 as analysis_id, T.target_id, T.outcome_id, CAST(E.strata_mask AS bigint),
  COUNT(distinct subject_id) as person_count,                                                                       -- ******* changed to distinct
  CAST(ROUND(CAST(sum(1.0 * time_at_risk / 365.25) AS NUMERIC),0) AS BIGINT) as time_at_risk,
  sum(is_case) as cases
from time_at_risk T
JOIN (
  select E.event_id, E.person_id, E.start_date, E.end_date, SUM(coalesce(POWER(cast(2 as bigint), SC.strata_sequence), 0)) as strata_mask
  FROM analysis_events E
  LEFT JOIN strataCohorts SC on SC.person_id = E.person_id and SC.event_id = E.event_id
  group by E.event_id, E.person_id, E.start_date, E.end_date
) E on T.subject_id = E.person_id and T.cohort_start_date = E.start_date and T.cohort_end_date = E.end_date
GROUP BY T.target_id, T.outcome_id, E.strata_mask
;
-- note in the case of no stratification (no rows in strataCohorts temp table), everyone will have strata_mask of 0.
-- calculate the individual strata counts from the raw person data. Rows from #strataCohorts are used to find counts for each strata
delete from webapi_results_cases.ir_analysis_strata_stats where analysis_id = 48;
insert into webapi_results_cases.ir_analysis_strata_stats (analysis_id, target_id, outcome_id, strata_sequence, person_count, time_at_risk, cases)
select irs.analysis_id, combos.target_id, combos.outcome_id, irs.strata_sequence, coalesce(T.person_count, 0) as person_count, CAST(coalesce(T.time_at_risk, 0) AS bigint) as time_at_risk, coalesce(T.cases, 0) as cases
from webapi_results_cases.ir_strata irs
cross join (
  select t.cohort_id as target_id, o.cohort_id as outcome_id
  FROM cohorts t
  CROSS JOIN cohorts o
  where t.is_outcome = 0 and o.is_outcome = 1 
) combos
left join
(
  select T.target_id, T.outcome_id, S.strata_sequence, count(S.event_id) as person_count, sum(1.0 * T.time_at_risk / 365.25) as time_at_risk, sum(T.is_case) as cases
  from analysis_events E
  JOIN strataCohorts S on S.person_id = E.person_id and E.event_id = S.event_id
  join time_at_risk T on T.subject_id = E.person_id and T.cohort_start_date = E.start_date and T.cohort_end_date = E.end_date
  group by T.target_id, T.outcome_id, S.strata_sequence
) T on irs.strata_sequence = T.strata_sequence and T.target_id = combos.target_id and T.outcome_id = combos.outcome_id
WHERE irs.analysis_id = 48
;
-- calculate distributions for TAR and TTO by strata
DELETE FROM webapi_results_cases.ir_analysis_dist where analysis_id = 48;
-- dist_type 1: time at risk
CREATE TEMP TABLE cteRawDataTAR
AS
SELECT
p.target_id, p.outcome_id, p.strata_sequence, p.count_value
FROM
(
  select T.target_id, T.outcome_id, -1 as strata_sequence, T.time_at_risk as count_value
  from time_at_risk T
  UNION ALL
  select T.target_id, T.outcome_id, S.strata_sequence, T.time_at_risk as count_value
  from analysis_events E
  JOIN strataCohorts S on E.person_id = S.person_id and E.event_id = S.event_id
  join time_at_risk T on T.subject_id = E.person_id and T.cohort_start_date = E.start_date and T.cohort_end_date = E.end_date
) p;
ANALYZE cteRawDataTAR
;
CREATE TEMP TABLE overallStatsTAR
AS
SELECT
target_id,
  outcome_id,
  strata_sequence,
  avg(1.0 * count_value) as avg_value,
  STDDEV(count_value) as stdev_value,
  min(count_value) as min_value,
  max(count_value) as max_value,
  COUNT(*) as total
FROM
cteRawDataTAR
group by target_id, outcome_id, strata_sequence;
ANALYZE overallStatsTAR
;
CREATE TEMP TABLE statsTAR
AS
SELECT
target_id, outcome_id, strata_sequence, count_value, COUNT(*) as total, row_number() over (partition by target_id, outcome_id, strata_sequence order by count_value) as rn
FROM
cteRawDataTAR
group by target_id, outcome_id, strata_sequence, count_value;
ANALYZE statsTAR
;
CREATE TEMP TABLE priorStatsTAR
AS
SELECT
s.target_id, s.outcome_id, s.strata_sequence, s.count_value, s.total, sum(p.total) as accumulated
FROM
statsTAR s
join statsTAR p on s.target_id = p.target_id and s.outcome_id = p.outcome_id and s.strata_sequence = p.strata_sequence and p.rn <= s.rn
group by s.target_id, s.outcome_id, s.strata_sequence, s.count_value, s.total, s.rn;
ANALYZE priorStatsTAR
;
CREATE TEMP TABLE tempTARDist
AS
SELECT
o.target_id,
  o.outcome_id,
  o.strata_sequence,
  o.total,
  CAST(o.avg_value AS NUMERIC) as avg_value,
  CAST(coalesce(o.stdev_value, 0.0) AS NUMERIC) as stdev_value,
  o.min_value,
  MIN(case when p.accumulated >= .10 * o.total then count_value else o.max_value end) as p10_value,
  MIN(case when p.accumulated >= .25 * o.total then count_value else o.max_value end) as p25_value,
  MIN(case when p.accumulated >= .50 * o.total then count_value else o.max_value end) as median_value,
  MIN(case when p.accumulated >= .75 * o.total then count_value else o.max_value end) as p75_value,
  MIN(case when p.accumulated >= .90 * o.total then count_value else o.max_value end) as p90_value,
  o.max_value
FROM
priorStatsTAR p
join overallStatsTAR o on p.target_id = o.target_id and p.outcome_id = o.outcome_id and p.strata_sequence = o.strata_sequence
GROUP BY o.target_id, o.outcome_id, o.strata_sequence, o.total, o.min_value, o.max_value, o.avg_value, o.stdev_value;
ANALYZE tempTARDist
;
INSERT INTO webapi_results_cases.ir_analysis_dist (analysis_id, dist_type, target_id, outcome_id, strata_sequence, total, avg_value, std_dev,min_value, p10_value, p25_value, median_value, p75_value, p90_value,max_value)
select 48 as analysis_id, 1 as dist_type, combos.target_id, combos.outcome_id, 
  case when d.strata_sequence = -1 then NULL else d.strata_sequence end as strata_sequence, 
  d.total, d.avg_value, d.stdev_value, d.min_value, d.p10_value, d.p25_value, d.median_value, d.p75_value, d.p90_value, d.max_value
FROM 
(
  select t.cohort_id as target_id, o.cohort_id as outcome_id
  FROM cohorts t
  CROSS JOIN cohorts o
  where t.is_outcome = 0 and o.is_outcome = 1 
) combos
JOIN tempTARDist d on combos.target_id = d.target_id and combos.outcome_id = d.outcome_id
;
-- dist_type 2: TTO (time to outcome)
CREATE TEMP TABLE cteRawDataTTO
AS
SELECT
p.target_id, p.outcome_id, p.strata_sequence, p.count_value
FROM
(
  select T.target_id, T.outcome_id, -1 as strata_sequence, T.time_at_risk as count_value
  from time_at_risk T 
  where T.is_case = 1
  UNION ALL
  select T.target_id, T.outcome_id, S.strata_sequence, T.time_at_risk as count_value
  from analysis_events E
  JOIN strataCohorts S on E.person_id = S.person_id and E.event_id = S.event_id
  join time_at_risk T on T.subject_id = E.person_id and T.cohort_start_date = E.start_date and T.cohort_end_date = E.end_date
  where T.is_case = 1
) p;
ANALYZE cteRawDataTTO
;
CREATE TEMP TABLE overallStatsTTO
AS
SELECT
target_id,
  outcome_id,
  strata_sequence,
  avg(1.0 * count_value) as avg_value,
  STDDEV(count_value) as stdev_value,
  min(count_value) as min_value,
  max(count_value) as max_value,
  COUNT(*) as total
FROM
cteRawDataTTO
group by target_id, outcome_id, strata_sequence;
ANALYZE overallStatsTTO
;
CREATE TEMP TABLE statsTTO
AS
SELECT
target_id, outcome_id, strata_sequence, count_value, COUNT(*) as total, row_number() over (partition by target_id, outcome_id, strata_sequence order by count_value) as rn
FROM
cteRawDataTTO
group by target_id, outcome_id, strata_sequence, count_value;
ANALYZE statsTTO
;
CREATE TEMP TABLE priorStatsTTO
AS
SELECT
s.target_id, s.outcome_id, s.strata_sequence, s.count_value, s.total, sum(p.total) as accumulated
FROM
statsTTO s
join statsTTO p on s.target_id = p.target_id and s.outcome_id = p.outcome_id and s.strata_sequence = p.strata_sequence and p.rn <= s.rn
group by s.target_id, s.outcome_id, s.strata_sequence, s.count_value, s.total, s.rn;
ANALYZE priorStatsTTO
;
CREATE TEMP TABLE tempTTODist
AS
SELECT
o.target_id,
  o.outcome_id,
  o.strata_sequence,
  o.total,
  CAST(o.avg_value AS NUMERIC) as avg_value,
  CAST(coalesce(o.stdev_value, 0.0) AS NUMERIC) as stdev_value,
  o.min_value,
  MIN(case when p.accumulated >= .10 * o.total then count_value else o.max_value end) as p10_value,
  MIN(case when p.accumulated >= .25 * o.total then count_value else o.max_value end) as p25_value,
  MIN(case when p.accumulated >= .50 * o.total then count_value else o.max_value end) as median_value,
  MIN(case when p.accumulated >= .75 * o.total then count_value else o.max_value end) as p75_value,
  MIN(case when p.accumulated >= .90 * o.total then count_value else o.max_value end) as p90_value,
  o.max_value
FROM
priorStatsTTO p
join overallStatsTTO o on p.target_id = o.target_id and p.outcome_id = o.outcome_id and p.strata_sequence = o.strata_sequence
GROUP BY o.target_id, o.outcome_id, o.strata_sequence, o.total, o.min_value, o.max_value, o.avg_value, o.stdev_value
;
ANALYZE tempTTODist
;
INSERT INTO webapi_results_cases.ir_analysis_dist (analysis_id, dist_type, target_id, outcome_id, strata_sequence, total, avg_value, std_dev,min_value, p10_value, p25_value, median_value, p75_value, p90_value,max_value)
select 48 as analysis_id, 2 as dist_type, combos.target_id, combos.outcome_id, 
  case when d.strata_sequence = -1 then NULL else d.strata_sequence end as strata_sequence, 
  d.total, d.avg_value, d.stdev_value, d.min_value, d.p10_value, d.p25_value, d.median_value, d.p75_value, d.p90_value, d.max_value
FROM 
(
  select t.cohort_id as target_id, o.cohort_id as outcome_id
  FROM cohorts t
  CROSS JOIN cohorts o
  where t.is_outcome = 0 and o.is_outcome = 1 
) combos
JOIN tempTTODist d on combos.target_id = d.target_id and combos.outcome_id = d.outcome_id
;

chrisknoll commented 1 year ago

Thanks. We don't support overlapping observation periods, and I believe that goes against the specification.

Specifically:

Each Person can have more than one valid OBSERVATION_PERIOD record, but no two observation periods can overlap in time for a given person.

The way you defined observation periods will not only impact IR calculations, but all HADES packages assume these rules around Observation Periods, and violating those standards will result in incorrect results.

rkboyce commented 1 year ago

Ok - I missed that! I will drop one of the observation periods today and retest with the multiple T and O IR analyses. If all counts check out, I will close this ticket. Glad we found that out

OHDSI / WebAPI

Issues with IR person counts and cases output #2295

Expected behavior

Actual behavior