National-COVID-Cohort-Collaborative / Phenotype_Data_Acquisition

The repository for code and documentation produced by the N3C Phenotype and Data Acquisition workstream
60 stars 35 forks source link

control_map null values #221

Open chrisroederucdenver opened 1 year ago

chrisroederucdenver commented 1 year ago

Michael Kahn at CU and I looked into why the phenotyping code was leaving nulls in the control_map. Our finding is basically that the script ran out of qualified control individuals when mapping two controls to each case. We looked at the problem from at least two angles.

The first involved the four successively weaker stages of matching combinations of age, ethnicity, race and sex attributes. We added a column to the control_map table that indicates the level of match a buddy/control got during the assignment process. Our hypothesis included wondering if more controls would have been matched if there were another level where none of the attributes had to match. Adding that, we found a few more came through, but there still weren't enough controls. This led us to the second angle, the code that filters controls based on visits after 1/1/2018, and requiring at least two that are at least ten days apart. Weakening these created enough inventory to fill the map. Working through both convinced us the code works as intended.

The fix could either be loosen definition of a control (like we did when we played with the 10-day visit window) or, if there is concern about even weaker controls, then agree to allow NULL controls and understand why this is happening. And let analysts know about null controls so they can give some thought if this impacts analysis.

One enhancement to consider suggesting is adding the match criteria number (we called in catno in our work) with each match so that an analyst would have the opportunity to select the “strength” of the control they are using from “very strong” (all four match criteria are met) to “very weak” (include the last sex-only folks). Currently, this information is not made available in the data set sent to N3C from the site.

The following query shows the state of various tables in the process, and should help identify a site's specific performance. We found at the lowest level, sex only, that the balance of available controls differed by sex, as did the remaining cases. Because they didn't differ by the same fraction, one ran out. It's conceivable similar issues happen at higher levels. Using the additional column mentioned above, might help to identify where.

with cases as ( select "Cases" as label, sex, count(person_id)*2 as nrows from kratos_n3c_results.n3c_case_cohort group by label,sex ), controls as ( select "Controls" as label , sex, count( person_id) as nrows from kratos_n3c_results.n3c_control_cohort group by label,sex ) select sex, label, nrows from (select sex, label, nrows from cases UNION ALL select sex, label, nrows from controls) order by sex

mgkahn commented 1 year ago

To any site using the code posted by @chrisroederucdenver, replace kratos_n3c_results with @resultsDatabaseSchema.

mgkahn commented 1 year ago

Pasting here the results of the above query on Colorado OMOP data. It shows that for all three values for sex, there are insufficient number of controls to provide 2x matches against cases.

image