Uber Coverage and Coverage Report Issue

chrisroederucdenver commented 2 months ago

The following issues are all tied together by the idea of identifying documents and sections and what they contain, and then checking whether that content made it into OMOP.

#81 is about a new table to track which rule and CCDA document a row in OMOP came from
#75 is about #81 having a document ID to work with, and knowing what CCDA version the document is
#43 looks at the problem from the POV of new data arrival, or triage
- #46 (closed) is a duplicate
#45 (abandoned) was originally about a Google Sheet to track progress on what we can process, the task matrix
#51 (abandoned) is more about the task matrix
#48 asks us to investigate the logs and rule failures
#32 is also about the logs
#130 is about distinguishing missing mappings in the Coverage Report

The idea is that on arrival, we know what we should be mining out of a new document. On processing, we should be able to see what we have and haven't gotten out of it and, in some cases, an error message pointing to the cause. Some shortfalls will come from odd data or bugs in the code. Others, likely much more uniform, will come from code that hasn't been written yet.

Some of those tickets are just evolving ideas, others are ToDos:

[ ] #81 add a provenance table
[ ] #75 identify or create document IDs and link them with names and locations
[ ] ?? Rules need IDs, and the provenance table should include them.
[ ] ?? Log messages should include both file and rule IDs. Load the logs into a table so they are accessible from SQL.
[ ] ?? Do CCDA documents have IDs for the equivalent of OMOP rows? If so, we should collect those too. If not, the document ID/name gets us to the patient and date range, and maybe we need to collect the _concept_id and a specific date to pin down the exact part of the source document. The point is that the rule will identify the type of entity and attribute, like the value of a measurement, but a document may have 5 of these and you want to know which one, if possible.
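As a rough illustration of the ID-related ToDos above, here is a minimal Python sketch. It assumes a document ID can be derived by hashing the file contents, and that a log message can be captured as a record carrying both the document ID and the rule ID so it can later be loaded into a table. The names `make_doc_id`, `register_document`, `log_record`, the file name, the location and the rule ID are all hypothetical, not part of the project's code.

```python
import hashlib


def make_doc_id(content: bytes) -> str:
    """Hypothetical document ID: SHA-256 hex digest of the file contents."""
    return hashlib.sha256(content).hexdigest()


def register_document(registry: dict, name: str, location: str, content: bytes) -> str:
    """Create a document ID and link it with the file's name and location."""
    doc_id = make_doc_id(content)
    registry[doc_id] = {"name": name, "location": location}
    return doc_id


def log_record(doc_id: str, rule_id: str, message: str) -> dict:
    """Build a log entry that carries both IDs, ready to load into a SQL table."""
    return {"doc_id": doc_id, "rule_id": rule_id, "message": message}


# Made-up inputs, for illustration only.
registry = {}
doc_id = register_document(registry, "example.xml", "incoming/", b"<ClinicalDocument/>")
record = log_record(doc_id, "rule-A", "no value found")
```

Hashing the contents means the same file gets the same ID no matter where it is stored, which is one reason the varchar ID columns might be useful. Whether that is the right identity for a CCDA document is an open question in #75.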

That much is just data. The table needs a schema, queries and a visualization. Sketches below.

Schema: CREATE TABLE Provenance (doc_id varchar(100), rule_id varchar(100), ccda_path varchar(250), omop_table varchar(20), omop_row_id integer, ccda_entity_id varchar(100), run_date DATETIME, ccda_vocab, ccda_code, ccda_value, ccda_date). The IDs are varchars because that might be useful for hash-created IDs. The ccda_* columns are there so the CCDA data is in the database. The OMOP data will already be in the OMOP tables.
Query 1 selects data from each side and validates that they are the same.
Query 2 selects files and rules that don't have a match in OMOP, and joins to the logs to see if there is a reported error related to the missing data.
Query 3 summarizes Query 2 to show what we are and aren't able to transfer, characterized by the fraction of source data that made it across into OMOP.
Query 4 considers how domain_id routing and pre/post coordinated concepts mean there is not a 1:1 relationship between source (CCDA) and destination (OMOP) data.
Visualization: Using Query 3, a stacked bar graph by CCDA path or rule_id that shows how often such a path appears in the collection of processed documents, and how often that data made it into OMOP. Ideally, it's all 0% or 100%, but when a rule doesn't match the way a provider puts the data in CCDA, we'll see less than 100%.
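To make the schema and Queries 2 and 3 a bit more concrete, here is a minimal sketch using Python's built-in `sqlite3`. Several things in it are assumptions rather than decisions from this issue: the types on the four ccda_* columns that the schema leaves untyped, the convention that `omop_row_id` is NULL when a rule found data but no OMOP row was produced, and the `RunLog` table holding loaded log messages. The rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Provenance (
    doc_id         varchar(100),
    rule_id        varchar(100),
    ccda_path      varchar(250),
    omop_table     varchar(20),
    omop_row_id    integer,       -- assumption: NULL when no OMOP row was produced
    ccda_entity_id varchar(100),
    run_date       DATETIME,
    ccda_vocab     varchar(50),   -- assumed type, not given in the schema sketch
    ccda_code      varchar(50),   -- assumed type
    ccda_value     varchar(250),  -- assumed type
    ccda_date      DATETIME       -- assumed type
);
CREATE TABLE RunLog (             -- hypothetical table of loaded log messages
    doc_id  varchar(100),
    rule_id varchar(100),
    message varchar(250)
);
""")

# Made-up rows: rule-A fails on doc-2, rule-B succeeds on both documents.
conn.executemany(
    "INSERT INTO Provenance (doc_id, rule_id, ccda_path, omop_table, omop_row_id)"
    " VALUES (?, ?, ?, ?, ?)",
    [
        ("doc-1", "rule-A", "path-A", "measurement", 101),
        ("doc-2", "rule-A", "path-A", "measurement", None),
        ("doc-1", "rule-B", "path-B", "observation", 201),
        ("doc-2", "rule-B", "path-B", "observation", 202),
    ],
)
conn.execute("INSERT INTO RunLog VALUES ('doc-2', 'rule-A', 'value could not be parsed')")

# Query 2 sketch: source data with no OMOP match, plus any logged error.
missing = conn.execute("""
    SELECT p.doc_id, p.rule_id, l.message
    FROM Provenance p
    LEFT JOIN RunLog l ON l.doc_id = p.doc_id AND l.rule_id = p.rule_id
    WHERE p.omop_row_id IS NULL
    ORDER BY p.doc_id, p.rule_id
""").fetchall()

# Query 3 sketch: fraction of source data per rule that made it into OMOP.
coverage = conn.execute("""
    SELECT rule_id,
           COUNT(*)                          AS found,
           COUNT(omop_row_id)                AS loaded,
           1.0 * COUNT(omop_row_id) / COUNT(*) AS fraction
    FROM Provenance
    GROUP BY rule_id
    ORDER BY rule_id
""").fetchall()
```

The `coverage` rows are the kind of per-rule counts a stacked bar graph could be drawn from. Note that this sketch assumes one Provenance row per source element, which Query 4 says will not always hold.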

chrisroederucdenver commented 2 months ago

There's a follow-on to the Coverage Report that goes beyond looking at processing errors or development progress, and considers the "garbage" question: of the data elements that didn't make it into OMOP, which are left behind because there's no obvious place to put them where they will be found?

A follow-on to that is to evaluate the utility of that data and consider future work that would make it available.

More process-related info will be there, like the longer list of providers involved in care and the distinction between medication administration and medication request (order, prescription). Maybe marital status and religious affiliation. I'm not even thinking about stuff that DOES have a place in OMOP, like the ADT transfers (?) between units in the hospital (I barely know this stuff). And there are things that are less likely to make it into a de-identified database, like names, addresses and relatives.

chrisroederucdenver commented 2 months ago

@AdamLeeIT not to "Squirrel!" a rabbit hole here, but there will be a third branch to the comparison when we get EHR data from AoU for the same patients we're querying from the HIE networks. So we'll have OMOP data for those patients that AoU acquired from other sources, and we can compare against that.

chrisroederucdenver commented 1 month ago

In the FHIR group meetings, Stephanie mentioned traceability from resulting OMOP IDs back to source IDs in FHIR or CCDA. This ties directly into the linkage mentioned here for a coverage report. @AdamLeeIT The focus here is to be able to compare the two, but the same data will allow working back from the resulting OMOP rows into the source data.
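For the traceability direction, a lookup against the Provenance table sketched earlier in this issue might look like the following. This is only a sketch: `trace_to_source` is a hypothetical helper, the table here is cut down to the columns the lookup needs, and the single row is made up.

```python
import sqlite3


def trace_to_source(conn, omop_table, omop_row_id):
    """Return (doc_id, rule_id, ccda_path) for the source of one OMOP row."""
    return conn.execute(
        "SELECT doc_id, rule_id, ccda_path FROM Provenance"
        " WHERE omop_table = ? AND omop_row_id = ?",
        (omop_table, omop_row_id),
    ).fetchall()


conn = sqlite3.connect(":memory:")
# Reduced version of the Provenance table, just enough for the lookup.
conn.execute(
    "CREATE TABLE Provenance (doc_id varchar(100), rule_id varchar(100),"
    " ccda_path varchar(250), omop_table varchar(20), omop_row_id integer)"
)
conn.execute("INSERT INTO Provenance VALUES ('doc-1', 'rule-A', 'path-A', 'measurement', 101)")

source = trace_to_source(conn, "measurement", 101)
```

The function returns a list because one OMOP row may trace back to more than one source element when the mapping is not 1:1.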

cladteam / CCDA_OMOP_by_Python

Uber Coverage and Coverage Report Issue #82
