National-COVID-Cohort-Collaborative / Data-Ingestion-and-Harmonization

Data Ingestion and Harmonization
41 stars 12 forks source link

CMS: Medicare - Visit_detail rollup to visit_occurrence - SN visits needs to merge dates with 1 day differences as one #110

Closed stephanieshong closed 7 months ago

stephanieshong commented 11 months ago

SN claims are filed in monthly chunks and the multiple claims exist per patient for the same duration of claim that last more than a month. Merge all these visits into one visit as they are continuous claim for the same patient with multi month stay. For example, as the calendar date moves from Feb 28 to March 1 we have two claims instead of one. Therefore, these monthly claims should be merged to into one visit instead of having multiple visits from the claim dates found in the record for the same patient. They are continuous visit with one day difference for the same patient therefore only one visit should be created for the visit_occurrence, but the details are kept in the visit_detail such that the rollup visit date can be checked and used to match the data getting inserted into the OMOP domains. (visit detail rollup to visit_occurrence)

chrisroederucdenver commented 11 months ago

Status as of 7/21: in the CMS (Medicare) Logic repo under [PPRL] CMS Data & Repo, there is a branch cr/Merge_adjacent_SN_visits

The file test/Merge_SN_visits.sql is early work to adapt macro_visit code/algorithms here that is continued in the following:

Current work will take that forward and create rows in visit_occurrence from the spans, maintaining the links in visit_detail. This will result in fewer visits in visit_occurrence from SN, which is what we want. Then, we need to make sure the code that populates the domains is aligned with the new visit spans. This will include the files observation.sql, procedure_occurrence.sql and others.

There is some analysis in a workbook to see how much the details for a combined visit change month-to-month. Some patients in some files are the same month to month, but you can find some variation too.

Big picture counts are in the cells/nodes

TODO create the SN visit_occurrences from a super-set of what comes in from the three files for prcdrcd, diagnosis and hcpscd.

chrisroederucdenver commented 11 months ago

BTW, further simple exploration of how windowing functions can be used for a rolling sub-total in SQL in stand-alone Spark here on github. I didn't see it explained in what I found that adding an "order by" in a windowing function that normally aggregates, like sum gives you partial aggregations over the order. Not having an order-by doesn't make sense or work with the rank() function because you need an order to assign a rank. If you leave if off with an aggregator like sum, every row gets the complete sum for the whole partitions. It's not that hard, and I'm surprised how many "easy SQL" answer sites on the web leave that detail out. Anyway, this capability in SQL is used to identify groups of visits that are overlapping or adjacent so we can combine them.

One difference between SN aggregation and macro-visits is that macro-visits pull ranges together when they overlap, whereas SN aggregation deals with adjacent visits. For example when one ends on the last day of a month, and the next starts on the first day of the next day of the month.

Another difference, if niggly, is that the dates are in string form in the CMS data and won't sort right if don't think to convert them to dates. For example in alphabetical order 01Jan2020 comes after 01Feb2020.

chrisroederucdenver commented 11 months ago

Investigating safety of having different HCPCS codes in different months of a merged visit. These map either to NHIV or EHR and if they are different should be different visits. The cell OK_merge_to_visit_occurrence in this workbook, shows that they all are codes that map to NHIV. So we have a good start on a visit_occurrence table here.

Four tasks to go forward.

chrisroederucdenver commented 11 months ago

I got the pipeline from SN to visit_detail_prelim on to visit_occurrence and visit_detail with FK's back to visit_occurrence running. Then added some count testing. I found that you do need to be wary of null provider IDs, or your counts won't come out right. Easily enough fixed with a coalesce as is done in the ID creation.

More interestingly, of roughly 600,000 detail visits from SN run through this aggregation, 58 end up with two visit_occurrences (should be just one like the other 599942-odd). I'm going to put that aside for the moment since it's such a small number. It should be looked at on the off chance it's more complicated and worse.

That analysis is in this workbook, in a cell at the bottom querying into sn_visit_spans_with_detail_id.

chrisroederucdenver commented 11 months ago

Looking good today.

chrisroederucdenver commented 11 months ago

Thursday: Wrote an SN_README.md file, cleaned up, found and fixed more minor issues.

Friday: Continuing with the README to get a clearer picture. Found a possible bug or two. Have a better base for converting the domain table population code to using the merged visits.

chrisroederucdenver commented 10 months ago

Monday (on-going): assuming that domain entries should span the wider visit_occurrences, not the visit_detail, so either we roll-up again, or include more info in the visit rollup. The latter so we can map from the rolled-up visit_occurrence to the the domain entries. Sounds good, but the de-normalized nature of SN may mean there are lists of domain events to associate with each visit_occurrence which that OMOP table isn't built for. Looking. Also don't forget the backlog which includes getting visitdetail from the other sn * _ long files..

In the notebook, a cell called DOMAIN_vs_VISIT_span shows that the different codes in a rolled_up visit_occurrence don't all have spans as long as the whole visit_occurrence. Obviously they are split on the monthly boundaries, but some have more than others. So either we put up with multiple domain_events at the level of visit_detail, or we roll them up. It might be an easy max(end_date) query to do so...

Discussed domain event roll-up with Stephanie and we decided against it for now. That is, if you have SN events that come together to form a 5 month long visit_occurrence from 5 separate visit_details, when you create for example condtion_occurrences, there would be 5 of them aligning with the visit_detail rows. In some cases, were the condition only appears part way through the visit_occurrnece, say for the last 2 months, you would get two. In either case they are contiguous and would remain in many parts just like visit_detail.

chrisroederucdenver commented 10 months ago

Remaining ToDos

stephanieshong commented 10 months ago

when you say rename sn_visit_detail to visit_detail, will you be adding source_domain in visit_detail so that we know this pertains to SN?

stephanieshong commented 10 months ago

Also, when you merge the sn_visit_occurrence with the real visit_occurrence , you will need to refresh all visit_occurrence from the SN claim source.

chrisroederucdenver commented 10 months ago

when you say rename sn_visit_detail to visit_detail, will you be adding source_domain in visit_detail so that we know this pertains to SN?

I had to look. It's in there (vist_detail_prelim) and carried through. In my DQ counts, those fields in visit_detail and visit_occurrence are quite useful! Thanks.

chrisroederucdenver commented 10 months ago

Also, when you merge the sn_visit_occurrence with the real visit_occurrence , you will need to refresh all visit_occurrence from the SN claim source.

It's just a single CTE in sn_visit_occurrence.sql, so I may just move that code into the visit_occurrence.sql file rather than having it as it's own table. If I understand your question, this should stitch all the dependencies for visit_occurrence into the graph. But you qualified that with the SN claim source... I'll ask on our meeting Tuesday (sooner if I'm blocked).

chrisroederucdenver commented 10 months ago

One to make sure gets into the SN_README. Procedure_occurrence and Observation only have a single date in OMOP, but both start and end in the SN file. Code for Procedure_occurrence uses the ADMSN_DT (start) date, and code for Observation uses the THRU_DT (end). Q: are these correct? I checked for Procedure that they are usually different. It may seem odd, but it's probably OK that the visit_details rows are created by both start and end as they come in from SN. You just have to know which is used if you're trying to join back from the one date to a visit (like when doing some QA).

chrisroederucdenver commented 10 months ago

No joy on Procedure_occurrence and observation testing. They are tricky because they only have one date, not a start/end pair like the others. I've been refining my tests by isolating to the SN source_domain, and in the case of observation, where it matters, to the omop-side domain. Since matching on date inequalities for these inflates things, I've added the visit_occurrence_id to the join, as I should, and things aren't right. I'll add some carry-through columns for testing. One for the other date and one for the visit_detail_id, and hope that makes it easier to see what's (not) going on. https://unite.nih.gov/workspace/vector/view/ri.vector.main.workbook.ece9e970-fa6b-40a9-a063-ed9031a72e6a?branch=master

chrisroederucdenver commented 10 months ago

It's not just ambiguity introduced b/c of the start date (in the case of observation) that isn't there to help distinguish rows, it's also the fact that many HCPCS codes in the Observation domain don't have concepts and get a 0 for no-matching-concept. To get counts to come up equal on input and output sides, the observations need to map 1:1 so you get as many in the target as source. When one of the dates is not recorded and many concepts map to NMC, it is now M:N where N<M. You get a lower number because the keys that things get mapped to overlap more often, resulting in fewer distinct key values, a lower count, and no reassurance from seeing the same number on either side. In other words, with a lossy mapping, you can't expect counts of distinct keys to come up the same. You can however, join observations to visits and expect the row count to be the same.

stephanieshong commented 10 months ago

https://github.com/National-COVID-Cohort-Collaborative/Data-Ingestion-and-Harmonization/issues/110#issuecomment-1675205610 yes, if you would qualified that with the SN claim source then we will know that visit came from SN. Thank you.

chrisroederucdenver commented 10 months ago

I spoke too soon about NMC in observation codes. I found and fixed other issues and then looked more directly for those values and didn't see them. Test workbook, SN_README.md and code are updated. PR submitted.

chrisroederucdenver commented 9 months ago

(back after a long weekend and paying attention to other things)

I learned a bit more about tests built into the pipeline, and running them found a few more issues to look into. Primarily, this involves duplicate keys in the observation and procedure_occurrence tables that probably has to do with the fact those tables have a single date, not start and end. A quick fix is to include the visit_detail_id in the hash for the PK, but I looked a little closer and found some other issues relating to open joins and possibly using ADMSN_DT instead of FROM_DT where we end up with multiple date-ranges that start on the same day...resulting in duplicate observations or procedures on the same day.

chrisroederucdenver commented 9 months ago

So adding visit_detail_id fixes the PK problem and is in the branch, built and tested. Currently testing FROM_DT instead of ADMSN_DT for SN, along with an open join in the final visit_occurrence creation.

In the future, I intend to write an explicit data model to keep the natural PKs (the columns used in the hash function to create the artificial PKs like visit_occurrence_id) straight, even though in many cases once the hash is right, you should just use the single-column artificial PK.

chrisroederucdenver commented 9 months ago

cut/paste from my Monarch weekly status below. Short story is I did a complete build on the branch last night and the failure it had is a side-path for some testing. Will have that fixed this morning, and we should be good to go.

chrisroederucdenver commented 9 months ago

SN build is complete PR ready to merge Workbook @stephanieshong

chrisroederucdenver commented 7 months ago

Merged Sep. 21 @stephanieshong I don't have permission to close.

stephanieshong commented 7 months ago

Merged, built and released.

stephanieshong commented 7 months ago

@chrisroederucdenver - closed. Thank you.