CMS: Medicare - Visit_detail rollup to visit_occurrence - SN visits needs to merge dates with 1 day differences as one

stephanieshong commented 11 months ago

SN claims are filed in monthly chunks and the multiple claims exist per patient for the same duration of claim that last more than a month. Merge all these visits into one visit as they are continuous claim for the same patient with multi month stay. For example, as the calendar date moves from Feb 28 to March 1 we have two claims instead of one. Therefore, these monthly claims should be merged to into one visit instead of having multiple visits from the claim dates found in the record for the same patient. They are continuous visit with one day difference for the same patient therefore only one visit should be created for the visit_occurrence, but the details are kept in the visit_detail such that the rollup visit date can be checked and used to match the data getting inserted into the OMOP domains. (visit detail rollup to visit_occurrence)

chrisroederucdenver commented 11 months ago

Status as of 7/21: in the CMS (Medicare) Logic repo under [PPRL] CMS Data & Repo, there is a branch cr/Merge_adjacent_SN_visits

The file test/Merge_SN_visits.sql is early work to adapt macro_visit code/algorithms here that is continued in the following:

visit_detaill.sql that puts the vists from SN into visit_detail instead of visit_occurrence as before
sn_visit_spans.sql that identifies the spans of merged visits, reading visit_detail
sn_visit_spans_with_detail_id that links the rows in visit_detail (that have the monthly breaks) with rows in sn_visit_spans.

Current work will take that forward and create rows in visit_occurrence from the spans, maintaining the links in visit_detail. This will result in fewer visits in visit_occurrence from SN, which is what we want. Then, we need to make sure the code that populates the domains is aligned with the new visit spans. This will include the files observation.sql, procedure_occurrence.sql and others.

There is some analysis in a workbook to see how much the details for a combined visit change month-to-month. Some patients in some files are the same month to month, but you can find some variation too.

for the three files, a spot-check on 3 patients showed
- month-to-month consistency for sn_diagnosis_long
- no data for sn_prcdrcd_long
- some variation of HCPCS codes in sn_hcpscd_long for one that had many codes. The other two had the same single code for each segment. Since we have limited vocabulary for HCPCS, so just source_concepts, I plan to create a single visit_occurrence without worry, leaving the detail in visit_detail. Details in this regard can be seen in the cells with names starting "HCPCSfor" and "Dxfor".
Also "instances of different counts" has queries for all three on all three files unioned and commented out.

Big picture counts are in the cells/nodes

Compare_merged_to_unmerged (needs updating from the test file to visit_occurrence and visit_detail as dev. progresses TODO)

TODO create the SN visit_occurrences from a super-set of what comes in from the three files for prcdrcd, diagnosis and hcpscd.

chrisroederucdenver commented 11 months ago

BTW, further simple exploration of how windowing functions can be used for a rolling sub-total in SQL in stand-alone Spark here on github. I didn't see it explained in what I found that adding an "order by" in a windowing function that normally aggregates, like sum gives you partial aggregations over the order. Not having an order-by doesn't make sense or work with the rank() function because you need an order to assign a rank. If you leave if off with an aggregator like sum, every row gets the complete sum for the whole partitions. It's not that hard, and I'm surprised how many "easy SQL" answer sites on the web leave that detail out. Anyway, this capability in SQL is used to identify groups of visits that are overlapping or adjacent so we can combine them.

One difference between SN aggregation and macro-visits is that macro-visits pull ranges together when they overlap, whereas SN aggregation deals with adjacent visits. For example when one ends on the last day of a month, and the next starts on the first day of the next day of the month.

Another difference, if niggly, is that the dates are in string form in the CMS data and won't sort right if don't think to convert them to dates. For example in alphabetical order 01Jan2020 comes after 01Feb2020.

chrisroederucdenver commented 11 months ago

Investigating safety of having different HCPCS codes in different months of a merged visit. These map either to NHIV or EHR and if they are different should be different visits. The cell OK_merge_to_visit_occurrence in this workbook, shows that they all are codes that map to NHIV. So we have a good start on a visit_occurrence table here.

Four tasks to go forward.

One is cover the base where what looks like a single visit_occurrence merged from the visit_detail dates, must be more than one because of different codes as above (arguably not needed immediately).
Another is how to merge the resulting sn_visit_occurrence into visit_occurrence.
- could be as trivial as "insert into visit_occurrence select * from sn_visit_occurrence"
A third, and top-priority is to populate the visit_details rows with the visit_occurrence that they belong to.
- hopefully a trivial update statement, but does SparkSQL have an update? (no). Perhaps from Python, and how to get it into the right place in the sequence?
Fourth is to populate the domains to the merged visit in visit_occurrence instead of the separate parts in visit_detail as more/less currently done. (because what this code puts into visit_detail, existing code puts directly into visit_occurrence).

chrisroederucdenver commented 11 months ago

I got the pipeline from SN to visit_detail_prelim on to visit_occurrence and visit_detail with FK's back to visit_occurrence running. Then added some count testing. I found that you do need to be wary of null provider IDs, or your counts won't come out right. Easily enough fixed with a coalesce as is done in the ID creation.

More interestingly, of roughly 600,000 detail visits from SN run through this aggregation, 58 end up with two visit_occurrences (should be just one like the other 599942-odd). I'm going to put that aside for the moment since it's such a small number. It should be looked at on the off chance it's more complicated and worse.

That analysis is in this workbook, in a cell at the bottom querying into sn_visit_spans_with_detail_id.

chrisroederucdenver commented 11 months ago

Looking good today.

null provider_ids were causing counts to be off (nulls don't exist when counted), so I coalesced them to 0 early on in sn_visit_detail_prelim. @stephanieshong
my date diff logic in creating the spans was off
a group-by also in the logic for creating the spans was off. Fixing these results in lots of consistency in counts at the bottom of the worksheet mentioned above. Not just table counts, but joining between visit_detail an spans, note that one cell labelled SIXTY_TWO_DUPES comes up with an empty result, meaning no duplicate entries now.

chrisroederucdenver commented 11 months ago

Thursday: Wrote an SN_README.md file, cleaned up, found and fixed more minor issues.

Since we pull domain data from SN via more than just sn_hcpscd, by adding sn_diagnosis_long and sn_prcdrcd_long, we need to include the visit_detail rows from them. Out of 650k or so, this adds 8k.
I considered the nearly trivial potential for a bug when treating BID as a string vs person_id as a long and found that it's fine.

Friday: Continuing with the README to get a clearer picture. Found a possible bug or two. Have a better base for converting the domain table population code to using the merged visits.

other domains that pull from sn_hcpscd_long only join on NHIV visit types, and don't include ER
the observation domain joins with visit on the end date, and start date makes more sense to me (observation only has one date, not start and end. The visits have start and end.)

chrisroederucdenver commented 10 months ago

Monday (on-going): assuming that domain entries should span the wider visit_occurrences, not the visit_detail, so either we roll-up again, or include more info in the visit rollup. The latter so we can map from the rolled-up visit_occurrence to the the domain entries. Sounds good, but the de-normalized nature of SN may mean there are lists of domain events to associate with each visit_occurrence which that OMOP table isn't built for. Looking. Also don't forget the backlog which includes getting visitdetail from the other sn * _ long files..

In the notebook, a cell called DOMAIN_vs_VISIT_span shows that the different codes in a rolled_up visit_occurrence don't all have spans as long as the whole visit_occurrence. Obviously they are split on the monthly boundaries, but some have more than others. So either we put up with multiple domain_events at the level of visit_detail, or we roll them up. It might be an easy max(end_date) query to do so...

Discussed domain event roll-up with Stephanie and we decided against it for now. That is, if you have SN events that come together to form a 5 month long visit_occurrence from 5 separate visit_details, when you create for example condtion_occurrences, there would be 5 of them aligning with the visit_detail rows. In some cases, were the condition only appears part way through the visit_occurrnece, say for the last 2 months, you would get two. In either case they are contiguous and would remain in many parts just like visit_detail.

chrisroederucdenver commented 10 months ago

Remaining ToDos

consider and add the other sn_*_long files
write/copy counts for other domains like i did for condition_occurrence
merge the sn_visit_occurrence with the real visit_occurrence table (or just write to it directly...)
remove sn_condition_occurrence (merged yesterdaY)
rename sn_visit_detail to visit_detail
re-read the macro visit paper
why is observation's date from the end date?
Does other stuff like observation_period and macro_visit need attention?
Double check the flow
update and review the sn_README.md

stephanieshong commented 10 months ago

when you say rename sn_visit_detail to visit_detail, will you be adding source_domain in visit_detail so that we know this pertains to SN?

stephanieshong commented 10 months ago

Also, when you merge the sn_visit_occurrence with the real visit_occurrence , you will need to refresh all visit_occurrence from the SN claim source.

chrisroederucdenver commented 10 months ago

when you say rename sn_visit_detail to visit_detail, will you be adding source_domain in visit_detail so that we know this pertains to SN?

I had to look. It's in there (vist_detail_prelim) and carried through. In my DQ counts, those fields in visit_detail and visit_occurrence are quite useful! Thanks.

chrisroederucdenver commented 10 months ago

Also, when you merge the sn_visit_occurrence with the real visit_occurrence , you will need to refresh all visit_occurrence from the SN claim source.

It's just a single CTE in sn_visit_occurrence.sql, so I may just move that code into the visit_occurrence.sql file rather than having it as it's own table. If I understand your question, this should stitch all the dependencies for visit_occurrence into the graph. But you qualified that with the SN claim source... I'll ask on our meeting Tuesday (sooner if I'm blocked).

chrisroederucdenver commented 10 months ago

One to make sure gets into the SN_README. Procedure_occurrence and Observation only have a single date in OMOP, but both start and end in the SN file. Code for Procedure_occurrence uses the ADMSN_DT (start) date, and code for Observation uses the THRU_DT (end). Q: are these correct? I checked for Procedure that they are usually different. It may seem odd, but it's probably OK that the visit_details rows are created by both start and end as they come in from SN. You just have to know which is used if you're trying to join back from the one date to a visit (like when doing some QA).

chrisroederucdenver commented 10 months ago

No joy on Procedure_occurrence and observation testing. They are tricky because they only have one date, not a start/end pair like the others. I've been refining my tests by isolating to the SN source_domain, and in the case of observation, where it matters, to the omop-side domain. Since matching on date inequalities for these inflates things, I've added the visit_occurrence_id to the join, as I should, and things aren't right. I'll add some carry-through columns for testing. One for the other date and one for the visit_detail_id, and hope that makes it easier to see what's (not) going on. https://unite.nih.gov/workspace/vector/view/ri.vector.main.workbook.ece9e970-fa6b-40a9-a063-ed9031a72e6a?branch=master

chrisroederucdenver commented 10 months ago

It's not just ambiguity introduced b/c of the start date (in the case of observation) that isn't there to help distinguish rows, it's also the fact that many HCPCS codes in the Observation domain don't have concepts and get a 0 for no-matching-concept. To get counts to come up equal on input and output sides, the observations need to map 1:1 so you get as many in the target as source. When one of the dates is not recorded and many concepts map to NMC, it is now M:N where N<M. You get a lower number because the keys that things get mapped to overlap more often, resulting in fewer distinct key values, a lower count, and no reassurance from seeing the same number on either side. In other words, with a lossy mapping, you can't expect counts of distinct keys to come up the same. You can however, join observations to visits and expect the row count to be the same.

stephanieshong commented 10 months ago

https://github.com/National-COVID-Cohort-Collaborative/Data-Ingestion-and-Harmonization/issues/110#issuecomment-1675205610 yes, if you would qualified that with the SN claim source then we will know that visit came from SN. Thank you.

chrisroederucdenver commented 10 months ago

I spoke too soon about NMC in observation codes. I found and fixed other issues and then looked more directly for those values and didn't see them. Test workbook, SN_README.md and code are updated. PR submitted.

chrisroederucdenver commented 9 months ago

(back after a long weekend and paying attention to other things)

I learned a bit more about tests built into the pipeline, and running them found a few more issues to look into. Primarily, this involves duplicate keys in the observation and procedure_occurrence tables that probably has to do with the fact those tables have a single date, not start and end. A quick fix is to include the visit_detail_id in the hash for the PK, but I looked a little closer and found some other issues relating to open joins and possibly using ADMSN_DT instead of FROM_DT where we end up with multiple date-ranges that start on the same day...resulting in duplicate observations or procedures on the same day.

chrisroederucdenver commented 9 months ago

So adding visit_detail_id fixes the PK problem and is in the branch, built and tested. Currently testing FROM_DT instead of ADMSN_DT for SN, along with an open join in the final visit_occurrence creation.

In the future, I intend to write an explicit data model to keep the natural PKs (the columns used in the hash function to create the artificial PKs like visit_occurrence_id) straight, even though in many cases once the hash is right, you should just use the single-column artificial PK.

chrisroederucdenver commented 9 months ago

cut/paste from my Monarch weekly status below. Short story is I did a complete build on the branch last night and the failure it had is a side-path for some testing. Will have that fixed this morning, and we should be good to go.

finished fixing an problem with the keys in SN observation and procedure_occurrence. They lose one of a pair of dates describing a range (one uses the start, the other the end date) in the translation from CMS to OMOP and in some cases you get duplicates. Adding the missing date to code where it is available and to the creation of the surrogate key fixes this.
replaced ADMSN_DT with FROM_DT in CMS SN. The visit_details, when created with ADMSN_DT get the same start date. It looks and works better with FROM_DT.
found and fixed some relative rare open joins that caused some incorrect duplication. This happens in situations where you have to use the components of a compound key and don't have enough components to get to uniqueness. In this case the code is early enough in the process that an artificial or surrogate hasn't yet been created. The additional field comes from code we added earlier in the summer that creates ER visits in addition to NHIV visits. visit_concept_id is a key field in visit_occurrence, so with identical patient and dates, two visits are different if they have different visit_concept_ids.
- Though there are ddl files for OMOP available from OHDSI, we don't work with an ERD here and we don't extend it with the SN files and intermediate tables. My testing work started with efforts that clearly identify that natural keys, but I haven't taken it far enough. I have a personal license of Astah (formerly Jude), have imported the OMOP 5.4 tables, and will extend it for TL as that effort starts. The important step (and lesson) is to carefully refer to it when I find myself joining using compound keys. A step further is to guard the model carefully and respectfully, and notice the impact of changes.
- A related note is that I have asked Accumen for a model, though I haven't pressed the issue, and not gotten a response. It's hard to do in 5 minutes on a call with many people, but some detail about how things work might save us some time. The ADMSN_DT might be an example for this. I suspect they recommended using that field, and my intuition is that there is an easier solution for creating/identifying the aggregation of individual visit_detail rows into a visit_occurrence by using the ADMSN_DT field. I also suspect that that reasoning, if presented, was lost. C'est la vie.

chrisroederucdenver commented 9 months ago

SN build is complete PR ready to merge Workbook @stephanieshong

chrisroederucdenver commented 7 months ago

Merged Sep. 21 @stephanieshong I don't have permission to close.

stephanieshong commented 7 months ago

Merged, built and released.

stephanieshong commented 7 months ago

@chrisroederucdenver - closed. Thank you.

National-COVID-Cohort-Collaborative / Data-Ingestion-and-Harmonization

CMS: Medicare - Visit_detail rollup to visit_occurrence - SN visits needs to merge dates with 1 day differences as one #110