CDCgov / phdi

https://cdcgov.github.io/dibbs-site/
Creative Commons Zero v1.0 Universal

[EPIC] Automate data reads/writes from IRIS using Azure Synapse #407

Closed emmastephenson closed 1 year ago

emmastephenson commented 1 year ago

Why are we doing this work?

For the LAC pilot to be successful, we need to read LAC patient and case data, and write extracts of eCR data that has passed through our pipeline. This is best accomplished through Azure Synapse, as it will allow us to read and transform data on a regular, scheduled basis using SQL/Python notebooks.

These tasks are likely to require more compute and memory than are available in our Azure Container Apps, hence the need to use Synapse. (Synapse is roughly the Azure equivalent of Databricks.)

Background and strategic fit

This work is necessary for:

  1. Data from our pipeline to be ingested into IRIS
  2. Patient and incident record linkage to be performed accurately

How does the user interact with this service?

Post-pilot, LAC systems engineers will interact directly with Azure Synapse to make any required tweaks or adjustments.

Acceptance Criteria (Requirements)

Once this epic is completed, the final step of our pipeline will be complete: the MPI and MCI will be seeded from IRIS, and TSV extracts will be sent from our data stores to populate IRIS eCR forms.

Solution Design Doc/Implementation Plan

Azure Synapse Analytics Link: https://drive.google.com/file/d/1bl5OkgZz-XyeRgN2A7SEAnJ5nIPobTvQ/view?usp=sharing

aathwal3 commented 1 year ago

Sprint Goals

emmastephenson commented 1 year ago

Update 4/14:

DanPaseltiner commented 1 year ago

Auth between Synapse and storage, as well as between Synapse and the Postgres MPI, has been solved.

DanPaseltiner commented 1 year ago

Work to get code for seeding MPI with LAC extract running in Synapse is happening now.

emmastephenson commented 1 year ago

Sprint goal for next sprint: Everything other than creating the TSVs that go to LAC is complete. Updates and joins on the eCR data store, MPI, and MCI are complete, so all the data is available.

The sprint after that: going from that data to LAC's custom TSV spec.

emmastephenson commented 1 year ago

Mid-sprint check-in: All Synapse jobs are working; there is just one clarification question on dates. The next step is to make sure the Synapse jobs run automatically. The dummy delta lake is merged.

emmastephenson commented 1 year ago

End of next sprint: All the Synapse jobs are running and working as intended.

Marcelle, Brandon, Robert, Nick C, Dan, and Kenneth will be working on this effort. (Marcelle, Robert, Kenneth, and probably Nick C for only half the sprint)

emmastephenson commented 1 year ago

Work to create TSV extracts for IRIS is in progress.

Work to filter by COVID labs is also in progress.
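For illustration only, here is a minimal sketch of what filtering lab rows by a set of COVID lab codes and writing them out as a TSV might look like. The column names, the `write_tsv` helper, and the codes in the example are hypothetical; LAC's actual TSV spec and the real code list are not described in this thread.

```python
import csv
import io

# Hypothetical column names; the actual LAC TSV spec is not given in this thread.
FIELDS = ["patient_id", "lab_code", "result"]


def write_tsv(rows, covid_codes, out):
    """Write rows whose lab_code is in covid_codes to `out` as TSV.

    Returns the number of data rows written (the header is not counted).
    """
    writer = csv.DictWriter(
        out,
        fieldnames=FIELDS,
        delimiter="\t",
        lineterminator="\n",
        extrasaction="ignore",  # silently drop keys that are not in FIELDS
    )
    writer.writeheader()
    count = 0
    for row in rows:
        if row.get("lab_code") in covid_codes:
            writer.writerow(row)
            count += 1
    return count


# Example usage with made-up rows and placeholder codes.
sample_rows = [
    {"patient_id": "p1", "lab_code": "X1", "result": "pos"},
    {"patient_id": "p2", "lab_code": "Y2", "result": "neg"},
]
buffer = io.StringIO()
written = write_tsv(sample_rows, {"X1"}, buffer)
```

In a Synapse notebook the same filter would more likely be expressed against a Spark DataFrame; the standard-library version is used here only so the sketch runs anywhere.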

emmastephenson commented 1 year ago

Goal for final sprint:

  1. Seed patient data from IRIS is properly inserted into the MPI
  2. All Synapse jobs are DevOps-ified
  3. The Synapse jobs are confirmed to work end-to-end

emmastephenson commented 1 year ago

We think it's done!