SAFEHR-data / FlowEHR

FlowEHR is a safe, secure & cloud-native development & deployment platform for digital healthcare research & innovation.
https://flowehr.io
Apache License 2.0

Data Lake to store ingested and transformed data #160

Closed marrobi closed 1 year ago

marrobi commented 1 year ago

As a data engineer I want a landing zone for raw data, and a place to store transformed data.

This should follow the medallion approach as outlined here: https://learn.microsoft.com/en-us/azure/databricks/lakehouse/medallion and here: https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/scenarios/cloud-scale-analytics/best-practices/data-lake-zones

As each deployment's requirements may differ, the zones and containers created at deployment time should be configurable.

For example:

| Zone   | Container          |
|--------|--------------------|
| Bronze | Raw                |
| Bronze | Conformed          |
| Silver | Validated          |
| Silver | Pseudonymised      |
| Gold   | Research Ready     |
| Gold   | Internal Analytics |
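
To make the "configurable zones and containers" idea concrete, here is a minimal Python sketch of how a deployment might declare its zones and derive storage container names from them. The `DATA_LAKE_ZONES` mapping, the `zone-container` naming scheme and both function names are assumptions for illustration only, not FlowEHR's actual configuration format. The length and character checks reflect Azure Storage container naming rules (3 to 63 characters; lowercase letters, digits and single hyphens).

```python
import re

# Hypothetical per-deployment config mirroring the example table above:
# each zone maps to the containers that should be created inside it.
DATA_LAKE_ZONES = {
    "Bronze": ["Raw", "Conformed"],
    "Silver": ["Validated", "Pseudonymised"],
    "Gold": ["Research Ready", "Internal Analytics"],
}


def container_name(zone: str, container: str) -> str:
    """Derive a container name that is valid for Azure Storage.

    The "<zone>-<container>" scheme is an assumption made for this sketch.
    """
    raw = f"{zone}-{container}".lower()
    # Collapse every run of disallowed characters into a single hyphen,
    # then trim hyphens from both ends.
    name = re.sub(r"[^a-z0-9]+", "-", raw).strip("-")
    if not 3 <= len(name) <= 63:
        raise ValueError(f"container name {name!r} must be 3-63 characters long")
    return name


def expand_containers(zones: dict[str, list[str]]) -> list[str]:
    """Flatten the zone config into the list of containers to create."""
    return [
        container_name(zone, container)
        for zone, containers in zones.items()
        for container in containers
    ]


if __name__ == "__main__":
    for name in expand_containers(DATA_LAKE_ZONES):
        print(name)
```

A deployment that only needs, say, a single bronze container would supply a smaller mapping, and the rest of the pipeline would not need to change.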

The data lake should be accessible from a Databricks cluster (mount point) and from ADF (linked service).
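
As a rough illustration of the Databricks side, ADLS Gen2 locations are addressed with `abfss://` URIs, and this is the form a mount point's source would take. The helper below is only a sketch: the function name, the storage account name `exampleaccount` and the container name `bronze-raw` are placeholders, and authentication (service principal, managed identity and so on) is deliberately left out.

```python
def abfss_uri(account: str, container: str, path: str = "") -> str:
    """Build the ADLS Gen2 URI for a container and an optional path within it."""
    base = f"abfss://{container}@{account}.dfs.core.windows.net"
    # Normalise the path so the result never contains a doubled slash.
    path = path.strip("/")
    return f"{base}/{path}" if path else f"{base}/"


if __name__ == "__main__":
    # Placeholder account and container names, for illustration only.
    print(abfss_uri("exampleaccount", "bronze-raw"))
    print(abfss_uri("exampleaccount", "bronze-raw", "/ingest/2023/"))
```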

I have started some exploration here: https://github.com/UCLH-Foundry/FlowEHR/compare/main...marrobi:FlowEHR:marrobi/adls-testing

@nels @tanya-borisova @damoodamoo @anastasiakuzn thoughts? I might look to do a PR based off the exploration work.

damoodamoo commented 1 year ago

Hiya! I'd rather us not discuss this during TWS as we're moving toward a prod deployment, and it's not in scope for TWS - and we have very tight deadlines ahead.

Obviously it's a useful feature for use cases other than TWS - but we would need to refactor the terraform somewhat to make the SQL store and the ADLS storage both optional, at least.

Can we revisit this in a few weeks?

marrobi commented 1 year ago

Agree, it's not part of the TWS.

Happy to revisit in two weeks. That said, as we know this is needed, are there any objections to me starting on it and opening a draft PR (if my time allows), to be reviewed when the team's time allows?

jjgriff93 commented 1 year ago

Design doc accepted: https://github.com/UCLH-Foundry/Garden-Path/blob/main/designs/data-lake.md