marrobi closed this 1 year ago
Hiya! I'd rather us not discuss this during TWS as we're moving toward a prod deployment, and it's not in scope for TWS - and we have very tight deadlines ahead.
Obviously it's a useful feature for use cases other than TWS - but we would need to refactor the terraform somewhat to make the SQL store and the ADLS storage both optional, at least.
Can we revisit this in a few weeks?
Agree, it's not part of the TWS.
Happy to revisit in two weeks, although as we know this is needed, I'm looking for any objections to me starting this and doing a draft PR (if my time allows), to be reviewed when the team's time allows?
Design doc accepted: https://github.com/UCLH-Foundry/Garden-Path/blob/main/designs/data-lake.md
As a data engineer I want a landing zone for raw data, and a place to store transformed data.
This should follow the medallion approach as outlined here: https://learn.microsoft.com/en-us/azure/databricks/lakehouse/medallion and here: https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/scenarios/cloud-scale-analytics/best-practices/data-lake-zones
As each deployment's requirements may differ, the zones and containers created at the time of deployment should be configurable.
For example:
The data lake should be accessible from a Databricks cluster (mount point) and ADF (linked service).
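To make the "configurable zones and containers" idea a bit more concrete, here is a rough Python sketch of how such a configuration could be modelled. This is not taken from the design doc or the exploration branch. The class names, the `bronze`/`silver`/`gold` zone names (borrowed from the medallion naming), the container names, the `<zone>-<container>` naming scheme, and the `examplelake` storage account are all hypothetical. The only thing assumed from Azure itself is the `abfss://<container>@<account>.dfs.core.windows.net/` URI shape that ADLS Gen2 uses.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """One data lake zone and the containers to create inside it (hypothetical model)."""
    name: str
    containers: tuple[str, ...]


@dataclass(frozen=True)
class DataLakeConfig:
    """Per-deployment description of the lake layout (hypothetical model)."""
    storage_account: str
    zones: tuple[Zone, ...]

    def container_names(self) -> list[str]:
        # Assumed naming scheme: "<zone>-<container>", in declaration order.
        names = [f"{z.name}-{c}" for z in self.zones for c in z.containers]
        if len(names) != len(set(names)):
            raise ValueError("duplicate container names in data lake config")
        return names

    def abfss_uri(self, zone: str, container: str) -> str:
        # Build the ADLS Gen2 URI a Databricks mount or an ADF linked service
        # could point at. Raises if the zone/container pair is not configured.
        name = f"{zone}-{container}"
        if name not in self.container_names():
            raise KeyError(f"{name} is not defined in this deployment's config")
        return f"abfss://{name}@{self.storage_account}.dfs.core.windows.net/"


# Example deployment config; every value here is a placeholder.
example = DataLakeConfig(
    storage_account="examplelake",
    zones=(
        Zone("bronze", ("raw",)),
        Zone("silver", ("cleansed",)),
        Zone("gold", ("curated",)),
    ),
)
```

With something like this, a deployment that only needs a landing zone would declare a single `Zone`, and the infrastructure code would loop over `container_names()` rather than hard-coding the layout.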
I have started some exploration here: https://github.com/UCLH-Foundry/FlowEHR/compare/main...marrobi:FlowEHR:marrobi/adls-testing
@nels @tanya-borisova @damoodamoo @anastasiakuzn thoughts? I might look to do a PR based off the exploration work.