National-COVID-Cohort-Collaborative / Data-Ingestion-and-Harmonization

Data Ingestion and Harmonization

Palantir Data Characterization High Level Pipeline for each LDS #63

Closed: hlehmann17 closed this issue 2 years ago

hlehmann17 commented 3 years ago
| Action | Code |
| --- | --- |
| Apply DQ Checks | The DI&H Repository |
| Apply Transformations | Formulas, Aggregations |
| Inform site vs DI&H group | |
| Join into LDS DataStore [includes updating the DataStore version number] | |
| Regenerate Safe Harbor DataStore [includes updating the SH version number] | |
| Inform downstream clients of the update build tag / release version | |
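
The actions in the table could be strung together roughly as in the minimal Python sketch below. Every function, class, and field name here is a hypothetical placeholder used only to show the ordering of the steps; it is not the actual Foundry transform code in the DI&H repository.

```python
# Hypothetical sketch of the per-LDS pipeline steps from the table above.
# All names are placeholders, not code from the DI&H repository.
from dataclasses import dataclass, field


@dataclass
class DataStore:
    name: str
    version: int = 0
    rows: list = field(default_factory=list)


def apply_dq_checks(payload: list) -> list:
    """Drop records that fail a basic data-quality check (placeholder rule)."""
    return [r for r in payload if r.get("person_id") is not None]


def apply_transformations(payload: list) -> list:
    """Apply formulas/aggregations (placeholder: tag each record as harmonized)."""
    return [{**r, "harmonized": True} for r in payload]


def join_into_lds(lds: DataStore, payload: list) -> DataStore:
    """Join the site payload into the LDS DataStore and bump its version number."""
    lds.rows.extend(payload)
    lds.version += 1
    return lds


def regenerate_safe_harbor(lds: DataStore, sh: DataStore) -> DataStore:
    """Rebuild the Safe Harbor DataStore from the LDS and bump the SH version."""
    sh.rows = [{k: v for k, v in r.items() if k != "person_id"} for r in lds.rows]
    sh.version += 1
    return sh


def run_pipeline(site_payload: list, lds: DataStore, sh: DataStore) -> str:
    checked = apply_dq_checks(site_payload)
    # "Inform site vs DI&H group" would happen here when checks fail.
    transformed = apply_transformations(checked)
    join_into_lds(lds, transformed)
    regenerate_safe_harbor(lds, sh)
    # Inform downstream clients of the new build tag / release version.
    return f"release-lds-v{lds.version}-sh-v{sh.version}"


if __name__ == "__main__":
    lds = DataStore("LDS")
    sh = DataStore("SafeHarbor")
    payload = [{"person_id": 1, "value": 10}, {"person_id": None, "value": 5}]
    print(run_pipeline(payload, lds, sh))  # e.g. release-lds-v1-sh-v1
```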
stephanieshong commented 2 years ago
  1. Data is submitted to the sFTP location.
  2. The zip file gets synced and loaded into the Enclave for data transformation.
  3. The schedule kicks off the pipeline code based on the site's CDM. We currently support five CDMs (OMOP, PCORnet, TriNetX, ACT, PEDSnet); see the sketch after this list.
  4. Data health checks and failures get logged in the issue that is created for each site.
  5. Human intervention is needed to fill in the details of the issues that get created.
  6. Data quality checks are performed via the DQP, and the committee determines whether the site's data passed the DQP gates.
  7. Each release site gets a manual update on whether it is included in the Release LDS in the union pipeline build that runs overnight.
  8. Based on the DQP data quality results, a site can be retracted from the release state and pulled out of the LDS/Safe Harbor datasets.
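
For illustration, the numbered steps above could be sketched as the Python stub below. The CDM names come from the comment, but every function, path, and rule is a hypothetical placeholder rather than the actual Enclave scheduler, issue tracker, or DQP implementation.

```python
# Hypothetical sketch of the ingestion flow described in the steps above.
# The CDM names are real; everything else is a placeholder.
SUPPORTED_CDMS = {"OMOP", "PCORnet", "TriNetX", "ACT", "PEDSnet"}


def load_site_zip(sftp_path: str) -> dict:
    """Steps 1-2: pretend to sync a zip from sFTP and read its manifest."""
    # A real run would unpack the zip and parse the CDM payload files.
    return {"site": "site_001", "cdm": "OMOP", "rows": [{"person_id": 1}]}


def run_cdm_pipeline(payload: dict) -> list[str]:
    """Steps 3-4: run the CDM-specific pipeline and collect health-check failures."""
    if payload["cdm"] not in SUPPORTED_CDMS:
        return [f"unsupported CDM: {payload['cdm']}"]
    return ["null person_id" for r in payload["rows"] if r.get("person_id") is None]


def log_issue(site: str, failures: list[str]) -> None:
    """Steps 4-5: record failures on the per-site issue for human follow-up."""
    for f in failures:
        print(f"[issue:{site}] {f}")


def dqp_gate(site: str, failures: list[str]) -> bool:
    """Step 6: stand-in for the committee's DQP decision (here: no failures)."""
    return not failures


def nightly_union_build(included_sites: list[str]) -> str:
    """Step 7: union only the sites that passed the DQP gate into the release LDS."""
    return f"Release LDS built from: {', '.join(included_sites) or 'no sites'}"


if __name__ == "__main__":
    payload = load_site_zip("/sftp/site_001/upload.zip")
    failures = run_cdm_pipeline(payload)
    log_issue(payload["site"], failures)
    included = [payload["site"]] if dqp_gate(payload["site"], failures) else []
    # Step 8: a site that later fails DQP can simply be dropped from `included`
    # and the LDS / Safe Harbor datasets rebuilt without it.
    print(nightly_union_build(included))
```

The point of the gating shape is that inclusion in the nightly union build is just a list membership decision, so retracting a site (step 8) only requires removing it from the included set and rerunning the build.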