National-COVID-Cohort-Collaborative / Data-Ingestion-and-Harmonization

Data Ingestion and Harmonization

Palantir Data Characterization High Level Pipeline for each LDS #63

Closed: hlehmann17 closed this issue 2 years ago

hlehmann17 commented 3 years ago
| Action | Code |
| --- | --- |
| Apply DQ Checks | The DI&H Repository |
| Apply Transformations | Formulas, Aggregations |
| Inform site vs DI&H group | |
| Join into LDS DataStore [includes updating the DataStore version number] | |
| Regenerate Safe Harbor DataStore [includes updating the SH version number] | |
| Inform downstream clients of the update build tag / release version | |
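
The actions in the table could be strung together roughly as in the minimal Python sketch below. Every function, class, and field name here is a hypothetical placeholder used only to show the ordering of the steps; it is not the actual Foundry transform code in the DI&H repository.

```python
# Hypothetical sketch of the per-LDS pipeline steps from the table above.
# All names are placeholders, not code from the DI&H repository.
from dataclasses import dataclass, field


@dataclass
class DataStore:
    name: str
    version: int = 0
    rows: list = field(default_factory=list)


def apply_dq_checks(payload: list) -> list:
    """Drop records that fail a basic data-quality check (placeholder rule)."""
    return [r for r in payload if r.get("person_id") is not None]


def apply_transformations(payload: list) -> list:
    """Apply formulas/aggregations (placeholder: tag each record as harmonized)."""
    return [{**r, "harmonized": True} for r in payload]


def join_into_lds(lds: DataStore, payload: list) -> DataStore:
    """Join the site payload into the LDS DataStore and bump its version number."""
    lds.rows.extend(payload)
    lds.version += 1
    return lds


def regenerate_safe_harbor(lds: DataStore, sh: DataStore) -> DataStore:
    """Rebuild the Safe Harbor DataStore from the LDS and bump the SH version."""
    sh.rows = [{k: v for k, v in r.items() if k != "person_id"} for r in lds.rows]
    sh.version += 1
    return sh


def run_pipeline(site_payload: list, lds: DataStore, sh: DataStore) -> str:
    checked = apply_dq_checks(site_payload)
    # "Inform site vs DI&H group" would happen here when checks fail.
    transformed = apply_transformations(checked)
    join_into_lds(lds, transformed)
    regenerate_safe_harbor(lds, sh)
    # Inform downstream clients of the new build tag / release version.
    return f"release-lds-v{lds.version}-sh-v{sh.version}"


if __name__ == "__main__":
    lds = DataStore("LDS")
    sh = DataStore("SafeHarbor")
    payload = [{"person_id": 1, "value": 10}, {"person_id": None, "value": 5}]
    print(run_pipeline(payload, lds, sh))  # e.g. release-lds-v1-sh-v1
```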
stephanieshong commented 2 years ago
  1. Data is submitted to the sFTP location.
  2. The zip file gets synced and loaded into the Enclave for data transformation.
  3. The schedule kicks off the pipeline code based on the site's CDM. We currently support five CDMs (OMOP, PCORnet, TriNetX, ACT, PEDSnet); see the sketch after this list.
  4. Data health checks and failures get logged in the issue that is created for each site.
  5. Human intervention is needed to fill in the details of the issues that get created.
  6. Data quality checks are performed via the DQP, and the committee determines whether the site's data passed the DQP gates.
  7. Each release site gets a manual update on whether it is included in the Release LDS in the union pipeline build that runs overnight.
  8. Based on the DQP data quality results, a site can be retracted from the release state and pulled out of the LDS/Safe Harbor datasets.
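
For illustration, the numbered steps above could be sketched as the Python stub below. The CDM names come from the comment, but every function, path, and rule is a hypothetical placeholder rather than the actual Enclave scheduler, issue tracker, or DQP implementation.

```python
# Hypothetical sketch of the ingestion flow described in the steps above.
# The CDM names are real; everything else is a placeholder.
SUPPORTED_CDMS = {"OMOP", "PCORnet", "TriNetX", "ACT", "PEDSnet"}


def load_site_zip(sftp_path: str) -> dict:
    """Steps 1-2: pretend to sync a zip from sFTP and read its manifest."""
    # A real run would unpack the zip and parse the CDM payload files.
    return {"site": "site_001", "cdm": "OMOP", "rows": [{"person_id": 1}]}


def run_cdm_pipeline(payload: dict) -> list[str]:
    """Steps 3-4: run the CDM-specific pipeline and collect health-check failures."""
    if payload["cdm"] not in SUPPORTED_CDMS:
        return [f"unsupported CDM: {payload['cdm']}"]
    return ["null person_id" for r in payload["rows"] if r.get("person_id") is None]


def log_issue(site: str, failures: list[str]) -> None:
    """Steps 4-5: record failures on the per-site issue for human follow-up."""
    for f in failures:
        print(f"[issue:{site}] {f}")


def dqp_gate(site: str, failures: list[str]) -> bool:
    """Step 6: stand-in for the committee's DQP decision (here: no failures)."""
    return not failures


def nightly_union_build(included_sites: list[str]) -> str:
    """Step 7: union only the sites that passed the DQP gate into the release LDS."""
    return f"Release LDS built from: {', '.join(included_sites) or 'no sites'}"


if __name__ == "__main__":
    payload = load_site_zip("/sftp/site_001/upload.zip")
    failures = run_cdm_pipeline(payload)
    log_issue(payload["site"], failures)
    included = [payload["site"]] if dqp_gate(payload["site"], failures) else []
    # Step 8: a site that later fails DQP can simply be dropped from `included`
    # and the LDS / Safe Harbor datasets rebuilt without it.
    print(nightly_union_build(included))
```

The point of the gating shape is that inclusion in the nightly union build is just a list membership decision, so retracting a site (step 8) only requires removing it from the included set and rerunning the build.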