US-GHG-Center / ghgc-docs

https://us-ghg-center.github.io/ghgc-docs/

Automated transformation pipeline #105

Closed SwordSaintLancelot closed 1 month ago

SwordSaintLancelot commented 1 month ago

Description

Transformation scripts have so far been written manually for various datasets. We want to automate the transformation process and push the transformed data to a specified target bucket on S3. This was started in a previous ticket mentioned here. Because the ODIAC dataset is (to date) an exception in terms of data size, we will focus on a dataset with smaller files, which lets us work within the resource constraints.

In this sprint, I will focus on running the transformation pipeline for the CMS flux dataset. The transformation itself is already done, and the COGs (Cloud-Optimized GeoTIFFs) for the dataset already exist. Because the files in this dataset are quite small, I am using it to test-run the pipeline.
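
To make the intended flow concrete, here is a minimal sketch of the "transform, then push to the target bucket" loop described above. It is not the actual DAG code: `run_pipeline`, `target_key`, `TransformResult`, the `.tif` naming, and the bucket and prefix values are all hypothetical, and the transform and upload steps are passed in as plain callables so the loop can be test-run without touching S3.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class TransformResult:
    """Records where one source file ended up (hypothetical helper type)."""
    source_key: str
    target_uri: str


def target_key(source_key: str, prefix: str) -> str:
    """Map a source file key to the key of its COG under `prefix`.

    Assumption: the output keeps the source file's stem and gets a .tif suffix.
    """
    stem = PurePosixPath(source_key).stem
    return f"{prefix.strip('/')}/{stem}.tif"


def run_pipeline(
    source_keys: Iterable[str],
    transform: Callable[[str], bytes],
    upload: Callable[[str, str, bytes], None],
    bucket: str,
    prefix: str,
) -> List[TransformResult]:
    """Transform each source file and push the result to the target bucket."""
    results = []
    for key in source_keys:
        payload = transform(key)         # dataset-specific transformation script
        dest = target_key(key, prefix)   # where the COG should live in the bucket
        upload(bucket, dest, payload)    # injected, so tests can use a fake
        results.append(TransformResult(key, f"s3://{bucket}/{dest}"))
    return results
```

In a real run, `upload` could be a thin wrapper around a boto3 S3 client's `put_object(Bucket=..., Key=..., Body=...)`, and `run_pipeline` could be called from a task in the DAG. For a dry run on a small dataset, a fake `upload` that only records its arguments is enough to check the key mapping.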

Acceptance Criteria

SwordSaintLancelot commented 1 month ago

PR for the transformation DAG. The next steps will use the same DAG configuration, so work will continue in this PR.