Transformation scripts have been created manually for various datasets. We want to start automating the transformation process and pushing the resulting data to a specified target bucket on S3. This work was started in a previous ticket, mentioned here.
Since the ODIAC dataset is (to date) an exception with respect to data size, we will focus on a dataset with smaller files, which lets us work within the resource constraints.
In this sprint, I will focus on running the transformation pipeline for the CMS flux dataset. The transformation itself is already done and we have the COGs for this dataset; because its files are quite small, it is a good candidate for a test run of the pipeline.
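As a rough outline of the fetch → transform → push flow described above, the sketch below shows the shape of the pipeline. The bucket names, the `target_key` naming convention, and the placeholder step bodies are all assumptions for illustration, not the actual DAG code; in the real pipeline the steps would use boto3 downloads, the existing transformation script, and boto3 uploads.

```python
# Hypothetical sketch only -- names and layout are assumptions, not the real DAG.

def target_key(source_key: str, target_prefix: str = "cms-flux-cogs") -> str:
    """Map a source object key to the key its COG would get in the target bucket.

    The prefix and .tif extension are illustrative assumptions.
    """
    stem = source_key.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    return f"{target_prefix}/{stem}.tif"

def run_pipeline(source_bucket: str, source_keys: list[str], target_bucket: str) -> list[str]:
    """Outline of the three pipeline stages: fetch, transform, push.

    The stage bodies are placeholders; only the key mapping is implemented here.
    """
    pushed = []
    for key in source_keys:
        # fetch:     s3.download_file(source_bucket, key, local_path)
        # transform: run the dataset's transformation script -> COG on disk
        # push:      s3.upload_file(cog_path, target_bucket, target_key(key))
        pushed.append(target_key(key))
    return pushed
```

For example, `run_pipeline("src-bucket", ["cms/flux_2020.nc"], "dst-bucket")` would report `["cms-flux-cogs/flux_2020.tif"]` as the pushed keys.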
Acceptance Criteria
[x] Run the DAG for an entire pipeline which consists of fetching the data from a bucket, transforming them, and pushing them back to a target bucket.
[x] Check the COGs created.
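For checking the created COGs, full validation would typically go through something like rio-cogeo's `cog_validate` (which verifies tiling and overviews). As a minimal, dependency-free sanity check, one can at least confirm each output starts with the TIFF magic bytes, since every COG is a TIFF. The helpers below are hypothetical, not part of the actual pipeline:

```python
def looks_like_tiff(header: bytes) -> bool:
    """Cheap sanity check: a COG is a TIFF, so the file must start with the
    little-endian (II*\x00) or big-endian (MM\x00*) TIFF magic bytes.
    This does NOT verify COG layout (tiling, overviews); use a proper
    validator such as rio-cogeo's cog_validate for that."""
    return header[:4] in (b"II*\x00", b"MM\x00*")

def check_cog_headers(paths):
    """Return the paths whose first four bytes are not TIFF magic."""
    bad = []
    for path in paths:
        with open(path, "rb") as f:
            if not looks_like_tiff(f.read(4)):
                bad.append(path)
    return bad
```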
Stretch criteria
[ ] Calculate statistics for the data (efficiently).
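One way to make the statistics calculation efficient is a one-pass streaming accumulator (Welford's algorithm), so chunks or raster windows can be processed as they are read instead of loading the whole dataset into memory. This is a sketch of that approach, not the chosen implementation:

```python
import math

class RunningStats:
    """One-pass (Welford) accumulator: min, max, mean, std without holding
    all values in memory -- feed it one chunk/window of pixels at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0          # running sum of squared deviations from the mean
        self.min = math.inf
        self.max = -math.inf

    def update(self, values):
        for x in values:
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)
            self.min = min(self.min, x)
            self.max = max(self.max, x)

    @property
    def std(self):
        # Population standard deviation over all values seen so far.
        return math.sqrt(self.m2 / self.n) if self.n else 0.0
```

In practice each `update` call would receive a flattened window read from the COG (for example, one internal block at a time), so memory use stays bounded by the window size.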