Background
This project aims to build a complete cloud-based DevOps ETL process. It includes the following parts:
AWS
- Cloud Infrastructure
- Jenkins on ECS
- Airflow on EKS
- Airflow framework (wrapper)
- Jenkins DevOps Pipeline
- Glue ETL Common Solution
- Multi-account architecture
Power BI
- Front end development & design
- Backend development & design
- DB development & design
Azure
- User/Role Management Architecture
- Network/Security Architecture
- DevOps Architecture
- Infrastructure Level DevOps
- Project Level DevOps
- Project Architecture
- ETL framework/solution
- Data Visualization (Power BI)
Project Name
Cloud-based ETL DevOps process of Community = CEDC
Project Directory
Project Wiki
Project Sprint
Sprint
Architecture
Basic logic flow
Cloud Infrastructure
Account distribution
- DevOps Account: hosts the DevOps tooling, mainly Jenkins and Airflow
- Data Account: the data lake account, mainly S3
- Serverless Account: the ETL account, mainly Glue, Lambda, etc.
- IDP Account: the identity account, from which users assume a User or Admin role in accounts A/B/C
Jenkins Infrastructure
Note: for the first draft, all services can be deployed centrally into one account for demo purposes.
Airflow framework
Features
- Parameter driven framework
- Dependency check
- Kickoff
- Monitor
- Job Retry
- Notify
- Metadata backend
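The features above can be illustrated with a minimal sketch in plain Python. It is not the project's wrapper: `JobParams`, `run_job`, and the injected `kickoff`, `get_status`, and `notify` callables are hypothetical names, and a plain dict stands in for the metadata backend. In a real deployment these callables would presumably wrap Airflow tasks and Glue API calls.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class JobParams:
    """Hypothetical parameter record that drives one job run."""
    name: str
    depends_on: List[str] = field(default_factory=list)
    max_retries: int = 2          # retries after the first attempt
    poll_seconds: float = 0.0     # delay between status checks


def run_job(
    params: JobParams,
    metadata: Dict[str, str],          # stand-in metadata backend: job name -> last status
    kickoff: Callable[[str], str],     # starts the job and returns a run id
    get_status: Callable[[str], str],  # returns "RUNNING", "SUCCEEDED" or "FAILED"
    notify: Callable[[str], None],     # sends a message (mail, chat, ...)
) -> str:
    # Dependency check: every upstream job must have succeeded.
    missing = [d for d in params.depends_on if metadata.get(d) != "SUCCEEDED"]
    if missing:
        notify(f"{params.name}: skipped, waiting on {missing}")
        return "SKIPPED"

    status = "FAILED"
    for attempt in range(1, params.max_retries + 2):
        run_id = kickoff(params.name)           # kickoff
        status = get_status(run_id)
        while status == "RUNNING":              # monitor
            time.sleep(params.poll_seconds)
            status = get_status(run_id)
        if status == "SUCCEEDED":
            break
        notify(f"{params.name}: attempt {attempt} failed")  # retry on next loop

    metadata[params.name] = status              # record the outcome
    notify(f"{params.name}: final status {status}")
    return status
```

Passing the side-effecting steps in as callables keeps the control flow testable without an Airflow or AWS environment.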
Jenkins DevOps Pipeline
Features
- Deploy the project's Airflow DAGs and Glue jobs
- Onboarding/Offboarding
- Data validation
- Convert SQL to Glue PySpark
Glue ETL jobs
Account prerequisite
A standard AWS serverless account with the following services:
- Glue
- Lambda
- S3
- CloudWatch Events
- CloudWatch Logs
- Secrets Manager
- wip ...
Glue
Glue job naming standard:
- __prelanding
- __landing
- __landing_merge
- __refinement
- __publish
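A naming standard like this can be checked automatically, for example in the deployment pipeline. The sketch below is an assumption-laden illustration: the part of the name before `__` is not specified in this section, so it is treated as an opaque, hypothetical `prefix`, and `glue_job_name` and `stage_of` are made-up helper names. Only the five stage suffixes come from the list above.

```python
import re

# Stage suffixes taken from the naming standard above.
STAGE_SUFFIXES = ("prelanding", "landing", "landing_merge", "refinement", "publish")

# Assumption: the prefix is lowercase words joined by single underscores.
_PREFIX_RE = re.compile(r"[a-z0-9]+(?:_[a-z0-9]+)*")


def glue_job_name(prefix: str, stage: str) -> str:
    """Join a hypothetical prefix and a stage suffix with a double underscore."""
    if stage not in STAGE_SUFFIXES:
        raise ValueError(f"unknown stage: {stage!r}")
    if not _PREFIX_RE.fullmatch(prefix):
        raise ValueError(f"invalid prefix: {prefix!r}")
    return f"{prefix}__{stage}"


def stage_of(job_name: str) -> str:
    """Return the stage suffix of a job name, or raise if it does not follow the standard."""
    _, sep, stage = job_name.rpartition("__")
    if not sep or stage not in STAGE_SUFFIXES:
        raise ValueError(f"job name does not follow the standard: {job_name!r}")
    return stage
```

For example, `glue_job_name("sales_orders", "landing_merge")` yields `sales_orders__landing_merge`, and `stage_of` maps that name back to `landing_merge`.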
IAM Roles Management
- Serverless Account: Glue job execution role -> DEVOPS_GLUE_CEDC_EXECUTION (a cross-account role that lets Airflow trigger Glue jobs in Account C)
- DevOps Account: DEVOPS_GLUE_CEDC_READ/DEVOPS_GLUE_CEDC_ADMIN (read-only or admin)
- IDP Account: CI/CD role -> DEVOPS_CICD_CEDC (assumes admin access to all accounts for now)
- Data Account: DEVOPS_S3_CEDC_READ/DEVOPS_S3_CEDC_ADMIN
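The cross-account trigger described for DEVOPS_GLUE_CEDC_EXECUTION could look roughly like the sketch below. It assumes the usual STS pattern: assume the role in the serverless account, then call Glue with the temporary credentials. The function names, the session name `cedc-airflow`, and the injected `sts_client` and `make_glue_client` arguments are illustrative assumptions. The trust policy that allows the caller to assume the role is also assumed to exist and is not shown.

```python
from typing import Any, Callable


def role_arn(account_id: str, role_name: str) -> str:
    """Build the IAM role ARN for a role in the given account."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def start_glue_job_cross_account(
    sts_client: Any,                       # e.g. boto3.client("sts")
    make_glue_client: Callable[..., Any],  # e.g. lambda **kw: boto3.client("glue", **kw)
    account_id: str,                       # ID of the serverless account
    job_name: str,
    role_name: str = "DEVOPS_GLUE_CEDC_EXECUTION",
) -> str:
    """Assume the execution role, start the Glue job, and return its run id."""
    response = sts_client.assume_role(
        RoleArn=role_arn(account_id, role_name),
        RoleSessionName="cedc-airflow",    # hypothetical session name
    )
    creds = response["Credentials"]
    # Create a Glue client that uses the temporary cross-account credentials.
    glue = make_glue_client(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    return glue.start_job_run(JobName=job_name)["JobRunId"]
```

With boto3 installed, the clients can be supplied as `boto3.client("sts")` and `lambda **kw: boto3.client("glue", **kw)`. Injecting them keeps the function testable with fakes.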
OpenAI