18F / identity-analytics-etl

ETL and SQL scripts for Login.gov data warehouse and business intelligence

ETL Pipeline needs a Worker for unprocessed data files #195

Closed MacHu-GWU closed 5 years ago

MacHu-GWU commented 5 years ago

User story

I would like to improve the data quality in Redshift.

Currently the ETL pipeline is triggered by the S3 object-creation event. If the parser fails on a data file, we leave nothing in the Hot Bucket. We need another worker to re-trigger the parser for any unprocessed data files.

The reason I don't recommend adding retry to the parser is that sometimes a retry simply cannot fix the problem.
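The core of such a worker is finding the files that never made it through: keys that exist in the source bucket but have no corresponding object in the Hot Bucket. A minimal sketch of that check, assuming hypothetical names (`find_unprocessed`, and plain key lists standing in for the real S3 listings):

```python
def find_unprocessed(source_keys, hot_keys):
    """Return source-bucket keys with no matching object in the Hot Bucket.

    These are the files whose event-driven parse never completed, so the
    redo worker should re-submit each of them to the parser.
    """
    # Set difference: anything in the source bucket not mirrored in the
    # Hot Bucket is an unprocessed file. Sorted for deterministic output.
    return sorted(set(source_keys) - set(hot_keys))


def redo_worker(source_keys, hot_keys, trigger_parser):
    """Re-trigger the parser for every unprocessed file."""
    pending = find_unprocessed(source_keys, hot_keys)
    for key in pending:
        trigger_parser(key)  # e.g. re-post the S3 event or invoke the Lambda
    return pending
```

In practice the key lists would come from paginated `list_objects_v2` calls against the two buckets; the sketch isolates just the comparison so the worker stays independent of why the original parse failed.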

Notes

Acceptance Criteria

Tasks to complete the story

Definition of Done

lauraGgit commented 5 years ago

@MacHu-GWU if we move to kinesis, will we still need this?

MacHu-GWU commented 5 years ago

I think we would still need it, in a different way. If the Kinesis ETL pipeline breaks, we still need something to trigger it for the missing date-time range after we fix it.

If an ETL system is not 100% stable, we need this REDO mechanism anyway.
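For the Kinesis case, the REDO mechanism reduces to computing which time windows were missed between the last successfully processed hour and now, then re-running the ETL for each. A minimal sketch under that assumption, with a hypothetical helper name (`missing_hourly_windows`) and hourly batch windows:

```python
from datetime import datetime, timedelta

def missing_hourly_windows(last_done, now):
    """Return (start, end) hour windows between the last successfully
    processed hour and now.

    The redo worker re-runs the ETL once per window after the broken
    pipeline is fixed.
    """
    # First window to redo starts at the hour after the last completed one.
    start = last_done.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    windows = []
    # Only emit fully elapsed hours; a partial current hour is left for
    # the live pipeline to handle.
    while start + timedelta(hours=1) <= now:
        windows.append((start, start + timedelta(hours=1)))
        start += timedelta(hours=1)
    return windows
```

For example, if the pipeline last completed the 10:00 hour and was fixed at 13:00, the worker would redo the 11:00 and 12:00 windows.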

lauraGgit commented 5 years ago

ok, that's helpful

MacHu-GWU commented 5 years ago

For example, every time the current ETL pipeline breaks, DevOps folks like Andy need to spend a lot of time redoing the missing data by hand. It should be automated.

lauraGgit commented 5 years ago

migrated to jira