18F / identity-analytics-etl

ETL and SQL scripts for Login.gov data warehouse and business intelligence

ETL Pipeline needs a Worker for unprocessed data files #195

Closed MacHu-GWU closed 5 years ago

MacHu-GWU commented 5 years ago

User story

I would like to improve the data quality in Redshift.

Currently the ETL pipeline is triggered by the S3 object-creation event. If the parser fails on a data file, we leave nothing in the Hot Bucket. We need another worker to re-trigger the parser for any unprocessed data files.

The reason I don't recommend adding retry to the parser is that sometimes a retry simply cannot fix the problem.
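The core of such a worker is finding the files that never made it through: keys that exist in the source bucket but have no corresponding object in the Hot Bucket. A minimal sketch of that check, assuming hypothetical names (`find_unprocessed`, and plain key lists standing in for the real S3 listings):

```python
def find_unprocessed(source_keys, hot_keys):
    """Return source-bucket keys with no matching object in the Hot Bucket.

    These are the files whose event-driven parse never completed, so the
    redo worker should re-submit each of them to the parser.
    """
    # Set difference: anything in the source bucket not mirrored in the
    # Hot Bucket is an unprocessed file. Sorted for deterministic output.
    return sorted(set(source_keys) - set(hot_keys))


def redo_worker(source_keys, hot_keys, trigger_parser):
    """Re-trigger the parser for every unprocessed file."""
    pending = find_unprocessed(source_keys, hot_keys)
    for key in pending:
        trigger_parser(key)  # e.g. re-post the S3 event or invoke the Lambda
    return pending
```

In practice the key lists would come from paginated `list_objects_v2` calls against the two buckets; the sketch isolates just the comparison so the worker stays independent of why the original parse failed.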

Notes

Acceptance Criteria

Tasks to complete the story

Definition of Done

lauraGgit commented 5 years ago

@MacHu-GWU if we move to kinesis, will we still need this?

MacHu-GWU commented 5 years ago

I think we would still need it, in a different way. If the Kinesis ETL pipeline breaks, we still need something to trigger it for the missing date-time range after we fix it.

If an ETL system is not 100% stable, we need this REDO mechanism anyway.
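For the Kinesis case, the REDO mechanism reduces to computing which time windows were missed between the last successfully processed hour and now, then re-running the ETL for each. A minimal sketch under that assumption, with a hypothetical helper name (`missing_hourly_windows`) and hourly batch windows:

```python
from datetime import datetime, timedelta

def missing_hourly_windows(last_done, now):
    """Return (start, end) hour windows between the last successfully
    processed hour and now.

    The redo worker re-runs the ETL once per window after the broken
    pipeline is fixed.
    """
    # First window to redo starts at the hour after the last completed one.
    start = last_done.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    windows = []
    # Only emit fully elapsed hours; a partial current hour is left for
    # the live pipeline to handle.
    while start + timedelta(hours=1) <= now:
        windows.append((start, start + timedelta(hours=1)))
        start += timedelta(hours=1)
    return windows
```

For example, if the pipeline last completed the 10:00 hour and was fixed at 13:00, the worker would redo the 11:00 and 12:00 windows.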

lauraGgit commented 5 years ago

ok, that's helpful

MacHu-GWU commented 5 years ago

For example, every time the current ETL pipeline breaks, DevOps folks like Andy need to spend a lot of time redoing the missing data by hand. It should be automated.

lauraGgit commented 5 years ago

migrated to jira