Description
InsuranceLake allows customers to Collect, Cleanse, and Consume their insurance data with seven AWS services.
It is a data lake and pipeline reference architecture built to process batch files by mapping source columns to target columns, transforming each column, and applying data quality rules. It is based on the Olympic data lake pattern (Bronze, Silver, Gold), which we call Collect, Cleanse & Curate, and Consume.
The most common batch file data sources are large delimited text files, Excel files, and fixed-width files. InsuranceLake can be extended to accept change data capture, streaming, and document data sources.
Each incoming data source (for example, a CSV file of commercial auto policies from broker abc) is intended to be paired with a mapping, transform, data quality, and, optionally, an entity match instruction file. These instruction files are not mandatory; InsuranceLake creates default ones if none are provided. Incoming data files are placed in the Collect layer, which triggers a workflow that runs the mapping, transform, data quality, and entity match processes and stores the results in the Cleanse layer. Rows that fail data quality rules marked as quarantine are routed to quarantine tables. Finally, a set of Apache Spark SQL and Amazon Athena SQL files can be run to populate the Consume layer.
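To make the Collect-to-Cleanse workflow above concrete, here is a minimal PySpark sketch of the same idea: a column mapping is applied to a raw batch file, and rows that fail a quarantine-level data quality rule are split out. This is a conceptual illustration only; the bucket paths, column names, and instruction layout are invented for the example and are not the actual InsuranceLake APIs or schemas, which drive these steps from per-source instruction files.

```python
# Conceptual sketch only: illustrates the Collect -> Cleanse flow described above
# using plain PySpark. Paths, column names, and the instruction layout are
# hypothetical examples, not the actual InsuranceLake implementation.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("collect-to-cleanse-sketch").getOrCreate()

# Hypothetical mapping/transform/data quality instructions that would normally be
# supplied by the instruction files paired with the incoming source.
instructions = {
    "mapping": {"POL_NO": "policy_number", "EFF_DT": "effective_date"},
    "quarantine_rule": "policy_number IS NOT NULL",
}

# Collect layer: read the raw delimited batch file as delivered by the broker.
raw = spark.read.option("header", True).csv("s3://collect-bucket/abc/commercial_auto.csv")

# Mapping and transform: rename source columns to target columns and cast types.
cleansed = raw
for source_col, target_col in instructions["mapping"].items():
    cleansed = cleansed.withColumnRenamed(source_col, target_col)
cleansed = cleansed.withColumn("effective_date", F.to_date("effective_date", "yyyy-MM-dd"))

# Data quality: rows failing a quarantine-level rule go to a quarantine table;
# the remaining rows land in the Cleanse layer.
passing = cleansed.filter(instructions["quarantine_rule"])
quarantined = cleansed.filter(f"NOT ({instructions['quarantine_rule']})")

passing.write.mode("append").parquet("s3://cleanse-bucket/commercial_auto/")
quarantined.write.mode("append").parquet("s3://cleanse-bucket/quarantine/commercial_auto/")
```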
language
English
runtime
Python
Level
200
Type
Application
Use case
Backend
Primary image
https://raw.githubusercontent.com/aws-samples/aws-insurancelake-etl/main/resources/insurancelake-highlevel-architecture.png
IaC framework
AWS CDK
AWS Serverless services used
Description headline
InsuranceLake ETL with CDK Pipeline
Repo URL
https://github.com/aws-samples/aws-insurancelake-etl
Additional resources
https://github.com/aws-samples/aws-insurancelake-infrastructure
https://catalog.us-east-1.prod.workshops.aws/workshops/c556569f-5a26-494f-88e1-bac5a55adf2a
Author Name
Multiple authors, names in setup.py
Author Image URL
No response
Author Bio
No response
Author Twitter handle
No response
Author LinkedIn URL
No response