Description
InsuranceLake allows customers to Collect, Cleanse, and Consume their insurance data with seven AWS services.
It is a data lake and pipeline reference architecture built to process batch files by mapping source columns to target columns, transforming each column, and applying data quality rules. It is based on the Olympic data lake pattern (Bronze, Silver, Gold), which we call Collect, Cleanse & Curate, and Consume.
The most common batch file data sources are large delimited text files, Excel files, and fixed-width files. InsuranceLake can be extended to accept change data capture, streaming, and document data sources.
Each incoming data source (for example, a CSV file of commercial auto policies from broker abc) is intended to be paired with a mapping, transform, data quality, and, optionally, an entity match instruction file. These instruction files are not mandatory; InsuranceLake creates default ones if none are provided. Incoming data files are placed in the Collect layer, which triggers a workflow that runs the mapping, transform, data quality, and entity match processes and stores the results in the Cleanse layer. Rows that fail data quality rules marked as quarantine are routed to quarantine tables. Finally, a set of Apache Spark SQL and Amazon Athena SQL files can be run to populate the Consume layer.
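To make the Collect-to-Cleanse workflow above concrete, here is a minimal PySpark sketch of the same idea: a column mapping is applied to a raw batch file, and rows that fail a quarantine-level data quality rule are split out. This is a conceptual illustration only; the bucket paths, column names, and instruction layout are invented for the example and are not the actual InsuranceLake APIs or schemas, which drive these steps from per-source instruction files.

```python
# Conceptual sketch only: illustrates the Collect -> Cleanse flow described above
# using plain PySpark. Paths, column names, and the instruction layout are
# hypothetical examples, not the actual InsuranceLake implementation.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("collect-to-cleanse-sketch").getOrCreate()

# Hypothetical mapping/transform/data quality instructions that would normally be
# supplied by the instruction files paired with the incoming source.
instructions = {
    "mapping": {"POL_NO": "policy_number", "EFF_DT": "effective_date"},
    "quarantine_rule": "policy_number IS NOT NULL",
}

# Collect layer: read the raw delimited batch file as delivered by the broker.
raw = spark.read.option("header", True).csv("s3://collect-bucket/abc/commercial_auto.csv")

# Mapping and transform: rename source columns to target columns and cast types.
cleansed = raw
for source_col, target_col in instructions["mapping"].items():
    cleansed = cleansed.withColumnRenamed(source_col, target_col)
cleansed = cleansed.withColumn("effective_date", F.to_date("effective_date", "yyyy-MM-dd"))

# Data quality: rows failing a quarantine-level rule go to a quarantine table;
# the remaining rows land in the Cleanse layer.
passing = cleansed.filter(instructions["quarantine_rule"])
quarantined = cleansed.filter(f"NOT ({instructions['quarantine_rule']})")

passing.write.mode("append").parquet("s3://cleanse-bucket/commercial_auto/")
quarantined.write.mode("append").parquet("s3://cleanse-bucket/quarantine/commercial_auto/")
```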
language
English
runtime
Python
Level
200
Type
Application
Use case
Backend
Primary image
https://raw.githubusercontent.com/aws-samples/aws-insurancelake-etl/main/resources/insurancelake-highlevel-architecture.png
IaC framework
AWS CDK
AWS Serverless services used
Description headline
InsuranceLake ETL with CDK Pipeline
Repo URL
https://github.com/aws-samples/aws-insurancelake-etl
Additional resources
https://github.com/aws-samples/aws-insurancelake-infrastructure
https://catalog.us-east-1.prod.workshops.aws/workshops/c556569f-5a26-494f-88e1-bac5a55adf2a
Author Name
Multiple authors, names in setup.py
Author Image URL
No response
Author Bio
No response
Author Twitter handle
No response
Author LinkedIn URL
No response