dacort / damons-data-lake

All the code related to building my own data lake

Build your own data lake

How to build your own cloud-native data lake using AWS with typical data ingest patterns.

Overview

This project uses managed services from AWS to ingest and store various sources of streaming and batch data.

It uses several personal data sources including my own website, nerdy quantified self metrics, and Apple Health data.

Here are the different components used:

You can see what this looks like in the architecture diagram.

Getting started

Currently, I provision this all manually using the AWS console. I will eventually make a CloudFormation template.

Set up the following services - details for each one are below.

S3 Bucket

Kinesis

Glue

The source code for the Glue job that fetches GitHub stats data is in src/github-stats.py. An egg file needs to be built from it and uploaded to S3. I'll add more detail soon. :)

Generating Data

Website

I currently build and deploy the website in the web directory using CodePipeline and CloudFront. This is not required for this demo, but I use it as a source of S3 and CloudFront access logs. :)

Laptop metrics

I generate a bunch of nerdy metrics from my laptop, including email/tab counts and clipboard stats.

Currently these are scattered across a few different scripts. Here's how to run them.

Email/tab counts

These metrics are collected using AppleScript and sent up to Kinesis using the AWS CLI.

Specify the Kinesis data stream name as an environment variable and, optionally, an AWS profile:

AWS_PROFILE=profile-name STREAM_NAME=stream-name ./src/macos_data.sh
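
For a rough idea of what "sent up to Kinesis using the AWS CLI" can look like, here is a minimal sketch. It is not the contents of src/macos_data.sh: the function names and the record fields (metric, count, timestamp) are made up for illustration. It assumes AWS CLI v2, where the --cli-binary-format flag is needed to pass raw text as --data; on AWS CLI v1, drop that flag.

```shell
#!/usr/bin/env bash
# Sketch only: record fields and function names are hypothetical,
# not the schema used by src/macos_data.sh.

# Build a small JSON record for one metric reading.
build_record() {
  local metric="$1" value="$2" ts="$3"
  printf '{"metric":"%s","count":%s,"timestamp":"%s"}' "$metric" "$value" "$ts"
}

# Send one record to the stream named in STREAM_NAME.
# AWS_PROFILE, if set in the environment, is picked up by the AWS CLI.
send_record() {
  local payload="$1"
  aws kinesis put-record \
    --stream-name "${STREAM_NAME:?set STREAM_NAME}" \
    --partition-key "$(hostname)" \
    --cli-binary-format raw-in-base64-out \
    --data "$payload"
}

# Example call (not run here):
# send_record "$(build_record unread_email 42 "$(date -u +%Y-%m-%dT%H:%M:%SZ)")"
```

Using the hostname as the partition key is just one option; for a single low-volume stream any stable string would do.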

Clipboard stats

Collected using a poorly-written shell script. Run the same way as email/tab counts.

AWS_PROFILE=profile-name STREAM_NAME=stream-name ./src/clipstats.sh

Analyzing the data!

OK, now that we've got data streaming in and it's being stored in S3, let's use Amazon Athena to query it!

Take a look at a few example Athena queries.
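
If you would rather run a query from the command line than from the Athena console, something like the following sketch could work. The table and column names are hypothetical, and ATHENA_DATABASE and ATHENA_OUTPUT are assumed environment variables (a Glue/Athena database name and an s3:// location for query results); none of them come from this repo.

```shell
#!/usr/bin/env bash
# Sketch only: table name, column names and environment variables are hypothetical.

# Build a query that counts records per metric in the given table.
build_query() {
  local table="$1"
  printf 'SELECT metric, count(*) AS records FROM %s GROUP BY metric ORDER BY records DESC' "$table"
}

# Start the query in Athena and print its execution ID.
run_query() {
  local query="$1"
  aws athena start-query-execution \
    --query-string "$query" \
    --query-execution-context "Database=${ATHENA_DATABASE:?set ATHENA_DATABASE}" \
    --result-configuration "OutputLocation=${ATHENA_OUTPUT:?set ATHENA_OUTPUT}" \
    --query 'QueryExecutionId' --output text
}

# Example call (not run here):
# id="$(run_query "$(build_query laptop_metrics)")"
# aws athena get-query-results --query-execution-id "$id"
```

Athena runs queries asynchronously, so get-query-results only returns rows once the query has finished.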

Architecture Diagram

[Image: Damons Data Lake architecture diagram]

Resources

CodePipeline setup

Other ideas