How to build your own cloud-native data lake on AWS using typical data-ingest patterns.
This project uses managed services from AWS to ingest and store various sources of streaming and batch data.
It uses several personal data sources including my own website, nerdy quantified self metrics, and Apple Health data.
Here are the different components used:
You can see what this looks like in the architecture diagram.
Currently, I provision this all manually using the AWS console. I will eventually make a CloudFormation template.
Set up the following services - details for each one are below.
raw/life/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/
kinesisErrors/life/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/!{firehose:error-output-type}
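Although I set this up in the console, the same prefix settings map onto the delivery stream's extended S3 destination configuration. Here's a sketch assuming placeholder bucket, role, and stream names:

```shell
# Sketch only: the bucket, role, and stream names below are placeholders.
BUCKET_ARN="arn:aws:s3:::my-data-lake-bucket"
ROLE_ARN="arn:aws:iam::123456789012:role/firehose-delivery-role"
PREFIX='raw/life/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/'
ERROR_PREFIX='kinesisErrors/life/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/!{firehose:error-output-type}'

# Assemble the destination configuration that the console fields map onto.
CONFIG=$(cat <<EOF
{
  "RoleARN": "$ROLE_ARN",
  "BucketARN": "$BUCKET_ARN",
  "Prefix": "$PREFIX",
  "ErrorOutputPrefix": "$ERROR_PREFIX"
}
EOF
)
echo "$CONFIG"

# To create the stream with these settings (not run here):
# aws firehose create-delivery-stream \
#   --delivery-stream-name life-stream \
#   --extended-s3-destination-configuration "$CONFIG"
```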
s3://<bucket>/raw/life/
Ensure "Create a single schema for each S3 path" is checked under "Grouping behavior".
Ensure "Update all new and existing partitions with metadata from the table" is checked under "Configuration options".
For the Glue job that fetches GitHub stats data, I'll add more detail soon. :) The source code for the Glue job is in src/github-stats.py, and an egg file needs to be built and uploaded to S3.
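Those two checkboxes correspond to keys in the crawler's JSON configuration, so the same setup can be sketched with the AWS CLI. The crawler, database, and role names below are placeholders:

```shell
# This JSON is what the two console checkboxes translate to:
# CombineCompatibleSchemas = "Create a single schema for each S3 path"
# InheritFromTable = "Update all new and existing partitions with metadata from the table"
CRAWLER_CONFIG='{
  "Version": 1.0,
  "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
  "CrawlerOutput": {"Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}}
}'
echo "$CRAWLER_CONFIG"

# To create the crawler (not run here; names are placeholders):
# aws glue create-crawler \
#   --name life-crawler \
#   --role arn:aws:iam::123456789012:role/glue-crawler-role \
#   --database-name life \
#   --targets '{"S3Targets":[{"Path":"s3://<bucket>/raw/life/"}]}' \
#   --configuration "$CRAWLER_CONFIG"
```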
I currently build and deploy the website in web using CodePipeline and CloudFront. This is not required for this demo, but I use it as a source of S3 and CloudFront access logs. :)
I have a bunch of nerdy metrics that I generate from my laptop, including:
Currently these are scattered across a few different scripts. Here's how to run them.
These metrics are collected using AppleScript and simply sent to Kinesis using the AWS CLI.
Specify the Kinesis data stream name as an environment variable and, optionally, an AWS profile:
AWS_PROFILE=profile-name STREAM_NAME=stream-name ./src/macos_data.sh
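Under the hood, the scripts amount to one put-record per sample. A minimal sketch (the payload fields here are illustrative, not the scripts' actual schema):

```shell
STREAM_NAME="${STREAM_NAME:-stream-name}"
TS="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Illustrative payload; the real scripts define their own fields.
DATA="{\"metric\":\"tab_count\",\"value\":42,\"timestamp\":\"$TS\"}"
echo "$DATA"

# Ship it to Kinesis (not run here). Note: with AWS CLI v2, --data expects
# base64 unless you pass --cli-binary-format raw-in-base64-out.
# aws kinesis put-record \
#   --stream-name "$STREAM_NAME" \
#   --partition-key "tab_count" \
#   --cli-binary-format raw-in-base64-out \
#   --data "$DATA" \
#   ${AWS_PROFILE:+--profile "$AWS_PROFILE"}
```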
Collected using a (poorly written) shell script; run it the same way as the email/tab-count scripts.
AWS_PROFILE=profile-name STREAM_NAME=stream-name ./src/clipstats.sh
OK, now that we have data streaming in and being stored in S3, let's use Amazon Athena to query it!
Take a look at a few example Athena queries.
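For instance, a query like the one below can be run from the Athena console or the CLI. The database, table, and column names here are hypothetical; use whatever schema your crawler actually created:

```shell
# Hypothetical schema: a "life" database with a partitioned "raw" table,
# where year/month are string partition columns created by the crawler.
QUERY="SELECT metric, count(*) AS n
FROM life.raw
WHERE year = '2020' AND month = '06'
GROUP BY metric
ORDER BY n DESC"
echo "$QUERY"

# To run it via the CLI (not run here):
# aws athena start-query-execution \
#   --query-string "$QUERY" \
#   --query-execution-context Database=life \
#   --result-configuration OutputLocation=s3://<bucket>/athena-results/
```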
ioreg -c AppleBacklightDisplay | grep brightness | cut -f2- -d= | sed 's/=/:/g' | jq -c '.brightness'
log show --style syslog --predicate 'process == "loginwindow"' --debug --info --last 1d | grep "Verify password called with PAM auth set to YES, but pam handle == nil"