ifdef::env-github[] :tip-caption: :bulb: :note-caption: :information_source: :important-caption: :heavy_exclamation_mark: :caution-caption: :fire: :warning-caption: :warning: endif::[] :toc: macro
= Oxbow
Oxbow is a project to take an existing storage location which contains link:https://parquet.apache.org[Apache Parquet] files into a link:https://delta.io[Delta Lake table]. It is intended to run both as an AWS Lambda or as a command line application.
The project is named after link:https://en.wikipedia.org/wiki/Oxbow_lake[Oxbow lakes] to keep with the lake theme.
toc::[]
== Using
=== Command Line
Executing cargo build --release
from a clone of this repository will build
the command line binary oxbow
which can be used directly to convert a
directory full of .parquet
files into a Delta table.
This is an in place operation and will convert the specified table location into a Delta table!
% export AWS_REGION=us-west-2 % export AWS_SECRET_ACCESS_KEY=xxxx
=== Lambda
The deployment/
directory contains the necessary Terraform to provision the
function, a DynamoDB table for locking, S3 bucket, and IAM permissions.
After configuring the necessary authentication for Terraform, the following steps can be used to provision:
.parquet
files, this may need to
be tuned.==== Advanced
To help ameliorate
link:https://www.buoyantdata.com/blog/2023-11-27-concurrency-limitations-with-deltalake-on-aws.html[concurrency
challenges for Delta Lake on AWS] with the DynamoDb lock, the deployment/
directory also contains an "advanced" pattern which uses the group-events
Lambda to help serialize S3 Bucket Notifications into an AWS SQS FIFO with
Message Group IDs.
To build all the necessary code locally for the Advanced pattern, please run
make build-release
== Development
Building and testing can be done with cargo: cargo test
.
In order to deploy this in AWS Lambda, it must first be built with the cargo lambda
command line tool, e.g.:
This will produce the file: target/lambda/oxbow-lambda/bootstrap.zip
which can be
uploaded direectly in the web console, or referenced in the Terraform (see
deployment.tf
).
=== Design
==== Command Line
When running oxbow
via command line it is a one time operation. It will
take an existing directory or location full of .parquet
files and create a
Delta table out of it.
==== Lambda
When running oxbow
inside of a AWS Lambda function it should be configured
with an S3 Event Trigger and create new commits to a Delta Lake table any time
a .parquet
file is added to the bucket/prefix.
== Licensing
This repository is licensed under the link:https://www.gnu.org/licenses/agpl-3.0.en.html[AGPL 3.0]. If your organization is interested in re-licensing this function for re-use, contact me via email for commercial licensing terms: rtyler@buoyantdata.com