
:honeybee: Buzz Rust :honeybee:


Warning: This project is a POC and is not actively maintained. Its dependencies are outdated, which might introduce security vulnerabilities.

Buzz is best defined by the following key concepts:

Functionally, it can be compared to the managed analytics query services offered by all major cloud providers, but differs in that it is open source and uses lower-level components such as cloud functions and cloud storage.

The current implementation is in Rust and is based on Apache Arrow with the DataFusion engine.

Architecture

Buzz is composed of three systems:

Design overview

Buzz analytics queries are defined with SQL. The query is composed of different statements for the different stages (see example here). This is different from most distributed engines, which have a scheduler that takes care of splitting a single SQL query into multiple stages to be executed on different executors. The reason is that in Buzz, our executors (HBees and HCombs) have very different behaviors and capabilities. This is unusual, and designing a query planner that understands it is not obvious. We prefer leaving it to our dear users, who are notoriously known to be smart, to decide which part of the query should be executed where. Further down the road, we might come up with a scheduler that is able to figure this out automatically.

Note: the h in hbee and hcomb stands for honey, of course! :smiley:

Tooling

The Makefile contains all the useful commands to build, deploy and test-run Buzz, so you will likely need make to be installed.

Because Buzz is a cloud-native query engine, many commands require an AWS account to run. The Makefile uses the credentials stored in your ~/.aws/credentials file (cf. doc) to access your cloud accounts on your behalf.
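
For reference, the profiles used by the Makefile are plain named profiles from that credentials file. A minimal example (the profile names and values below are purely illustrative):

[deploy-profile-example]
aws_access_key_id = AKIA...
aws_secret_access_key = ...

[backend-profile-example]
aws_access_key_id = AKIA...
aws_secret_access_key = ...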

When running commands from the Makefile, you will be prompted to specify the profiles (from the AWS credentials file), the region and the stage you wish to use. If you don't want to specify these each time you run a make command, you can set defaults by creating a file named default.env at the root of the project with the following content:

REGION:=eu-west-1|us-east-1|...
DEPLOY_PROFILE:=profile-where-buzz-will-be-deployed
BACKEND_PROFILE:=profile-that-has-access-to-the-s3-terraform-backend
STAGE:=dev|prod|...

Build

Build commands can be found in the Makefile.

The AWS Lambda runtime runs a custom version of Linux. To keep ourselves out of trouble, we use musl instead of libc to make a static build. For reproducibility reasons, this build is done through Docker.

Note: Docker builds require BuildKit (included in Docker 18.09+) and the Docker Engine API 1.40+ with experimental features activated.
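
For illustration, a BuildKit build looks roughly like the following; the image tag and Dockerfile path below are assumptions, the real invocations are defined in the Makefile:

# Enable BuildKit for the current shell (Docker 18.09+).
export DOCKER_BUILDKIT=1
# Hypothetical invocation: the tag and Dockerfile path are placeholders,
# the Makefile drives the actual static (musl) build.
docker build -f docker/build.Dockerfile -t buzz-rust-build .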

Deploy

The code can be deployed to AWS through terraform:
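
The Makefile wraps the deployment, but underneath it is a standard Terraform workflow, roughly as follows (the directory name is an assumption, the exact commands live in the Makefile):

# Sketch of the underlying workflow, not the exact Makefile targets.
cd infra                 # assumed location of the terraform configuration
terraform init           # sets up the S3 state backend (BACKEND_PROFILE)
terraform apply          # creates the lambdas, Fargate resources, etc. (DEPLOY_PROFILE)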

Notes:

Running queries

To run queries using the Makefile, you need the AWS CLI v2+ to be installed.

Example query (cf. code/examples/query-delta-taxi.json):

{
    "steps": [
        {
            "sql": "SELECT payment_type, COUNT(payment_type) as payment_type_count FROM nyc_taxi GROUP BY payment_type",
            "name": "nyc_taxi_map",
            "step_type": "HBee",
            "partition_filter": "pickup_date<='2009-01-05'"
        },
        {
            "sql": "SELECT payment_type, SUM(payment_type_count) FROM nyc_taxi_map GROUP BY payment_type",
            "name": "nyc_taxi_reduce",
            "step_type": "HComb"
        }
    ],
    "capacity": {
        "zones": 1
    },
    "catalogs": [
        {
            "name": "nyc_taxi",
            "type": "DeltaLake",
            "uri": "s3://cloudfuse-taxi-data/delta-tables/nyc-taxi-daily"
        }
    ]
}

A query is a succession of steps. The HBee step type means that this part of the query runs in cloud functions (e.g. AWS Lambda). The HComb step type means the associated query part runs on the container reducers (e.g. AWS Fargate). The output of one step should be used as the input (FROM statement) of the next step by referring to it by the step's name.

A query takes its target list of files from a catalog. Supported catalogs are Static (file list compiled into the binary) and DeltaLake.

The capacity.zones field indicates the number of availability zones (and thus containers) used for HComb steps. This can be used to increase reducing capacity and minimize cross-AZ data exchanges (which are both slower and more expensive).

In HBee steps, you can specify a partition_filter field with an SQL filter expression on the partitioning dimensions. Currently, partition values can only be strings.
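
The Makefile takes care of submitting queries, but conceptually the query document shown above is sent as a payload to the deployed entry point. A purely illustrative AWS CLI invocation, assuming that entry point is a Lambda function (the function name below is made up; the real resource names come from the deployment outputs):

# Illustrative only: the actual submission is handled by the Makefile.
aws lambda invoke \
    --function-name <buzz-entry-point-function> \
    --cli-binary-format raw-in-base64-out \
    --payload file://code/examples/query-delta-taxi.json \
    response.json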

Current limitations:

Note that the first query is slow (and might even time out!) because it first needs to start a container for the HComb, which typically takes 15-25s on Fargate. Subsequent queries are much faster because they reuse the HComb container. The HComb is stopped after a configurable period of inactivity (typically 5 minutes).