BemiHQ / BemiDB

Postgres read replica optimized for analytics
https://bemidb.com
GNU Affero General Public License v3.0
1.05k stars 18 forks source link
analytics data-lakehouse data-movement data-warehouse duckdb iceberg olap parquet postgresql replication zero-etl

BemiDB

BemiDB is a Postgres read replica optimized for analytics. It consists of a single binary that seamlessly connects to a Postgres database, replicates the data in a compressed columnar format, and allows you to run complex queries using its Postgres-compatible analytical query engine.

BemiDB

Contents

Highlights

Use cases

Quickstart

Install BemiDB:

curl -sSL https://raw.githubusercontent.com/BemiHQ/BemiDB/refs/heads/main/scripts/install.sh | bash

Sync data from a Postgres database:

./bemidb --pg-database-url postgres://postgres:postgres@localhost:5432/dbname sync

Run BemiDB database:

./bemidb start

Run Postgres queries on top of the BemiDB database:

# List all tables
psql postgres://localhost:54321/bemidb -c "SELECT * FROM information_schema.tables"

# Query a table
psql postgres://localhost:54321/bemidb -c "SELECT COUNT(*) FROM [table_name]"

Configuration

Local disk storage

By default, BemiDB stores data on the local disk. Here is an example of running BemiDB with default settings and storing data in a local iceberg directory:

./bemidb \
  --port 54321 \
  --database bemidb \
  --storage-type LOCAL \
  --storage-path ./iceberg \ # $PWD/iceberg/*
  --init-sql ./init.sql \
  --log-level INFO \
  start

To run BemiDB with environment variables:

# Default settings
export BEMIDB_PORT=54321
export BEMIDB_DATABASE=bemidb
export BEMIDB_STORAGE_TYPE=LOCAL
export BEMIDB_STORAGE_PATH=./iceberg
export BEMIDB_INIT_SQL=./init.sql
export BEMIDB_LOG_LEVEL=INFO

./bemidb start

S3 block storage

BemiDB natively supports S3 storage. You can specify the S3 settings using the following flags:

./bemidb \
  --port 54321 \
  --database bemidb \
  --storage-type S3 \
  --storage-path iceberg \ # s3://[AWS_S3_BUCKET]/iceberg/*
  --aws-region [AWS_REGION] \
  --aws-s3-bucket [AWS_S3_BUCKET] \
  --aws-access-key-id [AWS_ACCESS_KEY_ID] \
  --aws-secret-access-key [AWS_SECRET_ACCESS_KEY] \
  start

To run BemiDB with environment variables:

export BEMIDB_PORT=54321
export BEMIDB_DATABASE=bemidb
export BEMIDB_STORAGE_TYPE=S3
export BEMIDB_STORAGE_PATH=iceberg
export AWS_REGION=[AWS_REGION]
export AWS_S3_BUCKET=[AWS_S3_BUCKET]
export AWS_ACCESS_KEY_ID=[AWS_ACCESS_KEY_ID]
export AWS_SECRET_ACCESS_KEY=[AWS_SECRET_ACCESS_KEY]

./bemidb start

CLI arguments take precedence over environment variables. I.e. you can override the environment variables with CLI arguments.

Here is the minimal IAM policy required for BemiDB to work with S3:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::[AWS_S3_BUCKET]",
                "arn:aws:s3:::[AWS_S3_BUCKET]/*"
            ]
        }
    ]
}

Periodic data sync

Sync data periodically from a Postgres database:

./bemidb \
  --pg-sync-interval 1h \
  --pg-database-url postgres://postgres:postgres@localhost:5432/dbname \
  sync

Alternatively, you can set the interval using environment variables:

export PG_SYNC_INTERVAL=1h
export PG_DATABASE_URL=postgres://postgres:postgres@localhost:5432/dbname

./bemidb sync

Note that incremental real-time replication is not supported yet (WIP). Please see the Future roadmap.

Syncing from multiple Postgres databases

BemiDB supports syncing data from multiple Postgres databases into the same BemiDB database by allowing prefixing schemas.

For example, if two Postgres databases db1 and db2 contain public schemas, you can prefix them as follows:

./bemidb \
  --pg-schema-prefix db1_ \ # or PG_SCHEMA_PREFIX=db1_ using an env variable
  --pg-database-url postgres://postgres:postgres@localhost:5432/db1 \
  sync

./bemidb \
  --pg-schema-prefix db2_ \ # or PG_SCHEMA_PREFIX=db2_ using an env variable
  --pg-database-url postgres://postgres:postgres@localhost:5432/db2 \
  sync

Then you can query and join tables from both Postgres databases in the same BemiDB database:

./bemidb start

psql postgres://localhost:54321/bemidb -c "SELECT * FROM db1_public.[TABLE] JOIN db2_public.[TABLE] ON ..."

Architecture

BemiDB consists of the following main components:

Architecture

Benchmark

BemiDB is optimized for analytical workloads and can run complex queries up to 2000x faster than Postgres.

On the TPC-H benchmark with 22 sequential queries, BemiDB outperforms Postgres by a significant margin:

See the benchmark directory for more details.

Data type mapping

Primitive data types are mapped as follows:

PostgreSQL Parquet Iceberg
bool BOOLEAN boolean
bpchar, varchar, text BYTE_ARRAY (UTF8) string
int2, int4 INT32 int
int8 INT64 long
xid INT32 (UINT_32) int
xid8 INT64 (UINT_64) long
float4, float8 FLOAT float
numeric FIXED_LEN_BYTE_ARRAY (DECIMAL) decimal(P, S)
date INT32 (DATE) date
time, timetz INT64 (TIME_MICROS / TIME_MILLIS) time
timestamp INT64 (TIMESTAMP_MICROS / TIMESTAMP_MILLIS) timestamp / timestamp_ns
timestamptz INT64 (TIMESTAMP_MICROS / TIMESTAMP_MILLIS) timestamptz / timestamptz_ns
uuid FIXED_LEN_BYTE_ARRAY uuid
bytea BYTE_ARRAY (UTF8) binary
interval BYTE_ARRAY (UTF8) string
point, line, lseg, box, path, polygon, circle BYTE_ARRAY (UTF8) string
cidr, inet, macaddr, macaddr8 BYTE_ARRAY (UTF8) string
tsvector, pg_snapshot BYTE_ARRAY (UTF8) string
json, jsonb BYTE_ARRAY (UTF8) string (JSON logical type)
_* (array) LIST * list
* (user-defined type) BYTE_ARRAY (UTF8) string

Note that Postgres json and jsonb types are implemented as JSON logical types and stored as strings (Parquet and Iceberg don't support unstructured data types). You can query JSON columns using standard operators, for example:

SELECT * FROM [TABLE] WHERE [JSON_COLUMN]->>'[JSON_KEY]' = '[JSON_VALUE]';

Future roadmap

Alternatives

BemiDB vs PostgreSQL

PostgreSQL pros:

PostgreSQL cons:

BemiDB vs PostgreSQL extensions

PostgreSQL extensions pros:

PostgreSQL extensions cons:

Main types of extensions for analytics:

BemiDB vs DuckDB

DuckDB pros:

DuckDB cons:

BemiDB vs real-time OLAP databases (ClickHouse, Druid, etc.)

Real-time OLAP databases pros:

Real-time OLAP databases cons:

BemiDB vs big data query engines (Spark, Trino, etc.)

Big data query engines pros:

Big data query engines cons:

BemiDB vs proprietary solutions (Snowflake, Redshift, BigQuery, Databricks, etc.)

Proprietary solutions pros:

Proprietary solutions cons:

Development

We develop BemiDB using Devbox to ensure a consistent development environment without relying on Docker.

To start developing BemiDB and run tests, follow these steps:

cp .env.sample .env
make install
make test

To run BemiDB locally, use the following command:

make up

To sync data from a Postgres database, use the following command:

make sync

License

Distributed under the terms of the AGPL-3.0 License. If you need to modify and distribute the code, please release it to contribute back to the open-source community.