gchq / sleeper

A cloud-native, serverless, scalable, cheap key-value store
Apache License 2.0

Sleeper

Introduction

Sleeper is a serverless, cloud-native, scalable key-value store based on a log-structured merge tree. It is designed to allow the ingest of very large volumes of data at low cost. Data is stored in rows in tables. Each row has a key field, an optional sort field, and some value fields. A query for the rows where the key takes a given value takes around 1-2 seconds, but many thousands of queries can be run in parallel. Each individual query has a negligible cost.
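As a rough illustration of the log-structured merge tree idea in general, the sketch below merges several key-sorted runs into one (a "compaction") and then looks up a key by binary search. It is not Sleeper's own code: the function names `compact` and `lookup`, and the use of in-memory lists of `(key, value)` tuples in place of files, are assumptions made purely for illustration.

```python
import heapq
from bisect import bisect_left, bisect_right


def compact(sorted_runs):
    """Merge several key-sorted runs into a single key-sorted run."""
    return list(heapq.merge(*sorted_runs, key=lambda row: row[0]))


def lookup(sorted_run, key):
    """Return every row whose key equals `key`, using binary search."""
    keys = [k for k, _ in sorted_run]
    lo = bisect_left(keys, key)
    hi = bisect_right(keys, key)
    return sorted_run[lo:hi]


# Two hypothetical runs, each already sorted by key.
run_a = [("apple", 1), ("pear", 2)]
run_b = [("banana", 3), ("pear", 4), ("zebra", 5)]

merged = compact([run_a, run_b])
pears = lookup(merged, "pear")
```

Because every run is already sorted, the merge streams through the inputs once and never needs to re-sort the data, which is what makes this style of storage cheap to write to.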

Sleeper can be thought of as a cloud-native reimagining of systems such as HBase and Accumulo. The architecture is very different to those systems. Sleeper has no long-running servers. This means that if there is no work to be done, i.e. no data is being ingested and no background operations such as compactions are in progress, then the only cost is the cost of the storage. There are no wasted compute cycles, i.e. it is "serverless".

The current codebase can only be deployed to AWS, but there is nothing in the design that limits it to AWS. In time we would like to be able to deploy Sleeper to other public cloud environments such as Microsoft Azure or to a Kubernetes cluster.

Note that Sleeper is currently a prototype. Further development and testing are needed before it can be considered ready for production use.

Functionality

Sleeper stores records in tables. A table is a collection of records that conform to a schema. A record is a map from field names to values. For example, a schema might have a row key field called 'id' of type string, a sort field called 'timestamp' of type long, and a value field called 'name' of type string. Each record in a table with that schema is a map with the keys id, timestamp and name. Data in the table is range-partitioned by the key field. Within partitions, records are stored in Parquet files in S3. These files contain records in sorted order (sorted by the key field and then by the sort field).
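The data model described above can be sketched in a few lines. This is an illustration only and does not use Sleeper's own classes or API: the `Schema` dataclass, the helper functions and the single split point "g" are all hypothetical, and plain lists stand in for the Parquet files held in S3.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Schema:
    """Hypothetical stand-in for a table schema; not Sleeper's own class."""
    row_key: str
    sort_key: str
    value_fields: tuple


def conforms(record, schema):
    """A record conforms if it has exactly the fields the schema names."""
    expected = {schema.row_key, schema.sort_key, *schema.value_fields}
    return set(record) == expected


def partition_index(key, split_points):
    """Return the index of the key range that contains `key`.

    `split_points` must be sorted; partition i holds keys in
    [split_points[i-1], split_points[i]).
    """
    return bisect_right(split_points, key)


def write_partitions(records, schema, split_points):
    """Group records by key range, then sort each group by (key, sort field)."""
    partitions = [[] for _ in range(len(split_points) + 1)]
    for record in records:
        if not conforms(record, schema):
            raise ValueError(f"record does not match schema: {record}")
        index = partition_index(record[schema.row_key], split_points)
        partitions[index].append(record)
    for part in partitions:
        part.sort(key=lambda r: (r[schema.row_key], r[schema.sort_key]))
    return partitions


# The example schema from the text: key 'id', sort field 'timestamp', value 'name'.
schema = Schema(row_key="id", sort_key="timestamp", value_fields=("name",))
records = [
    {"id": "m7", "timestamp": 20, "name": "c"},
    {"id": "a1", "timestamp": 30, "name": "a"},
    {"id": "a1", "timestamp": 10, "name": "b"},
]
partitions = write_partitions(records, schema, split_points=["g"])
```

Here the two records with id "a1" land in the first partition, ordered by timestamp, and the record with id "m7" lands in the second.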

Sleeper is deployed using the AWS CDK. Each piece of functionality is deployed as a separate CDK substack of one main stack.

The following functionality is experimental:

Sleeper provides the tools to implement fine-grained security on the data, although further work is needed to make these easier to use. Briefly, the following steps are required:

License

Sleeper is licensed under the Apache License 2.0.

Documentation

See the documentation contained in the docs folder:

  1. Getting started
  2. Deployment guide
  3. Creating a schema
  4. Tables
  5. Ingesting data
  6. Checking the status of the system
  7. Retrieving data
  8. Python API
  9. Trino
  10. Deploying to LocalStack
  11. Developer guide
  12. Dependency conflicts
  13. Design
  14. System tests
  15. Release process
  16. Common problems and their solutions
  17. Roadmap