Sleeper
Introduction
Sleeper is a serverless, cloud-native, log-structured merge tree based, scalable key-value store. It is designed to
allow the ingest of very large volumes of data at low cost. Data is stored in rows in tables. Each row has a key field,
and an optional sort field, and some value fields. Queries for rows where the key takes a given value take around
1-2 seconds, but many thousands can be run in parallel. Each individual query has a negligible cost.
Sleeper can be thought of as a cloud-native reimagining of systems such as HBase and Accumulo. The architecture is
very different to those systems. Sleeper has no long running servers. This means that if there is no work to be done,
i.e. no data is being ingested and no background operations such as compactions are in progress, then the only cost
is the cost of the storage. There are no wasted compute cycles, i.e. it is "serverless".
The current codebase can only be deployed to AWS, but there is nothing in the design that limits it to AWS. In time
we would like to be able to deploy Sleeper to other public cloud environments such as Microsoft Azure
or to a Kubernetes cluster.
Note that Sleeper is currently a prototype. Further development and testing are needed before it can be considered
to be ready for production use.
Functionality
Sleeper stores records in tables. A table is a collection of records that conform to a schema. A record is a map
from field names to values. For example, a schema might have a row key field called 'id' of type string, a sort
field called 'timestamp' of type long, and a value field called 'name' of type string. Each record in a table
with that schema is a map with keys of id, timestamp and name. Data in the table is stored range-partitioned by
the key field. Within partitions, records are stored in Parquet files in S3. These files contain records in sorted
order (sorted by the key field and then by the sort field).
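As a rough illustration of the data model described above, the sketch below represents the example schema and some records as plain Python dictionaries, sorts them by key field and then sort field, and looks up a key's range partition. The `SCHEMA` layout, the `SPLIT_POINTS` values and all function names are hypothetical and chosen for this example only; they are not Sleeper's schema or partition API.

```python
from bisect import bisect_right

# Hypothetical schema mirroring the example in the text (not Sleeper's schema API).
SCHEMA = {
    "row_key": ("id", str),
    "sort_key": ("timestamp", int),
    "values": [("name", str)],
}


def conforms(record, schema=SCHEMA):
    """Return True if the record has exactly the schema's fields, each with the expected type."""
    fields = [schema["row_key"], schema["sort_key"], *schema["values"]]
    return set(record) == {name for name, _ in fields} and all(
        isinstance(record[name], typ) for name, typ in fields
    )


def sort_records(records, schema=SCHEMA):
    """Sort records by the row key field, then by the sort field."""
    key_name, sort_name = schema["row_key"][0], schema["sort_key"][0]
    return sorted(records, key=lambda r: (r[key_name], r[sort_name]))


# Assumed split points: keys below "h" fall in partition 0,
# keys from "h" up to (but excluding) "p" in partition 1, the rest in partition 2.
SPLIT_POINTS = ["h", "p"]


def partition_for(key, split_points=SPLIT_POINTS):
    """Return the index of the key range that contains the given row key."""
    return bisect_right(split_points, key)


records = [
    {"id": "bob", "timestamp": 20, "name": "Bob B"},
    {"id": "alice", "timestamp": 30, "name": "Alice A"},
    {"id": "alice", "timestamp": 10, "name": "Alice A"},
    {"id": "zed", "timestamp": 5, "name": "Zed Z"},
]
ordered = sort_records(records)
```

Here `ordered` holds the two 'alice' records first (timestamps 10 then 30), followed by 'bob' and 'zed', which is the order in which a file within a partition would hold them.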
Sleeper is deployed using the AWS CDK. Each piece of functionality is deployed as a separate CDK substack of one main
stack:
- State store stacks: Each table has a state store that stores metadata about the table such as files
that are in the table, the partitions they are in and information about the partitions themselves.
- Compaction stack: As the number of files in a partition increases, their contents must be merged
("compacted") into a single sorted file. The compaction stack performs this task using lambda and SQS for job
creation and queueing, and Fargate or EC2 to execute the tasks.
- Garbage collector stack: After compaction jobs have completed, the input files are deleted (after a user
configurable delay to ensure the files are not in use by queries). The garbage collector stack performs this
task using a lambda function.
- Partition splitting stack: Partitions need to be split once they reach a user configurable size. This only
affects the metadata about partitions in the state store. This task is performed using a lambda.
- Ingest stack: This allows the ingestion of data from Parquet files. These files need to consist of data that
matches the schema of the table. They do not need to be sorted or partitioned in any way. Ingest job requests
are submitted to an SQS queue. They are then executed by tasks running on an ECS cluster. These tasks are scaled
up using a lambda that periodically monitors the number of items on the queue. The tasks scale down naturally,
because a task terminates when there are no jobs on the queue.
- Query stack: This stack allows queries to be executed in lambdas. Queries are submitted to an SQS queue. This
triggers a lambda that executes the queries. It can then write the results to files in an S3 bucket, or send the
results to an SQS queue. A query can also be executed from a websocket client. In this case the results are
returned directly to the client. Note that queries can also be submitted directly from a Java client, and this does
not require the query stack.
- Topic stack: Notifications of errors and failures will appear on an SNS topic.
- EMR bulk import stack: This allows large volumes of data in Parquet files to be ingested. A bulk import job
request is sent to an SQS queue. This triggers a lambda that creates an EMR cluster. This cluster runs a Spark
job to import the data. Once the job is finished the EMR cluster shuts down.
- Persistent EMR bulk import stack: Similar to the above stack, but the EMR cluster is persistent, i.e. it never
shuts down. This is appropriate if there is a steady stream of import jobs. The cluster can either be of fixed
size or it can use EMR managed scaling.
- EMR Serverless Bulk Import stack: Similar to the above 2 stacks in behaviour. This stack is created when a Sleeper
instance is deployed. It is the default way in which bulk imports are run, and its benefit is that no computing
resources are used when no bulk import jobs are present.
- Dashboard stack: This displays properties of the system in a CloudWatch dashboard.
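The compaction and garbage collection steps in the list above can be sketched in a few lines. This is a toy, in-memory model under stated assumptions: `ToyStateStore`, its methods and the file names are hypothetical and are not Sleeper's state store API, and lists of dictionaries stand in for sorted Parquet files. It shows a k-way merge of sorted inputs into one sorted output, followed by deletion of the inputs only after a configurable delay.

```python
import heapq


def compact(files, key=lambda r: (r["id"], r["timestamp"])):
    """Merge several already-sorted files into one sorted file (k-way merge)."""
    return list(heapq.merge(*files, key=key))


class ToyStateStore:
    """Hypothetical in-memory stand-in for a table's state store (not Sleeper's API)."""

    def __init__(self):
        self.active = {}        # file name -> sorted list of records
        self.ready_for_gc = {}  # file name -> time at which it was compacted away

    def add_file(self, name, records):
        self.active[name] = records

    def apply_compaction(self, inputs, output_name, now):
        """Replace the input files with one merged file and mark the inputs for later deletion."""
        merged = compact([self.active[name] for name in inputs])
        for name in inputs:
            del self.active[name]
            self.ready_for_gc[name] = now
        self.active[output_name] = merged

    def garbage_collect(self, now, delay_seconds):
        """Return (and forget) the files whose deletion delay has elapsed."""
        deletable = [n for n, t in self.ready_for_gc.items() if now - t >= delay_seconds]
        for name in deletable:
            del self.ready_for_gc[name]
        return sorted(deletable)


store = ToyStateStore()
store.add_file("a.parquet", [{"id": "a", "timestamp": 1}, {"id": "c", "timestamp": 1}])
store.add_file("b.parquet", [{"id": "a", "timestamp": 2}, {"id": "b", "timestamp": 1}])
store.apply_compaction(["a.parquet", "b.parquet"], "merged.parquet", now=100)
```

With a delay of 60 seconds, calling `store.garbage_collect(now=150, delay_seconds=60)` returns nothing to delete, while the same call at `now=160` returns both input files. The delay gives queries that started before the compaction time to finish reading the old files.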
The following functionality is experimental:
- Athena: There is experimental integration with Athena to allow SQL analytics to be run against a table.
- Trino: There is a plugin for Trino that allows SQL select, join and insert statements to be run against multiple
Sleeper tables. It returns results quickly and it is particularly suitable for interactive queries which retrieve just
a few rows from the underlying Sleeper instance, such as
SELECT * FROM t1 INNER JOIN t2 USING (keycol) WHERE t1.name = 'Fred Jones'
- Bulk import using Spark running on EKS: This stack allows data to be ingested by running Spark on EKS. Currently
this runs the Spark executors in either Fargate or EC2.
Sleeper provides the tools to implement fine-grained security on the data, although further work is needed to make
these easier to use. Briefly, the following steps are required:
- Decide how to store the security information in the table, e.g. there might be one security label per row,
or two per row, or one per cell. These fields must be added to the schema.
- Write an iterator that will run on the results of every query to filter out records that a user is not permitted
to see. This takes a user's authorisations as input and uses them to decide whether the user can see
the data.
- Ensure the Sleeper instance is deployed such that the boundary of the system is protected.
- Ensure that queries are submitted to the query queue via a service that authenticates users and passes their
authorisations into the query-time iterator configuration.
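The filtering step in the list above can be illustrated with a minimal sketch. It assumes the simplest of the layouts mentioned, one security label per row, stored in a field that is hypothetically named `security_label` here. The function name and signature are also illustrative and do not reflect Sleeper's iterator interface.

```python
def authorised_records(records, user_authorisations, label_field="security_label"):
    """Yield only the records whose security label is among the user's authorisations.

    Records with no label are withheld, so the filter fails closed.
    """
    allowed = frozenset(user_authorisations)
    for record in records:
        label = record.get(label_field)
        if label is not None and label in allowed:
            yield record


query_results = [
    {"id": "a", "timestamp": 1, "name": "Alice", "security_label": "public"},
    {"id": "b", "timestamp": 2, "name": "Bob", "security_label": "secret"},
    {"id": "c", "timestamp": 3, "name": "Carol"},  # no label: never returned
]
visible = list(authorised_records(query_results, {"public"}))
```

In this example a user holding only the 'public' authorisation sees the record for 'a', while a user holding both 'public' and 'secret' would see 'a' and 'b'. Writing the filter as a generator keeps it streaming, so it can be applied to query results without loading them all into memory.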
License
Sleeper is licensed under the Apache 2 license.
Documentation
See the documentation contained in the docs folder:
- Getting started
- Deployment guide
- Creating a schema
- Tables
- Ingesting data
- Checking the status of the system
- Retrieving data
- Python API
- Trino
- Deploying to LocalStack
- Developer guide
- Dependency conflicts
- Design
- System tests
- Release process
- Common problems and their solutions
- Roadmap