gangliao closed this issue 3 years ago
Step Functions is a serverless orchestration service (or state machine) that lets us combine AWS Lambda functions and other AWS services to build a dataflow stream processing application.
ServerlessCQ parses continuous queries, generates abstract syntax trees, and transforms optimized logical plans into state machines [1] on AWS, in which each state or task is a Lambda function.
For FaaS, users have to provide function implementations for their applications. Moreover, for complex applications that require multiple stages of functions, there are often many valid evaluation strategies and execution orders. AWS Step Functions (driven by the Amazon States Language) is an emerging approach that lets users orchestrate functions as they need. However, this workflow model has its own limitations. For example, to reap the benefits of FaaS (pay-as-you-go pricing and auto-scaling), users are forced to manually map each distinct query to a brand-new dataflow execution model.
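To illustrate the manual mapping burden, a hand-written Amazon States Language definition for even a trivial two-stage query might look like the following sketch (the stage names and Lambda ARNs are hypothetical placeholders, not real resources):

```python
import json

# A minimal, hand-written Amazon States Language definition for a
# hypothetical two-stage query: a scan/filter stage feeding an aggregate.
# The Lambda ARNs below are placeholders.
state_machine = {
    "Comment": "Hand-mapped dataflow for one query (placeholder ARNs)",
    "StartAt": "ScanFilter",
    "States": {
        "ScanFilter": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:scan-filter",
            "Next": "Aggregate",
        },
        "Aggregate": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:aggregate",
            "End": True,
        },
    },
}

definition = json.dumps(state_machine, indent=2)
print(definition)
```

Every new query needs its own such definition, which is exactly the repetitive work the transpiler is meant to eliminate.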
This mapping is unnatural for continuous queries. For each query, asking users to orchestrate the relevant cloud functions through a vendor-specific language is equivalent to asking them to specify physical execution plans directly in a database system.
ServerlessCQ seizes this opportunity by building a transpiler (a source-to-source compiler) into the client CLI, so that customers and engineers can get out of the business of manually translating SQL to Amazon States Language.
ServerlessCQ parses the input query using a formal grammar-based lexer and parser (ANTLR), then translates it into an intermediate representation. This allows us to translate SQL queries directly from the client side to dataflow models (e.g., AWS Step Functions) on the cloud.
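As a rough sketch of the parse-to-IR step (the real system uses a formal ANTLR grammar and a far richer plan representation, so the code below is only illustrative), a trivial SELECT can be turned into an ordered list of logical operators:

```python
import re

def to_logical_plan(sql: str):
    """Toy translation of a simple SELECT into a logical-plan list.

    Only a sketch of the SQL -> intermediate representation step;
    it handles just SELECT ... FROM ... [WHERE ...].
    """
    m = re.match(
        r"SELECT\s+(?P<cols>.+?)\s+FROM\s+(?P<table>\w+)(?:\s+WHERE\s+(?P<pred>.+))?$",
        sql.strip(),
        re.IGNORECASE,
    )
    if m is None:
        raise ValueError("unsupported query")
    plan = [("scan", m.group("table"))]
    if m.group("pred"):
        plan.append(("filter", m.group("pred")))
    plan.append(("project", [c.strip() for c in m.group("cols").split(",")]))
    return plan

print(to_logical_plan("SELECT a, b FROM t WHERE a > 1"))
# [('scan', 't'), ('filter', 'a > 1'), ('project', ['a', 'b'])]
```

The resulting operator list is what the client then optimizes and lowers to a physical plan.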
This implies new opportunities for optimizing adaptive query execution based on cloud provisioning. Unlike traditional systems, ServerlessCQ has the potential to be the first cloud-native streaming system that supports SQL on serverless functions.
SCOPE is a very interesting declarative language. It allows users to focus on the data transformations required to solve the problem at hand and hides the complexity of the underlying platform and implementation details. The SCOPE compiler and optimizer are responsible for generating an efficient execution plan (MapReduce) and the runtime for executing the plan with minimal overhead.
ServerlessCQ shares a similar goal but in a new context -- FaaS. For example, the output of compiling a SCOPE script consists of three components.
Many optimizations from SCOPE can be applied to ServerlessCQ. Furthermore, ServerlessCQ can exploit additional optimization strategies on the cloud.
What differs from previous systems is that ServerlessCQ completely abandons the job scheduler and instead delegates job scheduling to the cloud provider.
System Design:
Client:
+----------------------------------------------------------------------+
| Client (CLI, Java, C++, Go, Rust...) |
|----------------------------------------------------------------------|
| |
| |
| +--------+ +-------------------------+ +---------------+ |
| | SQL +---->| Logical Plan(Optimized) +----->| Physical Plan | |
| +--------+ +-------------------------+ +---------------+ |
| |
+----------------------------------------------------------------------+
Physical plans are typically described as a hierarchical structure in which each node specifies exactly how to execute a particular operation. Any machine that imports the DataFusion library [1] and receives the physical plan can directly perform the corresponding computation. This means that each operator (max, min, sum, sort, aggregate, and possibly join) does not need to be written by hand; it directly calls the operations that Arrow provides. Arrow itself defines an in-memory format for columnar data, and each operation is SIMD-optimized.
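The hierarchical structure can be sketched as a tree of operator nodes, each of which knows how to execute itself over its child's output. This toy version models rows as dicts; the real system delegates these kernels to DataFusion/Arrow:

```python
# Sketch of a hierarchical physical plan: each node executes its operation
# over its child's output. Rows are modeled as Python dicts for clarity.
class Scan:
    def __init__(self, rows):
        self.rows = rows

    def execute(self):
        return list(self.rows)

class Filter:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def execute(self):
        return [r for r in self.child.execute() if self.predicate(r)]

class Projection:
    def __init__(self, child, columns):
        self.child, self.columns = child, columns

    def execute(self):
        return [{c: r[c] for c in self.columns} for r in self.child.execute()]

# Projection(Filter(Scan)) corresponds to SELECT b FROM ... WHERE a > 1.
plan = Projection(
    Filter(Scan([{"a": 1, "b": 2}, {"a": 3, "b": 4}]), lambda r: r["a"] > 1),
    ["b"],
)
print(plan.execute())  # [{'b': 4}]
```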
AWS:
After serializing the physical plan [2], ServerlessCQ splits it and delivers each piece to a Lambda function through the function's payload. Once the initialization phase completes, ServerlessCQ triggers the workflow according to the time window specified in the SQL query. Each Lambda function executes the operators referenced by the partial physical plan it holds.
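The split-and-deliver step might look roughly like the following sketch, where each payload carries one stage's operators plus the name of its downstream function (the stage and function names here are hypothetical):

```python
import json

# Sketch: split a serialized physical plan into per-stage Lambda payloads.
# Each payload holds the operators one function should run and the name of
# the next function downstream (None for the last stage).
stages = [
    {"function": "scan-filter", "operators": ["scan", "filter", "project"]},
    {"function": "partial-join", "operators": ["partial_join"]},
    {"function": "sort-limit", "operators": ["sort", "limit"]},
]

payloads = {}
for i, stage in enumerate(stages):
    payloads[stage["function"]] = json.dumps({
        "plan": stage["operators"],
        "next": stages[i + 1]["function"] if i + 1 < len(stages) else None,
    })

print(payloads["scan-filter"])
```

Each JSON string would then be passed as the payload of one Lambda invocation during initialization.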
Step 1: Initialization
After part of the plan is deployed in the current Lambda function, the remaining part of the plan propagates downstream.
+-------------------------------------------------------------------------------+
| AWS |
|-------------------------------------------------------------------------------|
| |
| xxxxxxxxxx lambda |
| x xxx +--------+ |
| x Source x | | |
| xxxxxxxxx+xx lambda +-->| |------+ |
| x | +-------+ | +--------+ | |
| +----->| | | | |
| | |--| lambda v lambda |
| | | | +--------+ +-------+ |
| +-------+ |-->| | | |+--------+ |
| | | |-->| | | |
| | +--------+ +-------+ | |
| | lambda ^ v |
| | +--------+ | xxxxxxxxxxxxx |
| | | | | xxx x |
| +-->| |------+ xxx Sink x |
| +--------+ xxxxx xx |
| xxxxx |
| |
| |
+-------------------------------------------------------------------------------+
After the initialization phase, every Lambda function holds a part of the execution plan.
+-------------------------------------------------------------------------------+
| AWS |
|-------------------------------------------------------------------------------|
| |
| xxxxxxxxxx lambda |
| x xxx +--------+ |
| x Source x |partial | |
| xxxxxxxxx+xx lambda +-->|join |------+ |
| x | +-------+ | +--------+ | |
| +----->|scan | | | |
| |project|--| lambda v lambda |
| |filter | | +--------+ +-------+ |
| +-------+ |-->|partial | |sort |+--------+ |
| | |join |-->|limit | | |
| | +--------+ +-------+ | |
| | lambda ^ v |
| | +--------+ | xxxxxxxxxxxxx |
| | |partial | | xxx x |
| +-->|join |------+ xxx Sink x |
| +--------+ xxxxx xx |
| xxxxx |
| |
| |
+-------------------------------------------------------------------------------+
The initialization phase has two benefits: (1) it deploys the execution plan, and (2) it keeps the Lambda functions warm, avoiding cold starts.
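Inside one function, the downward propagation described above could be sketched as follows: keep the head of the plan locally and forward the rest. `invoke_downstream` is a stand-in for an actual Lambda invocation:

```python
import json

# Sketch of the initialization step inside a single function: keep the first
# plan fragment locally and propagate the remainder to the next function.
def init_handler(event, invoke_downstream):
    plan = event["plan"]      # list of per-stage plan fragments
    local_plan = plan[0]      # this function keeps the first fragment
    remainder = plan[1:]      # the rest propagates downstream
    if remainder:
        invoke_downstream(json.dumps({"plan": remainder}))
    return local_plan

sent = []  # collects what would be sent downstream
kept = init_handler(
    {"plan": [["scan"], ["partial_join"], ["sort", "limit"]]},
    sent.append,
)
print(kept, json.loads(sent[0]))
# ['scan'] {'plan': [['partial_join'], ['sort', 'limit']]}
```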
Step 2: Dataflow
The SQL time window generates a cron expression in CloudWatch [3], which triggers the workflow; all data is then read from the source and the distributed dataflow computation starts.
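A sketch of mapping a SQL time window to a CloudWatch schedule expression (CloudWatch also accepts rate expressions alongside cron; rates use a singular unit for 1, e.g. `rate(1 minute)`, and a plural unit otherwise, and are minute-granular at minimum):

```python
# Sketch: map a SQL window length in seconds to a CloudWatch Events
# rate expression, e.g. 60 -> "rate(1 minute)", 300 -> "rate(5 minutes)".
def window_to_rate(seconds: int) -> str:
    if seconds % 3600 == 0:
        n, unit = seconds // 3600, "hour"
    elif seconds % 60 == 0:
        n, unit = seconds // 60, "minute"
    else:
        raise ValueError("CloudWatch schedules are minute-granular at minimum")
    return f"rate({n} {unit}{'s' if n != 1 else ''})"

print(window_to_rate(60))   # rate(1 minute)
print(window_to_rate(300))  # rate(5 minutes)
```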
Continuous querying with SQL is a core requirement of real-time stream processing [1].
This project has great potential to make another contribution to simplifying the serverless programming model by directly recasting stream queries as dataflow models in which each operator or node in the graph is a cloud function managed by AWS Step Functions [3]. In this way, we can support cyclic dataflow graphs and iterations on streams.
Materialize does all of this by recasting SQL92 queries as dataflows. We can go further and directly compile SQL into FaaS dataflows. It is quite easy to parse SQL and support custom dialects [2].
References