Shannon-Data / ShannonBase

A MySQL HTAP Database, Open Source version of MySQL Heatwave, Powered by AI.
https://www.shannonbase.org
Other
14 stars 6 forks source link

feat(shannon): Architecture of ShannonBase #24

Closed ShannonBase closed 5 months ago

ShannonBase commented 1 year ago

Before the first

Shannonbase is an AI/ML empowered open source MySQL HTAP Database. which utilize AI/ML to enhance the AI/ML ability of Shannonbase.

With these employed features, shannonbase will has autonomous tuning, workloads predicting, auto index choice, index recommendation, selectivity estimation, etc. (ref: a phd dissertation from UC berkeley by Zongheng Yang and MySQL heatwave.

In order to support analytical processing in mysql, shannonbase incorporate a secondary engine, which is an in-memory column engine to process analytical workloads. The secondary engine is new feature, which is introduced in MySQL 8.x, it provides an interface to support multi-model, heterogeneous database. It synchronize the data from primary engine(InnoDB) to secondary engine(Rapid).

Now, our in-memory column sotre, also called Rapid.

It will be based on MySQL 8.1. It aims to achieve at least x times higher query performance on TPC-H than that of xxx, with a more concise architecture and a query optimizer that can intelligently offload query workloads to corresponding storage engines.

The design philosophy of ShannonBase Rapid is modularity and performance-cost balance. The following outlines the new features that will be implemented in ShannonBase. To learn details about each feature, see the relevant chapter.

ShannonBase Rapid will still be an open source project, which is a counterpart of close source service, MySQL Heatwave.

At first, an in-memory column store (IMCS) will be used. Secondly, a cost-based query engine will be developed to automatically offload transactional and analytics workloads. Thirdly, ShannonBase Rapid will provide a vectorized execution engine and support massive parallel processing. In this way, the execution performance of ShannonBase Rapid will be at least xxx times as that of xxx.

ShannonBase will load the data into memory from InnoDB to Rapid, just the same as MySQL Heatwave does.

’‘’MySQL Analytics is an in-memory processing engine, data is only persisted in MySQL InnoDB storage engine.‘’‘

This sentence functions as the basic rule and guideline for us when implementation ShannonBase Rapid. This design document introduces the main changes that will be achieved and gives you an overview of architecture of ShannonBase.

The main design goals of ShannonBase will include:

  1. Large Scale.
  2. Real Time.
  3. Highly Fresh Data Changes.
  4. Strong Data Consistency.
  5. Query Performance Capability.
  6. Single System Interface.
  7. Workload Isolation.

    1: Overview of ShannonBase Rapid

    ShannonBase is an integrated HTAP database that adopts hybrid row-column store and in-memory computing. It is fully compatible with MySQL version 8.1.

The architecture overview of ShannonBase image

In MySQL 8.0, it provides the secondary engine which can intelligently route the TP workloads to the primary engine(InnonDB) and routes the AP workloads to secondary engine(Rapid), all these operations are based on the workload type. image

2: Query Engine

After all new SQL syntaxes are enabled, the server will understand all the SQL statements. When the server receives an SQL statement, the SQL string will create some SQL classes, such as PT_create_table_stmt after lexical processing and grammatical processing. We will not discuss how distributed query plans in MPP are generated in this document. Instead, we focus on ONE NODE and try to explain what happens in ONE node when processing an SQL statement.

In MySQL 8.0, when the cost of a query plan on the primary engine is greater than the threshold defined by the new system variable (secondary_engine_cost_threshold), the query optimization engine will offload this workload to the secondary engine, ensuring optimal processing efficiency.

At the last phase of query optimization, The query engine will add optimize_secondary_engine to determine to which engine will the workload route for execution by performing the following three steps:

  1. Use the original processing way: unit->optimize().

  2. Estimate the cost spent by each engine to process the query: current_query_cost and accumulate_current_query_cost.

  3. If current_query_cost is greater than secondary_engine_cost_threshold, forward the workload to optimize_secondary_engine.

if (current_query_cost < variables.secondary_engine_cost_threshold) 
    return false; 
optimize_secondary_engine;

image

In future, after ShannonBase achieves MPP, the way that ShannonBase processes SQL statements will be different from centralized systems. A distributed query plan will be generated after query optimization is complete.

3: Execution Engine

As for the execution engine, a vectorized execution engine will be incorporated in ShannonBase. Shannonbase will support parallel query and vectorized execution. A column-based AP system is native to implement a vectorized execution engine. The vectorized engine seems as a standard feature to improve the performance of AP workloads. RDBMS systems such as ClickHouse also use vectorized execution engines.

Two ways are available to achieve vectorized execution as following:

  1. Use SIMD (single instruction, multiple data) to re-write execution plans.
  2. Multiple tuples will be fetched in an iteration, rather than a-tuple-an-iteration.
  3. Use GCC to generate vectorized code.

Some aggregation functions such as count(), sum(), and avg() can be executed in parallel mode. After a query plan is dispatched on a data node through the management node, the execution engine executes this query plan in parallel and the job is divided into sub-jobs and simultaneously executed through threads. The framework of parallel execution is discussed in issue #xxxx. You can refere to MySQL NDB cluster.

MySQL Cluster has a unique parallel query engine. It gives a consistent consolidated transactional view of the entire distributed partitioned dataset. This makes designing and programming scaleable distributed applications straightforward and extremely simple. The developer can completely focus on the application logic and doesn’t need to think about data distribution.

4: Rapid, A In-Memory Column Store Secondary Engine

Ref #2

5: Crash recovery

Rapid engine is an in-memory column engine, memory is volatile storage media, and the data in rapid will lose when engine goes crash. Therefore, crash safe is a key feature for any databases. Without crash safe, it cann't be used in production. And, for another reason, crash safe brings fast recovery ability due to all the data in rapid has been flushed to disk in compressed format. (The compressed format reduces the disk usage.). There are some scenarios:

7: Backup & Restore

8: Replication

9: Data Security

ShannonBase commented 5 months ago

ref to wiki, now close it.