Sunt-ing / database-system-readings

:yum: A curated reading list about database systems

The Origins of OLAP Databases and Where Are They Heading in 2022? #113

Closed: Sunt-ing closed this issue 2 years ago

Sunt-ing commented 2 years ago

https://medium.com/event-driven-utopia/the-origins-of-olap-databases-and-where-are-they-heading-in-2022-28baa6fef417

Sunt-ing commented 2 years ago

Existing analytical tools

Data warehouse (DWH) in the 90s

Major DWH vendors:

A DWH was a mandatory item in most digital corporations back in the 90s. But they were slow. That wasn't a significant problem, since they were used by only a handful of patient analysts in each organization.

The birth of data lakes

Since the data came in many shapes and formats, it was no longer practical to store it in traditional relational databases.

So a new class of reliable, scalable, and cheap storage systems, the data lake, was born, heavily influenced by HDFS. Having a schema was optional at the point of writing data into it; structure could be imposed later, at read time.
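A minimal sketch of that schema-on-read idea (the directory path, field names, and records below are invented for illustration): heterogeneous records land in the lake as raw JSON, and each consumer applies whatever schema it needs when reading.

```python
import json
from pathlib import Path

# Hypothetical "lake" directory; in practice this would be HDFS or S3.
lake = Path("/tmp/lake/events")
lake.mkdir(parents=True, exist_ok=True)

# No schema is enforced on write: records with different shapes are stored as-is.
records = [
    {"user": "alice", "action": "click", "ts": 1},
    {"user": "bob", "action": "purchase", "ts": 2, "amount": 9.99},  # extra field
]
(lake / "part-0000.json").write_text("\n".join(json.dumps(r) for r in records))

# Schema-on-read: the reader decides which fields it cares about.
def read_with_schema(path, fields):
    for line in path.read_text().splitlines():
        rec = json.loads(line)
        yield {f: rec.get(f) for f in fields}

print(list(read_with_schema(lake / "part-0000.json", ["user", "amount"])))
```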

AWS also launched S3 around this time, in 2006, and people quickly embraced it because, unlike HDFS, there was no infrastructure to manage.

From Hive, Drill, to Presto

Facebook open-sourced its analytics engine Presto, which made big waves in the ad-hoc data analytics space. Engines like Presto and Drill were instrumental in running federated SQL queries across multiple disparate data sources.
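To illustrate the federated-query idea: a Presto/Trino-style statement can address tables in different backends via `catalog.schema.table` naming, so one query can join data-lake tables with operational databases. The catalogs, schemas, table names, and connection details below are hypothetical, and the client call is only a hedged sketch assuming the `trino` Python package.

```python
# A hypothetical federated query: one statement joins a Hive table in the
# data lake with a MySQL table, because each source is exposed as a catalog.
FEDERATED_QUERY = """
SELECT u.country, count(*) AS page_views
FROM hive.web.page_view_logs AS v
JOIN mysql.crm.users AS u ON v.user_id = u.id
GROUP BY u.country
ORDER BY page_views DESC
"""

# Assuming the `trino` client library and a coordinator at localhost:8080,
# the query could be submitted roughly like this:
# import trino
# conn = trino.dbapi.connect(host="localhost", port=8080, user="analyst")
# cur = conn.cursor()
# cur.execute(FEDERATED_QUERY)
# rows = cur.fetchall()
```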

Redshift, BigQuery, and Snowflake

DWH vendors didn't want to lose their market share either. AWS, Google, and Snowflake brought cloud-native DWHs to the market, and they were a huge success.

What is the problem now?

Compared to the early days, the time to analyze a terabyte-scale data set has shrunk from months to days, and now to single-digit seconds. Ad-hoc analytical engines like Presto, and even modern DWHs, can respond to queries with latencies of a few seconds.

But modern applications demand latencies in the millisecond range.

Can pre-aggregations help? DWHs traditionally used pre-aggregations (built with ETL tools based on Spark, Hadoop, or Hive) and rollup cubes (such as Kylin) to speed up queries. But pre-cubing has a significant storage and compute footprint, and it delays data ingestion, hurting data freshness.
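A minimal sketch of the pre-aggregation trade-off (using pandas in place of a Spark/Hive ETL job; the table and column names are made up): the rollup is computed ahead of time so dashboard queries scan a tiny aggregate instead of the raw facts, but it must be rebuilt whenever new data arrives, which is the freshness cost mentioned above.

```python
import pandas as pd

# Raw fact table (in reality, billions of rows in the DWH or data lake).
facts = pd.DataFrame({
    "day":     ["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-02"],
    "country": ["US", "DE", "US", "US"],
    "revenue": [120.0, 80.0, 200.0, 50.0],
})

# ETL step: pre-aggregate (the "cube") on the most frequently queried dimensions.
rollup = facts.groupby(["day", "country"], as_index=False)["revenue"].sum()

# Query time: dashboards hit the small rollup instead of scanning raw facts.
print(rollup[rollup["country"] == "US"])

# The trade-off: every new batch of facts forces the rollup to be recomputed
# (or incrementally merged), which limits how fresh the served data can be.
```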

Real-time OLAP databases to save the world

The primary goal of Real-time OLAP databases is to:

Druid, Pinot, ClickHouse, and Rockset

These are the key real-time OLAP databases. They use a combination of intelligent indexing, segment placement, and query-pruning strategies to bring down query execution time.

The compute and storage layers of these databases are often tightly coupled. Ingested data is broken into smaller segments and laid out on disk in a columnar format. In operation, segments are memory-mapped, enabling highly performant query processing.

Moreover, tight coupling lets the query engine and the storage format evolve in tandem.
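A rough sketch of the segment layout and memory-mapping described above (the file layout, names, and time-range pruning below are invented for illustration and do not reproduce any specific engine's format):

```python
import numpy as np
from pathlib import Path

seg_dir = Path("/tmp/olap_segment_0")
seg_dir.mkdir(parents=True, exist_ok=True)

# Ingestion: a batch of rows is split by column and written as one file per
# column, forming a small columnar "segment".
timestamps = np.arange(1_000_000, dtype=np.int64)          # fake event times
values     = np.random.rand(1_000_000).astype(np.float32)  # fake metric
timestamps.tofile(seg_dir / "ts.bin")
values.tofile(seg_dir / "value.bin")

# Query time: the segment's columns are memory-mapped, so only the pages the
# query actually touches are pulled into memory.
ts_col  = np.memmap(seg_dir / "ts.bin", dtype=np.int64, mode="r")
val_col = np.memmap(seg_dir / "value.bin", dtype=np.float32, mode="r")

# Pruning within the segment: restrict the scan to the requested time range
# before touching the metric column at all.
lo, hi = np.searchsorted(ts_col, [250_000, 500_000])
print(val_col[lo:hi].sum())
```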

Where next?

There are other movements in this space:

No one-size-fits-all OLAP database exists: