Sunt-ing / database-system-readings

:yum: A curated reading list about database systems

The Origins of OLAP Databases and Where Are They Heading in 2022? #113

Closed: Sunt-ing closed this issue 2 years ago

Sunt-ing commented 2 years ago

https://medium.com/event-driven-utopia/the-origins-of-olap-databases-and-where-are-they-heading-in-2022-28baa6fef417

Sunt-ing commented 2 years ago

Existing analytical tools

Data warehouse (DWH) in the 90s

Major DWH vendors:

A DWH was a mandatory item in most digital corporations back in the 90s. But they were slow. That wasn't a significant problem, since they were used by only a handful of patient analysts in each organization.

The birth of data lakes

Since the data came in many shapes and formats, it was no longer practical to store it in traditional relational databases.

So a new class of reliable, scalable, and cheap storage systems, the data lake, was born, heavily influenced by HDFS. Having a schema was optional at the point of writing data into it; structure could be imposed later, at read time.
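A minimal sketch of that schema-on-read idea (the directory path, field names, and records below are invented for illustration): heterogeneous records land in the lake as raw JSON, and each consumer applies whatever schema it needs when reading.

```python
import json
from pathlib import Path

# Hypothetical "lake" directory; in practice this would be HDFS or S3.
lake = Path("/tmp/lake/events")
lake.mkdir(parents=True, exist_ok=True)

# No schema is enforced on write: records with different shapes are stored as-is.
records = [
    {"user": "alice", "action": "click", "ts": 1},
    {"user": "bob", "action": "purchase", "ts": 2, "amount": 9.99},  # extra field
]
(lake / "part-0000.json").write_text("\n".join(json.dumps(r) for r in records))

# Schema-on-read: the reader decides which fields it cares about.
def read_with_schema(path, fields):
    for line in path.read_text().splitlines():
        rec = json.loads(line)
        yield {f: rec.get(f) for f in fields}

print(list(read_with_schema(lake / "part-0000.json", ["user", "amount"])))
```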

AWS also launched S3 around this time, in 2006, and people quickly embraced it because, unlike HDFS, there was no infrastructure to manage.

From Hive, Drill, to Presto

Facebook open-sourced its analytics engine Presto, which made big waves in the ad-hoc data analytics space. Engines like Presto and Drill were instrumental in running federated SQL queries across multiple disparate data sources.
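To illustrate the federated-query idea: a Presto/Trino-style statement can address tables in different backends via `catalog.schema.table` naming, so one query can join data-lake tables with operational databases. The catalogs, schemas, table names, and connection details below are hypothetical, and the client call is only a hedged sketch assuming the `trino` Python package.

```python
# A hypothetical federated query: one statement joins a Hive table in the
# data lake with a MySQL table, because each source is exposed as a catalog.
FEDERATED_QUERY = """
SELECT u.country, count(*) AS page_views
FROM hive.web.page_view_logs AS v
JOIN mysql.crm.users AS u ON v.user_id = u.id
GROUP BY u.country
ORDER BY page_views DESC
"""

# Assuming the `trino` client library and a coordinator at localhost:8080,
# the query could be submitted roughly like this:
# import trino
# conn = trino.dbapi.connect(host="localhost", port=8080, user="analyst")
# cur = conn.cursor()
# cur.execute(FEDERATED_QUERY)
# rows = cur.fetchall()
```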

Redshift, BigQuery, and Snowflake

DWH vendors didn't want to lose their market share either. AWS, Google, and Snowflake brought cloud-native DWHs to the market, and they were a huge success.

What is the problem now?

Compared to the early days, the time to analyze a terabyte-scale data set has shrunk from months to days, and now to single-digit seconds. Ad-hoc analytical engines like Presto, and even modern DWHs, can respond to queries with latencies of a few seconds.

But modern applications demand latencies in the millisecond range.

Can pre-aggregations help? DWHs traditionally used pre-aggregations (built with ETL tools based on Spark, Hadoop, or Hive) and rollup cubes (such as Kylin) to speed up queries. But pre-cubing has a significant storage and compute footprint, and it delays data ingestion, hurting data freshness.
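A minimal sketch of the pre-aggregation trade-off (using pandas in place of a Spark/Hive ETL job; the table and column names are made up): the rollup is computed ahead of time so dashboard queries scan a tiny aggregate instead of the raw facts, but it must be rebuilt whenever new data arrives, which is the freshness cost mentioned above.

```python
import pandas as pd

# Raw fact table (in reality, billions of rows in the DWH or data lake).
facts = pd.DataFrame({
    "day":     ["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-02"],
    "country": ["US", "DE", "US", "US"],
    "revenue": [120.0, 80.0, 200.0, 50.0],
})

# ETL step: pre-aggregate (the "cube") on the most frequently queried dimensions.
rollup = facts.groupby(["day", "country"], as_index=False)["revenue"].sum()

# Query time: dashboards hit the small rollup instead of scanning raw facts.
print(rollup[rollup["country"] == "US"])

# The trade-off: every new batch of facts forces the rollup to be recomputed
# (or incrementally merged), which limits how fresh the served data can be.
```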

Real-time OLAP databases to save the world

The primary goal of Real-time OLAP databases is to:

Druid, Pinot, ClickHouse, and Rockset

These are the key real-time OLAP databases. They use a combination of intelligent indexing, segment placement, and query-pruning strategies to bring down query execution time.

The compute and storage layers of these databases are often tightly coupled. Ingested data is broken into smaller segments and laid out on disk in a columnar format. In operation, segments are memory-mapped, enabling highly performant query processing.

Moreover, tight coupling lets the query engine and the storage format evolve in tandem.
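A rough sketch of the segment layout and memory-mapping described above (the file layout, names, and time-range pruning below are invented for illustration and do not reproduce any specific engine's format):

```python
import numpy as np
from pathlib import Path

seg_dir = Path("/tmp/olap_segment_0")
seg_dir.mkdir(parents=True, exist_ok=True)

# Ingestion: a batch of rows is split by column and written as one file per
# column, forming a small columnar "segment".
timestamps = np.arange(1_000_000, dtype=np.int64)          # fake event times
values     = np.random.rand(1_000_000).astype(np.float32)  # fake metric
timestamps.tofile(seg_dir / "ts.bin")
values.tofile(seg_dir / "value.bin")

# Query time: the segment's columns are memory-mapped, so only the pages the
# query actually touches are pulled into memory.
ts_col  = np.memmap(seg_dir / "ts.bin", dtype=np.int64, mode="r")
val_col = np.memmap(seg_dir / "value.bin", dtype=np.float32, mode="r")

# Pruning within the segment: restrict the scan to the requested time range
# before touching the metric column at all.
lo, hi = np.searchsorted(ts_col, [250_000, 500_000])
print(val_col[lo:hi].sum())
```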

Where next?

There are other movements in this space:

No one-size-fits-all OLAP database exists: