Closed Sunt-ing closed 2 years ago
Major DWH vendors:
The DWH was a mandatory item in most digital corporations back in the 90s. But they were slow. It wasn’t a significant problem since they were used by a handful of patient analysts in the organizations.
Since the data had many variations, it was no longer possible to store them on traditional relational databases.
So, a new class of reliable, scalable, and cheap storage systems, called data lake, was born after being influenced by HDFS. Having a schema was optional at the point of writing data to it.
AWS also launched S3 by this time in 2006, and people quickly embraced it because there was no infrastructure to manage, unlike HDFS.
Facebook open-sourced their analytics engine Presto which made big waves in the ad-hoc data analytics space. Engines like Presto and Drill were instrumental in running federated SQL queries across multiple disparate data sources.
DWH vendors didn’t like to lose their market share as well. AWS, Google, and Snowflake launched their cloud-native versions of DWH to the market, and it was a huge success.
Compared to the early days, the time to analyze a terabyte-scale data set has been reduced to months to days to single-digit seconds. Ad-hoc analytical databases like Presto or even modern DWH can respond to queries with seconds’ latency.
But modern applications demand it to be in the millisecond’s range.
Can pre-aggregations help? DWH traditionally used pre-aggregations (using ETL tools based on Spark, Hadoop, or Hive) and rollup cubes (such as Kylin) to speed up the queries. But pre-cubing had a significant footprint on the storage and computational power, leading to delays in data ingestion, impacting the data freshness.
The primary goal of Real-time OLAP databases is to:
These are the key real-time OLAP databases. They use a combination of intelligent indexing and segment placement and query pruning strategies to bring down the query execution time.
The compute and storage layers of these databases are often tightly coupled. The ingested data will be broken into smaller segments and laid out in the disk in a columnar format. While in operations, segments are memory-mapped, enabling high performant query processing.
Moreover, the tightly-coupled storage enables the query engine to evolve with storage in tandem.
There are other movements in this space:
No one-size-fits-all type of OLAP database that exists:
https://medium.com/event-driven-utopia/the-origins-of-olap-databases-and-where-are-they-heading-in-2022-28baa6fef417