Link: http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf
Reason for recommending: "Lakehouse" is a popular term these days, and this paper is where the concept comes from.
I once thought the "lakehouse" concept was proposed by Alibaba Cloud, but it was not.
Key idea: use a cloud object store such as S3 to implement a data warehouse
Key features of Lakehouse:
Transaction support
Storage is decoupled from compute
High data recency
Support for diverse workloads
Use open storage formats such as Parquet
Provide an API to enable direct access from various applications, such as machine learning and Python/R libraries
Have first-class support for machine learning and data science
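To make "transaction support" on top of decoupled storage more concrete, here is a minimal sketch of one common approach: an ordered commit log stored next to the data files, with optimistic concurrency. It is a toy illustration, not the paper's or Delta Lake's actual implementation. The `ObjectStore` and `Table` classes, the `_log/` prefix, and the atomic `put_if_absent` operation are all assumptions made for this example.

```python
import json


class ObjectStore:
    """Toy in-memory stand-in for an object store such as S3 (hypothetical)."""

    def __init__(self):
        self._objects = {}

    def put_if_absent(self, key, value):
        # Assumed primitive: the write succeeds only if the key is unused.
        if key in self._objects:
            return False
        self._objects[key] = value
        return True

    def get(self, key):
        return self._objects[key]

    def list_keys(self, prefix):
        return sorted(k for k in self._objects if k.startswith(prefix))


class Table:
    """A table is a set of data files plus an ordered log of commits."""

    LOG_PREFIX = "_log/"  # hypothetical layout, chosen for this sketch

    def __init__(self, store):
        self.store = store

    def latest_version(self):
        # Versions start at 0; an empty table reports -1.
        return len(self.store.list_keys(self.LOG_PREFIX)) - 1

    def commit(self, read_version, added_files, removed_files=()):
        # Optimistic concurrency: the commit wins only if no other writer
        # has already created the log entry for the next version.
        entry = json.dumps({"add": list(added_files), "remove": list(removed_files)})
        key = f"{self.LOG_PREFIX}{read_version + 1:020d}.json"
        return self.store.put_if_absent(key, entry)

    def live_files(self):
        # Replay the log in order to find the files that make up the table.
        files = set()
        for key in self.store.list_keys(self.LOG_PREFIX):
            entry = json.loads(self.store.get(key))
            files |= set(entry["add"])
            files -= set(entry["remove"])
        return files


store = ObjectStore()
table = Table(store)
base = table.latest_version()                     # -1: the table is empty
first = table.commit(base, ["part-0.parquet"])    # wins the race
second = table.commit(base, ["part-1.parquet"])   # loses: same base version
print(first, second, sorted(table.live_files()))  # True False ['part-0.parquet']
```

The losing writer would re-read the latest version and retry. Readers only ever see files referenced by a committed log entry, which is what gives the table its all-or-nothing behavior in this sketch.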
Benefits for AI: Why use a lakehouse instead of a data lake for AI? A lakehouse gives you data versioning, governance, security and ACID properties that are needed even for unstructured data.
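As an illustration of why data versioning matters for AI workloads, the sketch below pins a training run to a fixed table version, so the same input files can be recovered later even after the table has changed. It is a simplified model that assumes a table is described by a list of commits, each adding or removing files. The `snapshot` function and the file names are hypothetical.

```python
def snapshot(log, version):
    """Return the set of live files after replaying commits 0..version."""
    if not 0 <= version < len(log):
        raise ValueError(f"unknown version: {version}")
    files = set()
    for entry in log[: version + 1]:
        files |= set(entry.get("add", ()))
        files -= set(entry.get("remove", ()))
    return files


# Hypothetical commit history of a table holding image data.
log = [
    {"add": ["images-000.parquet"]},
    {"add": ["images-001.parquet"]},
    {"add": ["images-002.parquet"], "remove": ["images-000.parquet"]},
]

training_version = 1  # recorded alongside the trained model
print(sorted(snapshot(log, training_version)))
# ['images-000.parquet', 'images-001.parquet']
print(sorted(snapshot(log, len(log) - 1)))
# ['images-001.parquet', 'images-002.parquet']
```

Recording only the version number with a model is enough to reproduce its training set in this model, which is hard to do on a plain data lake where files are overwritten in place.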
How can Lakehouse be faster when it uses an open data format rather than a proprietary one?
One of the main motivations for us introducing Delta Lake was to introduce additional capabilities that were difficult to do at the Parquet layer. Delta Lake brought additional indexing and statistics to Parquet.
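One way extra statistics can speed up queries over Parquet files is data skipping: if the minimum and maximum value of a column are known for each file, files that cannot match a filter are never read. The sketch below shows the idea only. The `FileStats` class, the `files_to_scan` function, and the sample values are assumptions for illustration, not Delta Lake's actual metadata format or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Per-file min/max statistics for one column (hypothetical layout)."""
    path: str
    min_value: int
    max_value: int


def files_to_scan(stats, lo, hi):
    """Keep only files whose [min, max] range overlaps the query range [lo, hi]."""
    return [s.path for s in stats if s.max_value >= lo and s.min_value <= hi]


stats = [
    FileStats("part-0.parquet", 1, 100),
    FileStats("part-1.parquet", 101, 200),
    FileStats("part-2.parquet", 201, 300),
]

# Query: WHERE value BETWEEN 150 AND 220
print(files_to_scan(stats, 150, 220))
# ['part-1.parquet', 'part-2.parquet']
```

Skipping works best when the data is sorted or clustered on the filtered column, because then each file covers a narrow value range and more files can be ruled out.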
For TPC-DS, querying data cached in a more optimized internal format is only 10% faster than querying cold data in S3. For these workloads, optimization opportunities come primarily from the ability to process the queries faster, instead of scanning more data faster.
How did Databricks make Lakehouse mature in such a short time?
Databricks has focused on SQL workloads for a long time.
The SaaS model has accelerated software development cycles.
Databricks answers some questions about Lakehouse in its blog post "Databricks Sets Official Data Warehousing Performance Record".
Reference: Blog post by Databricks: https://databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html