facebookincubator / velox

A C++ vectorized database acceleration library aimed at optimizing query engines and data processing systems.
https://velox-lib.io/
Apache License 2.0
3.41k stars 1.12k forks

Motivation for this project #1564

Closed (alexey-milovidov closed 2 years ago)

alexey-milovidov commented 2 years ago

I checked the description and the list of issues and pull requests of Velox and found that the problems this project is trying to solve are already solved better by more mature and full-featured engines like ClickHouse.

There are also some experimental and research projects like DuckDB and Apache DataFusion (not as mature as ClickHouse, but trying to keep up). And there are some abandoned projects similar to Velox, for example this one by Google: https://github.com/google/supersonic (abandoned more than 10 years ago). Or this database engine by Dropbox: https://github.com/cswinter/LocustDB. Or this one (also abandoned): https://github.com/tensorbase/tensorbase/

What is the real goal of Velox and how is it going to compete with others?

pedroerp commented 2 years ago

Hi @alexey-milovidov, thanks for reaching out! We understand that the documentation we have about Velox’s main motivations and current status is thin; we are working on improving it before we make a broader public announcement in the near future. We’ll also be sharing more details in our VLDB’22 paper, which will be available soon.

As we understand it from publicly available information, ClickHouse is a full-stack OLAP DBMS. DuckDB, as you mentioned, is portable and embeddable, but is also a full-stack DBMS implementation.

Velox is a C++ library that provides reusable, extensible, high-performance, engine-agnostic, and dialect-agnostic data processing components. Velox does not contain a language frontend or query optimizer; it is meant to be used to extend and accelerate other engines, not to provide a new one.

Because Velox is meant to be reused by engines focused on different workloads, it was built from the ground up in a generic, modular, and extensible way. Beyond analytics (which is ClickHouse and DuckDB’s niche), Velox is integrated today not only with Presto and Spark, but also with stream processing engines, database ingestion and message buses, monitoring, ML data wrangling and feature engineering platforms, and even transactional engines within Meta.

“What is the real goal of Velox and how is it going to compete with others?”

With that said, Velox has three main goals:

  1. Efficiency: We democratize optimizations previously implemented only in individual engines (such as ClickHouse, but also Presto, Spark, etc.) by bringing them into a highly curated library, which we believe provides best-in-class performance.

  2. Consistency: By leveraging the same library, engines can expose the exact same data types and scalar/aggregate function packages, and overall provide more consistent semantics to users across different engines. The observation here is that users usually interact with a variety of data systems, and dialect and semantic differences are a major engineering productivity issue.

  3. Reusability: All features and optimizations available in Velox are developed and maintained once, which reduces duplicated engineering effort. It also speeds up the development of specialized engines targeted at new types of workloads, promoting modularity in data management systems.

Hope this helps clarify the questions you raised. As usual, please don’t hesitate to reach out if you would like more information or want to contribute to the project!

mbasmanova commented 2 years ago

@alexey-milovidov Alexey, did Pedro's reply answer your questions? Do you have any follow-up questions?

alexey-milovidov commented 2 years ago

Thank you, no follow-up questions.