apache / doris

Apache Doris is an easy-to-use, high performance and unified analytics database.
https://doris.apache.org
Apache License 2.0

Apache Doris


Apache Doris is an MPP-based real-time data warehouse known for its high query speed. For queries on large datasets, it returns results with sub-second latency. It supports both high-concurrency point queries and high-throughput complex analysis. It can be used for report analysis, ad-hoc queries, unified data warehouse building, and data lake query acceleration. Based on Apache Doris, users can build applications for user behavior analysis, A/B testing platforms, log analysis, and e-commerce order analysis.

Please visit our official download page to get the latest release version.

The current stable version is the 2.0.x series, and the latest version is the 2.1.x series. For production use, the latest release of the 2.0.x series is recommended. For POC or testing, the latest release of the 2.1.x series is recommended.

👀 Have a look at the 🔗Official Website for a comprehensive list of Apache Doris's core features, blogs and user cases.

📈 Usage Scenarios

As shown in the figure below, after various data integration and processing, the data sources are usually stored in the real-time data warehouse Apache Doris and the offline data lake or data warehouse (in Apache Hive, Apache Iceberg or Apache Hudi).

Apache Doris is widely used in the following scenarios:

🖥️ Core Concepts

📂 Architecture of Apache Doris

The overall architecture of Apache Doris is shown in the following figure. The Doris architecture is very simple, with only two types of processes.

Both types of processes are horizontally scalable, and a single cluster can support up to hundreds of machines and tens of petabytes of storage capacity. Both also guarantee high availability of services and high reliability of data through consistency protocols. This highly integrated architecture greatly reduces the operation and maintenance cost of a distributed system.

The overall architecture of Apache Doris

In terms of interfaces, Apache Doris adopts the MySQL protocol, supports standard SQL, and is highly compatible with the MySQL dialect. Users can access Doris through various client tools, and it connects seamlessly with BI tools.
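
Because Doris speaks the MySQL protocol, an ordinary MySQL client library should be able to connect to it. The following is a minimal sketch, not an official client: it assumes the third-party PyMySQL package is installed, that the Frontend listens on query port 9030, and that the `root` user has an empty password. The helper names and the host in the usage comment are hypothetical.

```python
def doris_connection_params(host, port=9030, user="root", password="", database=None):
    """Build keyword arguments for a MySQL-protocol client (defaults are assumptions)."""
    params = {"host": host, "port": port, "user": user, "password": password}
    if database is not None:
        params["database"] = database
    return params


def run_query(sql, **params):
    """Run one statement and return all rows. Needs the third-party PyMySQL package."""
    import pymysql  # imported lazily so the helper above works without it

    conn = pymysql.connect(**params)
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()
    finally:
        conn.close()


# Usage against a running cluster ("doris-fe.example.com" is a placeholder host):
# rows = run_query("SHOW DATABASES", **doris_connection_params("doris-fe.example.com"))
```

The same connection parameters can be passed to any other MySQL-compatible client or BI tool.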

💾 Storage Engine

Doris uses a columnar storage engine, which encodes, compresses, and reads data by column. This enables a very high compression ratio and largely reduces irrelevant data scans, thus making more efficient use of IO and CPU resources. Doris supports various index structures to minimize data scans:
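
As a rough illustration of why a columnar layout compresses well and avoids irrelevant scans, the sketch below pivots rows into columns and run-length encodes each one. Run-length encoding is just one common columnar encoding chosen for the example; it is not a description of the actual Doris on-disk format, and all names and data are hypothetical.

```python
from itertools import groupby


def to_columns(rows):
    """Pivot a list of row dicts (all with the same keys) into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def rle_encode(values):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_sum(encoded):
    """Sum a run-length encoded numeric column without decoding it."""
    return sum(value * count for value, count in encoded)


rows = [
    {"city": "Beijing", "status": 200, "bytes": 10},
    {"city": "Beijing", "status": 200, "bytes": 10},
    {"city": "Beijing", "status": 404, "bytes": 10},
    {"city": "Shanghai", "status": 200, "bytes": 25},
]
columns = to_columns(rows)
city_runs = rle_encode(columns["city"])    # repeated values collapse into runs
bytes_runs = rle_encode(columns["bytes"])
total_bytes = rle_sum(bytes_runs)          # touches only the "bytes" column
```

An aggregate such as `SUM(bytes)` reads a single encoded column here, whereas a row layout would have to read every field of every row.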

💿 Storage Models

Doris supports a variety of storage models and has optimized them for different scenarios:

Doris also supports strongly consistent materialized views. Materialized views are automatically selected and updated, which greatly reduces maintenance costs for users.

🔍 Query Engine

Doris adopts the MPP model in its query engine to realize parallel execution between and within nodes. It also supports distributed shuffle join for multiple large tables so as to handle complex queries.
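
The idea behind a distributed shuffle join can be sketched in a few lines: both inputs are hash-partitioned on the join key so that matching rows land on the same node, and each node then joins its own partitions independently. This is a single-process simulation under simplifying assumptions (inner join, integer keys, in-memory lists); the function names are hypothetical and it does not reflect the actual Doris implementation.

```python
from collections import defaultdict


def shuffle(rows, key, num_nodes):
    """Hash-partition rows by join key so equal keys land on the same node."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        partitions[hash(row[key]) % num_nodes].append(row)
    return partitions


def local_hash_join(left, right, key):
    """Inner-join two co-located partitions with an in-memory hash table."""
    table = defaultdict(list)
    for row in right:
        table[row[key]].append(row)
    return [{**l, **r} for l in left for r in table.get(l[key], [])]


def shuffle_join(left, right, key, num_nodes=3):
    """Shuffle both inputs, then join each pair of partitions independently."""
    left_parts = shuffle(left, key, num_nodes)
    right_parts = shuffle(right, key, num_nodes)
    result = []
    for l_part, r_part in zip(left_parts, right_parts):  # one iteration per simulated node
        result.extend(local_hash_join(l_part, r_part, key))
    return result
```

Because no node ever needs the whole of either table, this strategy suits joins where both sides are too large to broadcast.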

The Doris query engine is vectorized, with all memory structures laid out in a columnar format. This can largely reduce virtual function calls, improve cache hit rates, and make efficient use of SIMD instructions. Doris delivers 5–10 times higher performance in wide table aggregation scenarios than non-vectorized engines.
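
One effect of vectorization, fewer operator calls, can be shown with a toy comparison. The sketch below counts how often an operator function is invoked when summing a column row by row versus batch by batch. Python cannot demonstrate cache or SIMD effects, so this only illustrates how batching amortizes call overhead; the class and function names and the batch size are assumptions for the example.

```python
class CallCounter:
    """Counts how many times an operator function is invoked."""

    def __init__(self):
        self.calls = 0


def sum_row_at_a_time(rows, column, counter):
    """Row-oriented execution: one operator call per row."""
    def next_value(row):
        counter.calls += 1
        return row[column]

    total = 0
    for row in rows:
        total += next_value(row)
    return total


def sum_vectorized(column_values, counter, batch_size=1024):
    """Vectorized execution: one operator call per batch of column values."""
    def next_batch(start):
        counter.calls += 1
        return column_values[start:start + batch_size]

    total = 0
    for start in range(0, len(column_values), batch_size):
        total += sum(next_batch(start))  # tight loop over one contiguous batch
    return total
```

For 4096 rows and a batch size of 1024, the row-oriented version makes 4096 operator calls while the vectorized version makes 4, with identical results.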

Apache Doris uses Adaptive Query Execution technology to dynamically adjust the execution plan based on runtime statistics. For example, it can generate a runtime filter, push it to the probe side, and automatically push it down to the Scan node at the bottom, which drastically reduces the amount of data in the probe and increases join performance. The runtime filter in Doris supports In, Min/Max, and Bloom filters.
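
The runtime filter mechanism described above can be sketched as follows: the join collects the keys seen on the build side, turns them into a predicate, and hands that predicate to the probe-side scan so non-matching rows are dropped before they reach the join. This is a simplified single-process model; the function names and the `in_list_limit` threshold are assumptions, and the Bloom filter variant is omitted.

```python
def build_runtime_filter(build_rows, key, in_list_limit=1024):
    """Derive a filter from the build side's join keys.

    Small key sets become an exact IN filter; larger ones fall back to a
    min/max range, which may let some non-matching rows through.
    """
    keys = {row[key] for row in build_rows}
    if not keys:
        return lambda value: False  # empty build side: no probe row can match
    if len(keys) <= in_list_limit:
        return lambda value: value in keys
    low, high = min(keys), max(keys)
    return lambda value: low <= value <= high


def scan(rows, key, runtime_filter=None):
    """Scan node: apply the pushed-down filter before rows reach the join."""
    for row in rows:
        if runtime_filter is None or runtime_filter(row[key]):
            yield row


def hash_join(build_rows, probe_rows, key):
    """Inner hash join that pushes a runtime filter into the probe-side scan."""
    table = {}
    for row in build_rows:
        table.setdefault(row[key], []).append(row)
    runtime_filter = build_runtime_filter(build_rows, key)
    scanned = list(scan(probe_rows, key, runtime_filter))
    joined = [{**p, **b} for p in scanned for b in table.get(p[key], [])]
    return joined, len(scanned)
```

With a two-row build side and a hundred-row probe side, only the two matching probe rows survive the scan, so the join itself does almost no wasted work.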

🚅 Query Optimizer

In terms of optimizers, Doris uses a combination of CBO (cost-based optimization) and RBO (rule-based optimization). RBO supports constant folding, subquery rewriting, and predicate pushdown, while CBO supports Join Reorder. The Doris CBO is under continuous optimization for more accurate statistical information collection and derivation, and more accurate cost model prediction.
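
Constant folding, one of the rule-based rewrites mentioned above, replaces any subexpression made entirely of literals with its computed value before execution. The sketch below applies it to a tiny tuple-based expression tree; the tree encoding and the `fold_constants` name are hypothetical and only illustrate the rule, not the Doris optimizer's internal representation.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul,
       ">": operator.gt, "<": operator.lt, "=": operator.eq}


def fold_constants(expr):
    """Recursively replace operator nodes whose operands are all literals.

    Expressions are tuples: (op, left, right), ("col", name) or ("lit", value).
    """
    if expr[0] in ("col", "lit"):
        return expr
    op, left, right = expr
    left, right = fold_constants(left), fold_constants(right)
    if left[0] == "lit" and right[0] == "lit":
        return ("lit", OPS[op](left[1], right[1]))
    return (op, left, right)


# WHERE price > 10 * (2 + 3)
predicate = (">", ("col", "price"), ("*", ("lit", 10), ("+", ("lit", 2), ("lit", 3))))
folded = fold_constants(predicate)
```

After folding, the predicate compares `price` against a single literal, so the arithmetic is done once at planning time rather than once per row.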

Technical Overview: 🔗Introduction to Apache Doris

🎆 Why choose Apache Doris?

🙌 Contributors

Apache Doris graduated from the Apache Incubator and became a Top-Level Project in June 2022.

Currently, the Apache Doris community has gathered more than 600 contributors from over 200 companies in different industries, and the number of monthly active contributors exceeds 100.

Monthly Active Contributors

Contributor over time

We deeply appreciate 🔗community contributors for their contribution to Apache Doris.

👨‍👩‍👧‍👦 Users

Apache Doris now has a wide user base in China and around the world, and as of today, Apache Doris is used in production environments in thousands of companies worldwide. More than 80% of the top 50 Internet companies in China in terms of market capitalization or valuation have been using Apache Doris for a long time, including Baidu, Meituan, Xiaomi, Jingdong, Bytedance, Tencent, NetEase, Kwai, Sina, 360, Mihoyo, and Ke Holdings. It is also widely used in some traditional industries such as finance, energy, manufacturing, and telecommunications.

The users of Apache Doris: 🔗Users

Add your company logo at Apache Doris Website: 🔗Add Your Company

👣 Get Started

📚 Docs

All Documentation 🔗Docs

⬇️ Download

All releases and binary versions 🔗Download

🗄️ Compile

See how to compile 🔗Compilation

📮 Install

See how to install and deploy 🔗Installation and deployment

🧩 Components

📝 Doris Connector

Through its connectors, Doris supports Spark and Flink in both reading data stored in Doris and writing data to Doris.

🔗apache/doris-flink-connector

🔗apache/doris-spark-connector

🌈 Community and Support

📤 Subscribe Mailing Lists

Mailing lists are the most recognized form of communication in the Apache community. See how to 🔗Subscribe Mailing Lists

🙋 Report Issues or Submit Pull Request

If you run into any problems, feel free to file a 🔗GitHub Issue or post in 🔗GitHub Discussion, and fix them by submitting a 🔗Pull Request

🍻 How to Contribute

We welcome your suggestions, comments (including criticisms), and contributions. See 🔗How to Contribute and 🔗Code Submission Guide

⌨️ Doris Improvement Proposals (DSIP)

🔗Doris Improvement Proposal (DSIP) can be thought of as a collection of design documents for all major feature updates or improvements.

🔑 Backend C++ Coding Specification

🔗Backend C++ Coding Specification should be strictly followed, which will help us achieve better code quality.

💬 Contact Us

Contact us through the following mailing list.

Name Scope
dev@doris.apache.org Development-related discussions Subscribe Unsubscribe Archives

🧰 Links

📜 License

Apache License, Version 2.0

Note: Some licenses of the third-party dependencies are not compatible with the Apache 2.0 License, so you need to disable some Doris features to comply with the Apache 2.0 License. For details, refer to thirdparty/LICENSE.txt