apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

Centralize datasource schema management in Coordinator #14989

Open findingrish opened 12 months ago

findingrish commented 12 months ago

Motivation

Original proposal: https://github.com/abhishekagarwal87/druid/blob/metadata_design_proposal/design-proposal.md#3-proposed-solution-storing-segment-schema-in-metadata-store

In summary, the current approach to constructing table schemas, in which brokers query data nodes and tasks for segment schemas, has several limitations and operational challenges. These include slow broker startup, excessive communication in the system, schema rebuilding on each broker startup, and the lack of a single schema owner. It also has functional limitations, such as the inability to query data from deep storage.

The proposed solution is to centralize schema management within the coordinator. Tasks publish their segment schemas to the metadata database, along with segment row count information. The coordinator can then build the table schema by combining the individual segment schemas within the datasource.
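To illustrate the last step, here is a minimal sketch of how per-segment schemas might be combined into one table schema. The names `SegmentSchema`, `TableSchema`, and `build_table_schema` are hypothetical and not taken from the Druid codebase. The merge rule (the type from the later segment wins on conflict, row counts are summed) is an assumption made for the example; the actual rules are defined in the individual PRs.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class SegmentSchema:
    """Hypothetical record of what a task publishes for one segment."""
    segment_id: str
    columns: Dict[str, str]  # column name -> type name
    num_rows: int


@dataclass
class TableSchema:
    """Hypothetical datasource-level schema built by the coordinator."""
    columns: Dict[str, str]
    num_rows: int


def build_table_schema(segments: Iterable[SegmentSchema]) -> TableSchema:
    """Merge segment schemas in iteration order.

    Assumption: segments are ordered oldest to newest, and on a type
    conflict the type from the later segment replaces the earlier one.
    """
    columns: Dict[str, str] = {}
    total_rows = 0
    for segment in segments:
        for name, type_name in segment.columns.items():
            columns[name] = type_name
        total_rows += segment.num_rows
    return TableSchema(columns=columns, num_rows=total_rows)


segments = [
    SegmentSchema("seg_a", {"__time": "LONG", "page": "STRING"}, 100),
    SegmentSchema("seg_b", {"__time": "LONG", "page": "STRING", "added": "LONG"}, 50),
]
table = build_table_schema(segments)
```

In this sketch, a column that appears in only some segments (`added` above) still becomes part of the table schema, which matches the idea of a datasource schema as the union of its segment schemas.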

Design

Changes are required in tasks, the coordinator, and the broker. The detailed design is covered in the individual PRs.

Phases

The first phase is to move existing schema building functionality from the brokers to the coordinator and allow the broker to query schema from the coordinator, while retaining the capability to build table schema if the need arises.
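The broker-side behaviour in this phase can be pictured as "ask the coordinator first, build locally if that fails". The sketch below is only an illustration of that fallback: `resolve_table_schema` and both callables are hypothetical names, and treating a `ConnectionError` or a missing schema as the trigger for the fallback is an assumption, not the behaviour of the actual implementation.

```python
from typing import Callable, Dict, Optional

Schema = Dict[str, str]  # column name -> type name


def resolve_table_schema(
    datasource: str,
    fetch_from_coordinator: Callable[[str], Optional[Schema]],
    build_locally: Callable[[str], Schema],
) -> Schema:
    """Prefer the coordinator's schema; fall back to building it on the broker.

    Assumption: the coordinator call returns None when it has no schema for
    the datasource and raises ConnectionError when it is unreachable.
    """
    try:
        schema = fetch_from_coordinator(datasource)
    except ConnectionError:
        schema = None
    if schema is not None:
        return schema
    # Fallback path: the broker retains the ability to build the schema
    # itself, for example by querying data nodes and tasks.
    return build_locally(datasource)


def coordinator_down(datasource: str) -> Optional[Schema]:
    raise ConnectionError("coordinator unreachable")


def coordinator_up(datasource: str) -> Optional[Schema]:
    return {"__time": "LONG", "page": "STRING"}


def local_build(datasource: str) -> Schema:
    return {"__time": "LONG"}


from_coordinator = resolve_table_schema("wikipedia", coordinator_up, local_build)
from_fallback = resolve_table_schema("wikipedia", coordinator_down, local_build)
```

Passing the two strategies in as callables keeps the fallback decision in one place and makes it easy to exercise both paths without a running cluster.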

The next step is to have the coordinator publish segment schema in the background to reduce the volume of segment metadata queries during coordinator startup.

In parallel, tasks should be updated to publish their schema in the database, eventually eliminating the need to query segment schema directly from data nodes and tasks.

Changes are also required to fetch and publish schema for cold tier segments. This can be done in the Coordinator.

Future work involves serving system table queries from the Coordinator.

findingrish commented 12 months ago

PR to move schema building functionality to the coordinator: https://github.com/apache/druid/pull/14985