Motivation
Original proposal: https://github.com/abhishekagarwal87/druid/blob/metadata_design_proposal/design-proposal.md#3-proposed-solution-storing-segment-schema-in-metadata-store
In summary, the current approach of constructing table schemas, in which brokers query data nodes and tasks for segment schemas, has several limitations and operational challenges. These include slow broker startup, excessive communication in the system, schema rebuilding on broker startup, and the lack of a unified schema owner. It also has functional limitations, such as the inability to query data from deep storage.
The proposed solution is to centralize schema management within the coordinator. Tasks publish their schemas in the metadata database, along with segment row counts. The coordinator can then build the table schema by combining the individual segment schemas within each datasource.
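As a rough illustration of how individual segment schemas could be combined into a table schema, here is a minimal Python sketch. It is not the coordinator's actual implementation. The `SegmentSchema` class, the `build_table_schema` function, and the merge rule (the column type from the newest segment version wins) are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class SegmentSchema:
    """Hypothetical record of what a task might publish for one segment."""
    segment_id: str
    version: str    # assumed: a lexicographically larger version is newer
    columns: dict   # column name -> column type
    num_rows: int


def build_table_schema(segments):
    """Merge per-segment schemas into one table schema for a datasource.

    Assumed merge rule: when segments disagree on a column's type,
    the type from the newest segment version wins.
    Returns (merged columns, total row count).
    """
    merged = {}
    total_rows = 0
    # Visit oldest first so that newer segments overwrite older types.
    for segment in sorted(segments, key=lambda s: s.version):
        total_rows += segment.num_rows
        for name, column_type in segment.columns.items():
            merged[name] = column_type
    return merged, total_rows
```

With this rule, a column that exists only in older segments still appears in the table schema, and a column whose type changed takes the newer type.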
Design
Changes are required in tasks, the coordinator, and the broker. The detailed design is described in the individual PRs.
Phases
The first phase moves the existing schema-building functionality from the brokers to the coordinator and lets the broker query the schema from the coordinator, while retaining the broker's ability to build the table schema itself if the need arises.
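The broker-side behaviour in this first phase can be pictured as a fetch with a local fallback. The following Python sketch shows only the control flow. `resolve_table_schema`, `fetch_from_coordinator`, and `build_locally` are hypothetical names, and the conditions that trigger the fallback (a connection error, or no schema returned) are assumptions.

```python
def resolve_table_schema(datasource, fetch_from_coordinator, build_locally):
    """Return the table schema for a datasource.

    Prefer the coordinator's copy. Fall back to building the schema
    locally (the pre-existing broker behaviour) when the coordinator
    is unreachable or has no schema for this datasource yet.
    """
    try:
        schema = fetch_from_coordinator(datasource)
    except OSError:
        # Assumed failure mode: coordinator unreachable or request failed.
        schema = None
    if schema is None:
        schema = build_locally(datasource)
    return schema
```

Passing the two strategies in as callables keeps the fallback decision in one place and makes it easy to exercise without a running cluster.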
The next step is to have the coordinator publish segment schemas in the background, to reduce the volume of segment metadata queries during coordinator startup.
In parallel, tasks should be updated to publish their schemas in the database, eventually eliminating the need to query segment schemas directly from data nodes and tasks.
Changes are also required to fetch and publish schemas for cold-tier segments. This can be done in the coordinator.
Future work involves serving system table queries from the coordinator.