Closed 2 months ago
Is the data on different data centers disjoint? can you take us through how a query will be executed in such a world? Also, how would an average aggregation work assuming the key you are grouping on exists in multiple datacentres?
@abhishekagarwal87
These are great questions.
Continuing with the Druid cluster deployment architecture described earlier, let's walk through how a query is executed.
A user in the Singapore Druid console runs a SQL query that computes an average:
SELECT AVG(delta), SUM(delta)/COUNT(*) FROM wikipedia
When the SQL statement reaches the broker process in Singapore, the broker converts the SQL into a native query. During the conversion, it rewrites AVG into a SUM and a COUNT.
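The rewritten native query might look roughly like the following. This is a hedged sketch, not the exact output of Druid's SQL planner: the aggregator names and the interval are illustrative. The shape follows Druid's native query format, with a doubleSum and a count aggregator and the average recovered as an arithmetic post-aggregation:

```json
{
  "queryType": "timeseries",
  "dataSource": "wikipedia",
  "granularity": "all",
  "intervals": ["2015-09-12/2015-09-13"],
  "aggregations": [
    { "type": "doubleSum", "name": "sum_delta", "fieldName": "delta" },
    { "type": "count", "name": "count" }
  ],
  "postAggregations": [
    {
      "type": "arithmetic",
      "name": "avg_delta",
      "fn": "/",
      "fields": [
        { "type": "fieldAccess", "fieldName": "sum_delta" },
        { "type": "fieldAccess", "fieldName": "count" }
      ]
    }
  ]
}
```

Because only SUM and COUNT travel over the wire, partial results from different clusters can be merged before the division is applied.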
Then it distributes the native query to all Historical nodes in Singapore. At the same time, it checks the query context for a federatedClusterBrokers property. If present, it also distributes the native query to every Druid cluster participating in the federated query, sending it to each federated cluster's entry broker, which in turn dispatches it to that cluster's Historical nodes. Each federated cluster's broker aggregates its own cluster's data and returns the partial result to the Singapore broker, which merges all the results and responds to the user. In short, the Singapore broker treats each entry in federatedClusterBrokers much like another Historical node.
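The scatter-gather flow above can be sketched as follows. This is a hypothetical illustration, not actual Druid code: function and field names are made up, and each "node" (a local Historical or a federated broker) is modeled as a callable that answers a native query with a partial (sum, count) pair.

```python
# Hypothetical sketch of the scatter-gather flow; names are illustrative,
# not Druid APIs. Each node answers a query with a partial (sum, count).

def execute_federated(native_query, context, local_historicals):
    """Fan the query out to local Historicals plus any federated brokers,
    then merge the partial aggregates into a final result."""
    targets = list(local_historicals)
    # Federated cluster brokers are treated much like extra data nodes:
    targets += context.get("federatedClusterBrokers", [])
    partials = [node(native_query) for node in targets]
    total_sum = sum(s for s, _ in partials)
    total_count = sum(c for _, c in partials)
    return {"sum_delta": total_sum, "count": total_count,
            "avg_delta": total_sum / total_count}

# Example: two Singapore Historicals and one remote federated broker.
if __name__ == "__main__":
    sg_nodes = [lambda q: (10.0, 2), lambda q: (30.0, 3)]
    ctx = {"federatedClusterBrokers": [lambda q: (20.0, 5)]}
    print(execute_federated({"queryType": "timeseries"}, ctx, sg_nodes))
    # avg_delta = (10 + 30 + 20) / (2 + 3 + 5) = 6.0
```

The key point is that the coordinating broker never needs raw rows from the remote clusters, only their mergeable partial aggregates.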
In general, whether the data in different data centers is disjoint does not matter: each cluster returns partial aggregates that the coordinating broker merges before finalizing, so aggregation functions such as AVG, COUNT(DISTINCT x), and others can be handled correctly even when the same grouping key appears in multiple data centers.
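To see why overlapping grouping keys are safe, consider a grouped average. This is an illustrative sketch with made-up data, not Druid's actual merge code: each cluster returns a partial (sum, count) per grouping key, and the coordinating broker adds the partials per key before dividing.

```python
# Illustrative merge of per-cluster partial aggregates keyed by the
# GROUP BY value. Data values are made up for the example.
from collections import defaultdict

def merge_group_by(partial_results):
    """Merge partial (sum, count) pairs per key, then compute averages."""
    merged = defaultdict(lambda: [0.0, 0])
    for cluster_result in partial_results:
        for key, (s, c) in cluster_result.items():
            merged[key][0] += s
            merged[key][1] += c
    return {key: {"sum": s, "count": c, "avg": s / c}
            for key, (s, c) in merged.items()}

# The key "en" exists in both data centers; the merged average is exact.
singapore = {"en": (100.0, 4), "jp": (9.0, 3)}
virginia  = {"en": (20.0, 2)}
print(merge_group_by([singapore, virginia]))
# "en": sum 120.0, count 6, avg 20.0
```

Note this works because SUM and COUNT are commutative, mergeable aggregates; averaging the per-cluster averages (115/6 vs. the naive (25.0 + 10.0) / 2) would be wrong whenever the clusters hold different row counts for the same key.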
This is interesting, but it is hacky even if it seems to work. Having brokers fetch data from other brokers as if they were data nodes breaks certain assumptions that are baked into the broker-historical protocol.
Yes, there are some design issues. I considered these changes to be quite minimal, and for our own usage, the benefits outweighed the drawbacks, so we directly reused the forwarding and aggregation capabilities of the broker. Is there a better way to support federated queries?
This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.
This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.
Motivation
Data centers may be located in different regions and provided by different cloud providers. Usually, dedicated network connections interconnect these data centers' networks to facilitate data transmission. However, transferring data between data centers incurs expensive bandwidth costs and significant transmission delays, so data transfer between data centers should be minimized. Rather than centralizing all regions' data in a single data center, we prefer to report data to the nearest data center.

However, this approach poses certain challenges. With a large number of data centers, many independent Druid clusters must be maintained, and because these clusters operate independently, aggregated queries across them can be inaccurate. Supporting federated clusters in Druid would allow substantial savings in bandwidth costs.
Description
Here is one possible deployment architecture:
To enable federated queries, the "federatedClusterBrokers" property should be added to the context. Here is an example query:
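The original example did not survive extraction. As a hedged sketch only: federatedClusterBrokers is this proposal's custom context property, not a standard Druid setting, and the value's exact shape (array vs. comma-separated string) and the broker addresses below are assumptions. Using the standard Druid SQL HTTP API request body, it might look like:

```json
{
  "query": "SELECT AVG(delta), SUM(delta)/COUNT(*) FROM wikipedia",
  "context": {
    "federatedClusterBrokers": [
      "virginia-broker.example.com:8082",
      "frankfurt-broker.example.com:8082"
    ]
  }
}
```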
The above is the current approach I am using to implement a federated cluster with Druid. Are there any better ways to support federated queries in Druid clusters?