opt: make the optimizer "distribution aware"

rytaft commented 4 years ago

CockroachDB is a geo-distributed database, but the optimizer currently operates with the assumption that all data is located on the same physical machine. There are a couple of exceptions to this rule, such as the deprecated feature called "duplicated indexes". To use this feature, a user can create two (or more) indexes that have exactly the same columns but are pinned to different localities. For example, a user might choose to pin one index to a data center on the East Coast of the US, and another identical index to a data center on the West Coast. In this case, all other things being equal, the optimizer will choose whichever index is closest to the gateway node of the SQL query.
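To make "closest to the gateway node" concrete, here is a minimal Go sketch. It assumes that localities are ordered key=value tiers and that closeness is the length of the shared tier prefix. The names `Tier`, `Index`, `matchingPrefix`, and `closestIndex` are hypothetical and are not the optimizer's actual API.

```go
package main

// Tier is one key=value level of a locality, e.g. region=us-east1.
type Tier struct{ Key, Value string }

// Index is a hypothetical stand-in for an index pinned to a locality.
type Index struct {
	Name     string
	Locality []Tier
}

// matchingPrefix counts how many leading tiers of a and b are identical.
func matchingPrefix(a, b []Tier) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

// closestIndex returns the candidate whose locality shares the longest tier
// prefix with the gateway locality. Ties keep the earlier candidate. The
// boolean is false when there are no candidates.
func closestIndex(gateway []Tier, candidates []Index) (Index, bool) {
	best, bestScore, found := Index{}, -1, false
	for _, idx := range candidates {
		if s := matchingPrefix(gateway, idx.Locality); s > bestScore {
			best, bestScore, found = idx, s, true
		}
	}
	return best, found
}
```

With a gateway in `region=us-west1` and two otherwise identical indexes pinned to `region=us-east1` and `region=us-west1`, this sketch picks the West Coast index. It only breaks ties between identical indexes, which is exactly the limitation described next.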

There are lots of other cases, however, where the optimizer does not take data locality into account when it should. For example, the optimizer does not consider locality when deciding between two non-identical indexes (i.e., indexes with different columns). This can lead to suboptimal plans in many cases. The purpose of this issue is to track the work needed to make the optimizer aware of data distribution when planning all queries, not just those that use the duplicated indexes feature.

A few of the features we plan to add are:

[x] Support a "distribution" physical property in the optimizer, which describes the localities (e.g., regions) that are touched by a given relational operator.
[x] Support a "distribute" operator in the optimizer, which enforces a particular distribution by routing data from one set of localities to another set. This is similar to how the sort operator enforces a particular ordering. In the execution engine, "distribute" can either represent DistSQL routers or the lookups performed by the DistSender. If data is spread across localities, a "distribute" operator should be used to enforce that all results are ultimately returned to the gateway node.
[ ] Update the optimizer cost model to give a realistic cost to the "distribute" operator. This should take into account the different localities in the input and output distributions and the latency between them.
[ ] Since lookup and index joins rely on the DistSender to fetch data from remote nodes, update the cost model for lookup joins and index joins to include the distribution cost. This will require identifying which localities will be visited by the DistSender when fetching data and using a cost model similar to the "distribute" operator.
[ ] Since distributed hash and merge joins require data shuffling, update the cost model for those operators to include the distribution cost. To prevent the optimizer from always choosing a non-distributed join, account for the benefits of parallel computation in the cost of distributed joins.
[ ] Similarly, update the cost model for distributed hash group by and distinct operators to include the distribution cost. To prevent the optimizer from always choosing a non-distributed operation, account for the benefits of parallel computation.
[ ] Add transformation rules in the optimizer to support different partitioning schemes for distributed joins. For example, the optimizer should explore broadcast joins in addition to hash-partitioned joins (see https://github.com/cockroachdb/cockroach/issues/84731). The optimizer will be able to choose the best partitioning scheme by using the distributed cost model supported by the above steps.
[ ] Add more region-aware transformations in the optimizer that can leverage the distributed cost model. We already have transformations to support locality optimized search, but there are others we could add. For example, if a distributed join is spread across multiple regions and the partitioning column (e.g., crdb_region) is one of the join keys, it may be possible to hash-partition data within a region and perform the join phase before transferring any data across regions. In other words, the optimizer could transform a single join across regions into a union of joins, where each child of the union is a join within a single region. Another possible transformation, which turns a scan of a REGIONAL BY ROW table into a pk-fk join with a GLOBAL table, is described in https://github.com/cockroachdb/cockroach/issues/69617#issuecomment-916184772.

This is not an exhaustive list, but gives a sense of the scope of work.
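As a rough illustration of the first three tasks, the Go sketch below models a distribution as a set of regions and adds a "distribute" enforcer when the provided distribution differs from the required one, much like a sort enforcer for orderings. All names (`Distribution`, `Plan`, `LatencyFn`, `distributeCost`, `enforce`) are hypothetical, and the cost formula (worst cross-region latency times rows moved) is an assumption for illustration, not the cost model used in the optimizer.

```go
package main

// Distribution is the set of regions touched by an operator (hypothetical model).
type Distribution map[string]bool

// NewDistribution builds a Distribution from a list of region names.
func NewDistribution(regions ...string) Distribution {
	d := Distribution{}
	for _, r := range regions {
		d[r] = true
	}
	return d
}

// Equals reports whether both distributions contain the same regions.
func (d Distribution) Equals(o Distribution) bool {
	if len(d) != len(o) {
		return false
	}
	for r := range d {
		if !o[r] {
			return false
		}
	}
	return true
}

// Plan is a hypothetical plan node with a provided distribution and a cost.
type Plan struct {
	Op       string
	Provided Distribution
	Cost     float64
	Input    *Plan
}

// LatencyFn returns an assumed latency-based cost between two regions.
type LatencyFn func(from, to string) float64

// distributeCost is an assumed formula: the worst latency between any source
// region and any different target region, scaled by the number of rows moved.
func distributeCost(from, to Distribution, rows float64, lat LatencyFn) float64 {
	worst := 0.0
	for src := range from {
		for dst := range to {
			if src != dst {
				if l := lat(src, dst); l > worst {
					worst = l
				}
			}
		}
	}
	return worst * rows
}

// enforce wraps p in a "distribute" node when p does not already provide the
// required distribution; otherwise it returns p unchanged.
func enforce(p *Plan, required Distribution, rows float64, lat LatencyFn) *Plan {
	if p.Provided.Equals(required) {
		return p
	}
	return &Plan{
		Op:       "distribute",
		Provided: required,
		Cost:     p.Cost + distributeCost(p.Provided, required, rows, lat),
		Input:    p,
	}
}
```

In this sketch, a scan that touches two regions but must return results to a single gateway region gets a "distribute" parent whose cost grows with inter-region latency, so a plan that stays within the gateway region can win on cost. Accounting for the benefits of parallel computation, as the join and group-by tasks require, is left out here.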

gz#9256

Jira issue: CRDB-5025

rytaft commented 2 years ago

I just updated the issue description above to more closely represent the new reality as of July 2022. The first two tasks were completed in https://github.com/cockroachdb/cockroach/pull/74349.

rytaft commented 2 years ago

FYI @msirek, this issue gives an overview of the scope of work we were discussing yesterday

msirek commented 1 year ago

Notes: Regarding "take into account the different localities in the input and output distributions and the latency between them":

If we measure latencies between regions at every cluster startup, we might get query plans which change too frequently. Some choices are:

Measure latencies upon the first cluster startup and store them in a system table. When a node in a new region joins a cluster, detect that, and compute latencies between that region and all other regions.
Don't use precise latencies, but ranges of latencies. If latencies need to be remeasured, we don't want small differences between two measurements to cause query plan changes. For example, ranges could be 0-5ms, 5-15ms, 15-60ms, 60-195ms, 195-600ms, and above 600ms, with the cost of each range being triple that of the previous range.
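The second option could be sketched in Go as follows. The range boundaries and the tripling rule come from the example above. The base cost of 1.0, the choice to treat each upper bound as inclusive, and the names `latencyBucket` and `bucketCost` are assumptions made for illustration.

```go
package main

// bucketUpperBoundsMs holds the upper bounds of the example ranges:
// 0-5ms, 5-15ms, 15-60ms, 60-195ms, 195-600ms. Anything larger falls
// into the final "above 600ms" range.
var bucketUpperBoundsMs = []float64{5, 15, 60, 195, 600}

// baseBucketCost is an assumed cost for the lowest latency range.
const baseBucketCost = 1.0

// latencyBucket returns the index of the range containing ms. A value that
// sits exactly on a boundary is assigned to the lower range (an assumption).
func latencyBucket(ms float64) int {
	for i, upper := range bucketUpperBoundsMs {
		if ms <= upper {
			return i
		}
	}
	return len(bucketUpperBoundsMs)
}

// bucketCost returns baseBucketCost tripled once per range below the one
// containing ms, so every latency in the same range costs the same.
func bucketCost(ms float64) float64 {
	cost := baseBucketCost
	for i := 0; i < latencyBucket(ms); i++ {
		cost *= 3
	}
	return cost
}
```

Under these assumptions, measurements of 70ms and 80ms both fall in the 60-195ms range and get the same cost, so remeasuring would not change the chosen plan unless a latency crosses a range boundary.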

michae2 commented 3 months ago

I wonder if #75178 should fall under the umbrella of this issue? It seems difficult for the physical planner to make the decision about whether to distribute scans without some kind of costing.

opt: make the optimizer "distribution aware" #47226