cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
29.84k stars 3.77k forks source link

[BUG] Asynchronous parallel lookup not working in REGIONAL BY ROW tables #110188

Open belmeopmenieuwesim opened 12 months ago

belmeopmenieuwesim commented 12 months ago

Describe the problem

This bug only applies to REGIONAL BY ROW. It works fine on REGIONAL BY TABLE locality.

My team is evaluatating CockroachDB and while benchmarking CockroachDB v23.1.8 in a multi region deployment (eu-west, us-west and asia) I noticed that certain SELECT queries against a REGIONAL BY ROW table are not always running in parallel when the query spans multiple regions.

Adding a WHERE clause to a select statement (.e.g. SELECT * FROM users WHERE password = 'test';) causes a 2.5x performance degradation latency wise. So while the max latency in my database cluster is around 186ms , a SELECT statement with a WHERE clause will take around 485ms to complete.

Meanwhile a simple SELECT * FROM users; statement completes properly around 186ms.

According to the docs, it should also run in parallel for REGIONAL BY ROW tables: https://www.cockroachlabs.com/docs/stable/cost-based-optimizer#locality-optimized-search-in-multi-region-clusters

To Reproduce

Given the following table:

CREATE TABLE users (
  id UUID NOT NULL DEFAULT gen_random_uuid(),
  username STRING NOT NULL,
  password STRING NOT NULL,
  CONSTRAINT users_pkey PRIMARY KEY (id ASC),
  UNIQUE INDEX users_username_key (username ASC)
) LOCALITY REGIONAL BY ROW

And assuming the following latency table: image

When executing the SELECT * FROM users; query against any node, it always finishes within 183ms. Looks good, everything happened nicely in parallel: image

But, when you execute SELECT * FROM users WHERE password = 'test'; (note the added WHERE clause), the query will always take around 485ms to finish. Clearly not running in parallel in all regions: image

Adding such WHERE clause causes a performance degradation of almost 2.5x! It seems to me this is a bug. It should just fetch all data in parallel matching the WHERE clause from all regions then join them together on the gateway node. If it did that, it should have finished around 183ms as well.

The bug does not seem to appear when solely using an unique index in the WHERE clause. If I would for example query SELECT * FROM users WHERE username = 'user'; then it takes around 180ms as well. Proving the parallelization works fine in that case: image

Expected behavior I would have expected that a simple SELECT statement that has a WHERE clause on it, would also execute in parallel on all remote regions, so that it finishes around the time that equals one network round trip to the furthest remote node (~186ms).

Extra Information image

Environment:

Add any other context about the problem here.

Jira issue: CRDB-31312

blathers-crl[bot] commented 12 months ago

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I was unable to automatically find someone to ping.

If we have not gotten back to your issue within a few business days, you can try the following:

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

rharding6373 commented 11 months ago

Hi @belmeopmenieuwesim, sorry for the delay in getting back to you on this issue. If this still reproduces, could you please share statement bundles of the query with and without the WHERE clause using EXPLAIN ANALYZE (DEBUG)? If you don't feel comfortable attaching it to this issue, you can share it with me securely via https://upload.cockroachlabs.com/u/harding.

I see from what you've shared that most of the time in the query with the WHERE clause is spent in KV during the scan, but the statement bundle would have a trace that could help us figure out where the time is being spent.

Also, as a side note, to take advantage of locality optimized search, it's best if the query contains a LIMIT clause. This isn't very relevant to the issue you're seeing, since the table is empty and even with a limit specified this query would have to access every region to ensure that there were no rows that matched the search.