cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

sql: schema changes can be very slow #47790

Closed: ajwerner closed this issue 4 years ago

ajwerner commented 4 years ago

Describe the problem

We've had a number of reports of schema changes being slow lately (#38111, #47607, #47512). This is a meta-issue to lay out the theorized problem, track reproduction and evaluation, and propose next steps.

Theory

We've exposed some schema changes to the job adoption cycle, even when those changes themselves would otherwise complete very quickly.
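To make that concrete, here is a toy Go simulation (not CockroachDB code; the interval and timings are invented for illustration) of what polling-based adoption does to a fast schema change: the work itself takes milliseconds, but end-to-end latency is dominated by waiting for the next adoption tick.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Stand-in for the real adoption loop interval (~30s); shortened so the
	// demo finishes quickly. The schema change's own work is trivially fast.
	const adoptInterval = 3 * time.Second
	jobCreated := time.Now()

	ticker := time.NewTicker(adoptInterval)
	defer ticker.Stop()

	<-ticker.C // the registry only notices the job on its next tick
	workStart := time.Now()
	time.Sleep(50 * time.Millisecond) // the schema change itself

	fmt.Printf("work took %v, end-to-end latency was %v\n",
		time.Since(workStart).Round(time.Millisecond),
		time.Since(jobCreated).Round(time.Millisecond))
}
```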

Alternative Consideration

We scan the entire jobs table in the job adoption loop; as this table gets big, that scan is probably slow.
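A hedged sketch of why the scan hurts (the queries below are illustrative, not the actual adoption query; it assumes a local insecure node on the default SQL port and a Postgres-wire driver such as lib/pq): counting every row in system.jobs grows with the table, while restricting to adoptable statuses keeps the per-cycle work bounded.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // any Postgres-wire driver would do
)

func main() {
	// Assumes a local insecure node on the default SQL port.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Unbounded: touches every job row ever written, including old, terminal jobs.
	var total int
	if err := db.QueryRow(`SELECT count(*) FROM system.jobs`).Scan(&total); err != nil {
		log.Fatal(err)
	}

	// Bounded: only the jobs that could actually need adoption.
	var adoptable int
	if err := db.QueryRow(
		`SELECT count(*) FROM system.jobs WHERE status IN ('pending', 'running')`,
	).Scan(&adoptable); err != nil {
		log.Fatal(err)
	}

	fmt.Printf("total jobs: %d, adoptable jobs: %d\n", total, adoptable)
}
```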

Proposals

I'm not exactly sold on the type prioritization, but at least it's easy. For the others, we could inject a function per type that controls its ranking.

It doesn't need to be very tightly coupled.
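For illustration, here is a minimal sketch of that per-type ranking hook (all names below are invented, not the real jobs registry API): each job type registers a ranking function, and the adoption loop sorts candidates by rank, falling back to age so equal-rank jobs stay FIFO.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

type candidateJob struct {
	ID      int64
	Type    string
	Created time.Time
}

// rankFuncs would be populated by each job type's package; lower rank is adopted first.
var rankFuncs = map[string]func(candidateJob) int{
	"SCHEMA CHANGE": func(candidateJob) int { return 0 },
	"BACKUP":        func(candidateJob) int { return 1 },
	"IMPORT":        func(candidateJob) int { return 2 },
}

func rank(j candidateJob) int {
	if f, ok := rankFuncs[j.Type]; ok {
		return f(j)
	}
	return 100 // unknown types go last
}

func main() {
	jobs := []candidateJob{
		{1, "IMPORT", time.Now().Add(-2 * time.Minute)},
		{2, "SCHEMA CHANGE", time.Now()},
		{3, "BACKUP", time.Now().Add(-1 * time.Minute)},
	}
	sort.Slice(jobs, func(a, b int) bool {
		ra, rb := rank(jobs[a]), rank(jobs[b])
		if ra != rb {
			return ra < rb
		}
		return jobs[a].Created.Before(jobs[b].Created)
	})
	for _, j := range jobs {
		fmt.Printf("adopt job %d (%s)\n", j.ID, j.Type)
	}
}
```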


ajwerner commented 4 years ago
  1. Reproduces well enough with the set of queries in https://github.com/cockroachdb/cockroach/issues/47607

From an offline discussion with @spaskob: we can sidestep 2. above by instead having the Registry.Run() call force the registry to adopt specific jobs rather than any jobs. For 3., schema change jobs that are aware of subsequent jobs in the mutations queue could then launch a goroutine that calls Registry.Run() explicitly, or something like that.

  1. We should do this by just not writing jobs in cases where we use sqlbase.InvalidMutationID (see the sketch after this list).

  2. also seems good
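To tie the two points together, here is a toy sketch (types and names are invented stand-ins, not the real jobs package API): no job is written when the mutation ID is the invalid sentinel, and when a job is written it is handed directly to a Run-style call so it starts immediately instead of waiting for the polling loop to find it.

```go
package main

import "fmt"

const invalidMutationID = 0 // stands in for sqlbase.InvalidMutationID

type registry struct{ adopted []int64 }

// Run adopts the given job IDs immediately instead of waiting for the
// next polling cycle to discover them.
func (r *registry) Run(jobIDs []int64) {
	for _, id := range jobIDs {
		r.adopted = append(r.adopted, id)
		fmt.Printf("adopted and running job %d now\n", id)
	}
}

// maybeCreateJob skips writing a job entirely for no-op schema changes
// that carry the invalid mutation ID.
func maybeCreateJob(r *registry, mutationID int64, nextJobID int64) {
	if mutationID == invalidMutationID {
		fmt.Println("no mutation to process: not writing a job at all")
		return
	}
	r.Run([]int64{nextJobID})
}

func main() {
	r := &registry{}
	maybeCreateJob(r, invalidMutationID, 1) // point 1: no job written
	maybeCreateJob(r, 42, 2)                // point 2: job adopted without the 30s wait
}
```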

spaskob commented 4 years ago

This is a simple way to reproduce the slowness:

cockroach sql --insecure --watch 1s -e '
  drop table if exists users cascade;
  create table users (id uuid not null, name varchar(255) not null, email varchar(255) not null, password varchar(255) not null, remember_token varchar(100) null, created_at timestamp(0) without time zone null, updated_at timestamp(0) without time zone null, deleted_at timestamp(0) without time zone null);
  alter table users add primary key (id);
  alter table users add constraint users_email_unique unique (email);'
spaskob commented 4 years ago

It turns out there's a short-term fix that seems to work pretty well. The reason for the slowness is that if the schema change is not first in line in the table's mutation queue, it returns a retriable error and the jobs framework re-adopts and runs it later. The problem is that the job adoption loop only runs every 30s. This is now fixed via #48608.
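For concreteness, here is a toy model of the pre-fix behavior described above (invented names and a shortened interval, not the actual schema changer): a job whose mutation is not at the head of the table's mutation queue returns a retriable error, and it only gets another chance on the next adoption cycle.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errNotFirstInLine = errors.New("schema change not first in mutation queue; retry later")

// runSchemaChangeJob only makes progress once its mutation is at the head
// of the table's mutation queue.
func runSchemaChangeJob(queue []int, myMutationID int) error {
	if len(queue) == 0 || queue[0] != myMutationID {
		return errNotFirstInLine
	}
	fmt.Printf("mutation %d processed\n", myMutationID)
	return nil
}

func main() {
	const adoptInterval = 2 * time.Second // stands in for the real ~30s loop
	queue := []int{1, 2, 3}               // earlier mutations queued ahead of ours
	const myMutationID = 3
	start := time.Now()

	for {
		err := runSchemaChangeJob(queue, myMutationID)
		if err == nil {
			break
		}
		fmt.Println(err)
		time.Sleep(adoptInterval) // the job is re-adopted only on the next cycle
		queue = queue[1:]         // meanwhile, an earlier mutation completes
	}
	fmt.Printf("a near-instant mutation took %v end to end\n", time.Since(start).Round(time.Second))
}
```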

spaskob commented 4 years ago

I will still work on the job improvements to prevent future regressions and simplify the jobs framework. This PR https://github.com/cockroachdb/cockroach/pull/48600 is an example of what is coming next.

jordanlewis commented 4 years ago

Points 2 and 3 were subsumed by #48621.

Points 4 and 5 are lower priority.

kocoten1992 commented 4 years ago

Hi, schema changes have gotten faster, but there is still an issue:

CPU usage is still extremely high (around 90-100% on a 4-core machine) during a schema change.

ajwerner commented 4 years ago

Hi, schema changes have gotten faster, but there is still an issue:

CPU usage is still extremely high (around 90-100% on a 4-core machine) during a schema change.

How much data is in the tables, and how many schema changes are you running? If you'd be willing to provide any sort of script to reproduce, or CPU profiles, that would be very helpful.
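In case it helps with gathering a profile, here is a minimal sketch that pulls a CPU profile from a node's debug endpoint and writes it to disk for `go tool pprof` or for attaching here. It assumes an insecure local node whose HTTP port is the default 8080 and that the standard Go pprof endpoints are reachable without authentication; adjust host, port, and TLS for a real deployment.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Collect a 30-second CPU profile from the node's debug endpoint.
	resp, err := http.Get("http://localhost:8080/debug/pprof/profile?seconds=30")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
	log.Println("wrote cpu.pprof")
}
```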

aeneasr commented 3 years ago

We are also experiencing extremely high CPU usage (>90%), as well as high bandwidth usage, in our Cockroach Cloud 3-node production cluster (ID 36691cbd-9927-438f-83af-cdc3c06a2b20). Is this regression truly resolved?