cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
29.96k stars 3.79k forks source link

Error for stuck ALTER DATABASE causes job to hang #131342

Open kevinkokomani opened 1 week ago

kevinkokomani commented 1 week ago

Describe the problem

A customer attempted to run an ALTER DATABASE to add a newly added region to their database's zone configurations:

ALTER DATABASE db_name ADD REGION "region_name";

The behavior experienced was that this ran for two hours, seemingly stuck at 0% when checking the job's progress via the DB Console -> Jobs page. Running the statement again would yield the following error:

ERROR: "region "region_name" already added to database

However, running SHOW REGIONS FROM DATABASE db_name; disagreed with that error output above - the region did not show up in the output, namely, "region_name" does not appear below:

   database  |      region      | primary | secondary |                           zones

-------------+------------------+---------+-----------+-------------------------------------------------------------

  db_name | other_region1     |    t    |     f     | {other_region_1a, other_region_1b, other_region_1c}

  db_name | other_region2     |    f    |     f     | {other_region_2a, other_region_2b, other_region_2c}

  db_name | other_region3     |    f    |     f     | {other_region_3a, other_region_3b, other_region_3c}

It was only when running show job <job_id>; for the job ID that is shown via the DB Console -> jobs page that the actual cause for the error was revealed:

{"non-cancelable: could not validate zone config: RangeMaxBytes 20971520 less than minimum allowed 67108864"}

For database or table objects created sufficiently long ago when the default RangeMaxBytes and RangeMinBytes were much lower, and that haven't been altered since, this is prone to happen. It doesn't appear that we have any automation during the upgrade progress that would change the defaults of these objects if there are new defaults (rightfully so, as we likely don't want to silently change values during a routine upgrade without approval from the operator).

There are two main "problems" as it seems based on the above:

  1. ~The Jobs page did not clearly show this error. Whether the error should also have a recommendation as to what to do next, I'm not sure, as it could be a number of actions.~ This will be handled in a separate issue.
  2. The ALTER DATABASE ADD REGION job probably shouldn't have been stuck in this state at all.

To Reproduce

Should be reproducable with the following:

  1. Create a cluster with an old range size default, or otherwise somehow force the range size to be lower than default
  2. Create a database: create database test_db;
  3. Add a primary region, which should succeed: alter database test_db primary region "us-east-1";
  4. Upgrade to a modern version with updated defaults
  5. Attempt to add a secondary region: alter database test_db add region "us-west-1";
  6. You should get the same error

Expected behavior

Given the main problems:

  1. ~I would expect the jobs page to show any error the job has hit, which it doesn't look like it's doing in this case. Perhaps there is room for a recommendation on what to do next - perhaps it should just say to reach out to support.~ This will be done in a separate github issue.
  2. Ideally, the ALTER DATABASE ADD REGION or any similar job would roll back instead of hanging in this state where show regions from database does not match the current state of the database.

Environment:

Any version in which there was an upgraded range size default and a cluster has been upgraded to that version

Additional context

Not knowing where the error is or how to fix it can block critical production deployments.

Jira issue: CRDB-42507

blathers-crl[bot] commented 1 week ago

Hi @kevinkokomani, please add branch-* labels to identify which branch(es) this C-bug affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

rafiss commented 1 day ago

@kevinkokomani Is there a link to an escalation that provides more context? I am curious what the state of the job was when you ran show job <job_id>;. Maybe it was stuck retrying (which would also be a bug).

For the issue about the error not being surfaced in the DB Console, that would be an o11y issue. I'm going to rename this issue so it's just focused on the hanging job issue. Please file a separate issue for the o11y team to investigate. the DB Console problem.