Open kevinkokomani opened 1 week ago
Hi @kevinkokomani, please add branch-* labels to identify which branch(es) this C-bug affects.
:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
@kevinkokomani Is there a link to an escalation that provides more context? I am curious what state the job was in when you ran SHOW JOB <job_id>. Maybe it was stuck retrying (which would also be a bug).
The error not being surfaced in the DB Console would be an o11y issue. I'm going to rename this issue so it's focused only on the hanging job. Please file a separate issue for the o11y team to investigate the DB Console problem.
Describe the problem
A customer attempted to run an ALTER DATABASE to add a newly added region to their database's zone configurations:
ALTER DATABASE db_name ADD REGION "region_name";
The behavior experienced was that this ran for two hours, seemingly stuck at 0% when checking the job's progress via the DB Console -> Jobs page. Running the statement again would yield the following error:
ERROR: region "region_name" already added to database
However, running SHOW REGIONS FROM DATABASE db_name; disagreed with the error output above: the region did not show up in the output, namely, "region_name" does not appear below:
It was only when running show job <job_id>; for the job ID shown on the DB Console -> Jobs page that the actual cause of the error was revealed:
For database or table objects that were created long ago, when the default RangeMaxBytes and RangeMinBytes were much lower, and that haven't been altered since, this is prone to happen. It doesn't appear that we have any automation during the upgrade process that would change the settings of these objects when new defaults are introduced (rightfully so, as we likely don't want to silently change values during a routine upgrade without approval from the operator).
There are two main "problems" based on the above:
The ALTER DATABASE ADD REGION job probably shouldn't have been stuck in this state at all.
To Reproduce
Should be reproducible with the following:
create database test_db;
alter database test_db primary region "us-east-1";
alter database test_db add region "us-west-1";
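While working through the steps above, the state of the job and the database's range size settings can be inspected directly from a SQL shell. The following is only a sketch: it assumes the test_db database from the reproduction, and the byte values in the final statement are what I believe to be the current defaults (128 MiB and 512 MiB), which should be verified against the documentation for the running version before being applied.

```sql
-- List recent jobs with their status, progress, and error text.
-- The error column is where the underlying failure shows up even when
-- the DB Console only reports 0% progress.
SELECT job_id, job_type, status, fraction_completed, error
FROM [SHOW JOBS]
ORDER BY created DESC
LIMIT 10;

-- Check which regions the database actually has.
SHOW REGIONS FROM DATABASE test_db;

-- Inspect the database's zone configuration, including
-- range_min_bytes and range_max_bytes.
SHOW ZONE CONFIGURATION FROM DATABASE test_db;

-- ASSUMPTION: these are the current default range sizes; confirm them for
-- your version. Changing them is an operator decision, not an automatic fix.
ALTER DATABASE test_db CONFIGURE ZONE USING
    range_min_bytes = 134217728,
    range_max_bytes = 536870912;
```

If the zone configuration shows range sizes well below the current defaults, that matches the situation described in this issue for objects created on much older versions.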
Expected behavior
Given the main problems:
show regions from database does not match the current state of the database.
Environment:
Any version in which there was an upgraded range size default and a cluster has been upgraded to that version
Additional context
Not knowing where the error is or how to fix it can block critical production deployments.
Jira issue: CRDB-42507