cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
29.9k stars 3.78k forks source link

sql: index backfill fails to complete with default settings #130939

Open andrewbaptist opened 2 days ago

andrewbaptist commented 2 days ago

After creating a cluster and attempting to add an index, the index can get stuck continually failing due to memory settings. The only way to make progress is to manually change the bulkio.index_backfill.batch_size setting, however that should not be required and is not clear from the UI.

Create a 12 node cluster:

roachprod create -n12 $CLUSTER 
roachprod put $CLUSTER artifacts/cockroach
roachprod start --store-count 2 $CLUSTER
roachprod ssh $CLUSTER:1 "./cockroach workload init kv {pgurl:1}"
roachprod ssh $CLUSTER:1 "./cockroach workload run kv  --duration=600s --max-block-bytes=10000 --min-block-bytes=10000 --concurrency=100 {pgurl:1-12}"

Attempt to create an index on the cluster

CREATE INDEX i ON kv.kv (k, v)

Note that this will never complete and instead get stuck on step 2.

To unstick it, set the backfill batch size (this can be run either before the index creation or while it is stuck).

SET CLUSTER SETTING bulkio.index_backfill.batch_size = 5000;

The index creation will now complete successfully.

Jira issue: CRDB-42302

blathers-crl[bot] commented 2 days ago

Hi @andrewbaptist, please add branch-* labels to identify which branch(es) this C-bug affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

rafiss commented 1 day ago

Note that this will never complete and instead get stuck on step 2.

Could you clarify a bit more? What does "step 2" refer to?


Which version were you using when this occurred? We recently merged https://github.com/cockroachdb/cockroach/pull/128201, which is meant to address a problem similar to what you may have seen.

andrewbaptist commented 1 day ago

This is on master (from today) so it has that fix. Step 2 refers to the step listed in the jobs table (2/7). I didn't very carefully look at the details about why it was stuck but it is quick to run using the steps above.