cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

Improve index creation observability #135888

Open kevinkokomani opened 13 hours ago

Is your feature request related to a problem? Please describe.

It's very hard to tell what's happening when an index is being created.

First, the progress bar in the jobs table does not appear to be useful. We observed a customer case where an index creation was kicked off and storage steadily increased for 8 hours while the job was running, from 8 GiB total to 13+ GiB. During this entire time, the index creation job stayed at 0% progress.

Second, I went to the logs to confirm exactly what the index creation was doing and whether it was behaving properly. There appeared to be no logging that describes what the job was actually doing. These are the only logs:

Nov 19 08:14: queued new schema-change job <x> for table <a>, mutation <b>
Nov 19 08:14: job <x>: resuming execution
Nov 19 08:14: SCHEMA CHANGE job <x>: stepping through state running
Nov 19 09:36: waited for 3 [<x> <y> <x>] queued jobs to complete 1h21m40.615066558s
Nov 19 15:02: job <x>: pause requested recorded with reason
Nov 19 15:02: job <x>: adoption completed with error ‹×›
Nov 19 15:02: job <x>, session <z>: paused

I'll note that the "waited for 3 [<x> <y> <x>] queued jobs to complete" log appeared to be self-referential. Two of the three job IDs listed were the ID of the index creation job itself. The third was another SCHEMA CHANGE job, which logged the same "waited for" message with references to this original schema change job and to itself. I don't know what this log is trying to tell us.

After that, we don't get any further output until the job is paused at 15:02.
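
To put a number on that silence, here is a minimal sketch in Python (standard library only) that takes the minute-resolution timestamps from the abridged excerpt above and finds the longest stretch with no log output. The function name `longest_gap` is hypothetical, and the year is an assumption because the excerpt does not include one; it does not affect the result.

```python
from datetime import datetime, timedelta

# Timestamps copied from the abridged log excerpt above (minute resolution).
LOG_TIMES = [
    "Nov 19 08:14",
    "Nov 19 08:14",
    "Nov 19 08:14",
    "Nov 19 09:36",
    "Nov 19 15:02",
    "Nov 19 15:02",
    "Nov 19 15:02",
]


def longest_gap(stamps, year=2024):
    """Return (gap, start, end) for the longest silence between consecutive entries.

    The year is an assumption; the excerpt only shows month, day, and time.
    """
    times = [datetime.strptime(f"{year} {s}", "%Y %b %d %H:%M") for s in stamps]
    # Pair each entry with the next one and pick the pair that is furthest apart.
    start, end = max(zip(times, times[1:]), key=lambda pair: pair[1] - pair[0])
    return end - start, start, end


gap, start, end = longest_gap(LOG_TIMES)
print(f"longest silence: {gap} ({start:%H:%M} to {end:%H:%M})")
```

On this excerpt the longest silence is 5 hours 26 minutes, from 09:36 to 15:02, during which the job was running and storage was growing.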

"adoption completed with error" appears misleading - the job hasn't errored out or failed; it is, in fact, still paused.

Describe the solution you'd like

Describe alternatives you've considered

With no logging and no reliable progress indicator, I'm not aware of any alternatives except watching the capacity-used metric going up without context.

Jira issue: CRDB-44773