cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

sql: ensure stats collector does not block node startup #85582

Closed AlexTalks closed 2 years ago

AlexTalks commented 2 years ago

While most tasks run during node startup are asynchronous, those that are not can block node startup. One current example is Server.Start(..) in pkg/sql/conn_executor.go: when starting PersistedSQLStats, it synchronously loads the schedule by executing the following query:

SELECT schedule_id FROM system.scheduled_jobs WHERE schedule_name = 'sql-stats-compaction'

If this query encounters any contention, it blocks node startup, which may hinder cluster operations, as was seen in a recent support escalation. If at all possible, these tasks should be made asynchronous; if not, we should see whether we can reduce or remove the possible contention.

The call stack is as follows, starting in server.go:Server.PreStart(..): s.sqlServer.preStart(..) -> s.pgServer.Start(..) -> s.SQLServer.Start(..) -> s.sqlStats.Start(..) -> s.jobMonitor.start(..) -> j.ensureSchedule(..) -> j.getSchedule(..) -> QUERY.

Additionally, if there are any other synchronous calls that could be blocking node startup during SQL server initialization, we should attempt to make them async if possible.

Jira issue: CRDB-18323

AlexTalks commented 2 years ago

Tagging @THardy98 in reference to the recent support issue mentioned.

irfansharif commented 2 years ago

+1, it seems non-ideal to block server startup unless we truly, absolutely have to. With the current pattern, any sort of txn contention in that table can prevent new nodes from being added or restarted, which seems bad.