cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.16k stars 3.82k forks source link

sql: role membership cache refresh logic can cause a thundering herd of queries to `system.role_members` #135931

Open rafiss opened 15 hours ago

rafiss commented 15 hours ago

Describe the problem

When the role membership cache is invalidated, each node will need to rebuild the cache. Currently, each user whose membership is being looked up will cause a query on the system.role_members table. This can be seen by looking at the requestKey of the singleflight operation here: https://github.com/cockroachdb/cockroach/blob/89dae000e93ca462d99b251df47b1e162058b56c/pkg/sql/authorization.go#L641

This is wasteful. Each of those queries is already scanning all the data from system.role_members, so we should change this so that the cache only gets populated once per node. This will change the number of internal queries needed to repopulate the cache to be O(node count) rather than O((node count) * (active user count)).

To Reproduce The cache is invalidated by CREATE ROLE, DROP ROLE, and GRANT role statements, so this problem is easy to reproduce.

See the escalation https://github.com/cockroachlabs/support/issues/3128 for an example.

Expected behavior Don't have such a big latency hit.

Additional data / screenshots See the linked ticket.

Environment: This was reported on CockroachDB version v24.1.6, but all supported versions are affected.

Jira issue: CRDB-44792

Epic CRDB-43310