cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
29.87k stars 3.77k forks source link

spanconfig: SystemTargetTypeEntireKeyspace should not apply to NodeLiveness #102338

Closed adityamaru closed 1 year ago

adityamaru commented 1 year ago

The SystemTargetTypeEntireKeyspace is the system span config target that is written to apply a spanconfig to the entire keyspace of the system tenant. The keyspace currently translates to the keys.EverythingSpan which extends from [roachpb.KeyMin, roachpb.KeyMax). This implies that the spanconfig will apply to all non-table spans such as NodeLiveness, table-spans and all secondary tenants. A cluster backup and cluster restore in the system tenant use this system span config target to apply a protected timestamp on the entire cluster. As mentioned above, this protection will also apply to non-table spans such as NodeLiveness that are not backed up or restored. This prevents the GC queue from clearing up MVCC revisions from these non-table spans. NodeLiveness in particular has a short GC TTL of 600s so as to encourage faster GC because of the large number of updates that the span receives. The short TTL will not be respected if there is a running or paused cluster backup or restore in the system tenant.

This is a regression in behaviour from 21.2 where backup would only protect the table and tenant spans that it was going to read data from and export. To fix this we should consider modifying the SystemTargetTypeEntireKeyspace to only refer to table data and all secondary tenant keys in the system tenant keyspace.

Jira issue: CRDB-27400

andrewbaptist commented 1 year ago

This issue should be a high priority. This can result in the liveness range from getting cleaned up and scans becoming expensive. This can cause large clusters to fall over when it happens. We may not need the backport to 22.1, but definitely 22.2 and 23.1.

arulajmani commented 1 year ago

This can result in the liveness range from getting cleaned up

I don't follow. Could you expand?

andrewbaptist commented 1 year ago

@arulajmani - Sorry I meant to follow up on this earlier. The issue is that if the protected timestamp is "stuck" because of a backup or changefeed, then we accumulate a lot of garbage in the liveness range since it is updated every 3 sec by every node. We have seen cases where this gets stuck for weeks and the range becomes very slow to scan. This then ends up causing extended holding of latches and general badness.

aadityasondhi commented 1 year ago

Finally got around to start working on this. What do you think about something like this?

    // TableSpan is a span that covers all System and Tenant tables. It is a
    // subset of EverythingSpan and skips certain keyspaces that are not used in
    // full cluster backups (i.e. NodeLiveness, Timeseries)
    TableSpan = roachpb.Span{Key: TableDataMin, EndKey: roachpb.KeyMax}

Based on this doc, this will skip nodeliveness but also skip things such as timeseries and storage metadata. Is this too narrow scope for backups?

cc @arulajmani @andrewbaptist

arulajmani commented 1 year ago

this will skip nodeliveness but also skip things such as timeseries and storage metadata

That's a really good point. Maybe we should check with the DR team if this is something BACKUPs require protection over (@adityamaru would you happen to know?).


Alternatively, another approach here is to do something targeted just for the node liveness span. So instead of trying to redefine what EntireKeyspace means, we could instead do something in the GetProtectionTimestamps that's exposed by the ProtectedTSReader interface. Something around here:

https://github.com/cockroachdb/cockroach/blob/de4cb4b39700a81e7dc4aa305bdcc21f28c0ff85/pkg/spanconfig/spanconfigkvsubscriber/kvsubscriber.go#L375-L384

I was thinking something like:

diff --git a/pkg/spanconfig/spanconfigkvsubscriber/kvsubscriber.go b/pkg/spanconfig/spanconfigkvsubscriber/kvsubscriber.go
index 7af3d07d090..e39c0beedcc 100644
--- a/pkg/spanconfig/spanconfigkvsubscriber/kvsubscriber.go
+++ b/pkg/spanconfig/spanconfigkvsubscriber/kvsubscriber.go
@@ -380,6 +380,13 @@ func (s *KVSubscriber) GetProtectionTimestamps(
                                if config.ExcludeDataFromBackup && protection.IgnoreIfExcludedFromBackup {
                                        continue
                                }
+                               if sp.Equal(keys.NodeLivenessSpan) {
+                                       // Node liveness is a high churn range that's frequently scanned. It
+                                       // has a low GC TTL to make sure such scans don't have to contend with
+                                       // too much MVCC garbage. In the spirit of that, let's not allow
+                                       // protected timestamps on this range.
+                                       continue
+                               }
                                protectionTimestamps = append(protectionTimestamps, protection.ProtectedTimestamp)
                        }
                        return nil
aadityasondhi commented 1 year ago

Alternatively, another approach here is to do something targeted just for the node liveness span.

I just want to call out that the issue mentions:

To fix this we should consider modifying the SystemTargetTypeEntireKeyspace to only refer to table data and all secondary tenant keys in the system tenant keyspace.

But would want someone from @cockroachdb/disaster-recovery to comment on whether we want to exclude everything but table data from this. Ideally, would like confirmation on which parts of the keyspace need the protected timestamp based on the doc I referenced earlier.

aadityasondhi commented 1 year ago

@arulajmani Based on this slack thread, what are your thoughts on changing the SystemTargetTypeEntireKeyspace to only include what I suggested above? Seems like BACKUP only cares about the table data and nothing else.

dt commented 1 year ago

backups of tenants (e.g. backup tenant foo) will backup their entire span, e.g. [/Tenant/5,/Tenant/6). Any other kind of backup -- table/database/"cluster" -- will backup some number of individual user and system SQL tables, but no non-table spans.

arulajmani commented 1 year ago

@arulajmani Based on this slack thread, what are your thoughts on changing the SystemTargetTypeEntireKeyspace to only include what I suggested above? Seems like BACKUP only cares about the table data and nothing else.

Based on the slack thread + what @dt mentioned, I think your suggestion to change the meaning of SystemTargetTypeEntireKeyspace makes sense. Maybe consider renaming that thing to be something more descriptive with our new understanding as well?

aadityasondhi commented 1 year ago

Going to backport fix

aadityasondhi commented 1 year ago

Backported to 22.1, 22.2, and 23.1.

andrewbaptist commented 1 year ago

We should revert this if it causes liveness range splits. This will cause a variety of problems.

arulajmani commented 1 year ago

We should revert this if it causes liveness range splits. This will cause a variety of problems.

That was a red herring. The PR that closed this issue wasn't causing liveness to split. We were instead running into https://github.com/cockroachdb/cockroach/issues/104195.