
restore: enable "online" restore #113074

Open dt opened 1 year ago

dt commented 1 year ago

Summary

We need to fundamentally modify RESTORE so that restored data becomes available to business-critical workloads immediately, even while the lengthy restoration process is still ongoing.

RESTORE as it operates today is too slow to be a viable strategy for many customers to meet their recovery-time objectives (RTOs). This stems from the simple fact that a CockroachDB cluster is able to store more data -- data that then needs to be restored during a recovery event -- than can be copied from a backup in the amount of time the cluster is allowed to be unavailable.

To remedy this, we intend to make restoration an "online" process. In an online restoration process the data becomes available to the workload immediately, with some amount of degraded performance, even while the process of copying it into the cluster is ongoing.

Background

CockroachDB is designed to be resilient and highly available with the goal of ensuring a customer never needs to restore from backups. However, the ability to do so in the event of a total cluster-loss disaster, and to do so in an amount of time defined by business requirements, is still a must-have feature of any database of record.

We typically measure the size of a RESTORE as the total aggregate disk space used across all nodes after it is complete -- that is, replicated and compressed -- as these are the bytes that the disks and network links need to transport. Our current target node density is up to 3TB/node, meaning a complete restore potentially needs to write approximately 3TB/node.

Today, we state that we expect RESTORE to write at approximately 50MB/s/node, when averaged across all nodes and the entire duration of the restore. For an optimally provisioned cluster, this points to a full-cluster restore time of 16.7 hours. This RTO is not acceptable to our users. This time can also be worse for users running in non-recommended, but still supported, configurations.

We likely could improve this number by some amount with concerted effort. As an upper-bound "laws of physics" limit, we expect to be inarguably bottlenecked by disk write speed. Major cloud providers advertise write rates of between 250MB/s and 1000MB/s, which would translate to restore times of 3.3 hours and 50 minutes, respectively. This would likely be an acceptable RTO for most cases.
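For reference, these restore-time figures follow directly from dividing the per-node data volume by the per-node write rate:

```math
\begin{aligned}
\frac{3\,\mathrm{TB}}{50\,\mathrm{MB/s}} &= 60{,}000\,\mathrm{s} \approx 16.7\,\mathrm{h} \\
\frac{3\,\mathrm{TB}}{250\,\mathrm{MB/s}} &= 12{,}000\,\mathrm{s} \approx 3.3\,\mathrm{h} \\
\frac{3\,\mathrm{TB}}{1000\,\mathrm{MB/s}} &= 3{,}000\,\mathrm{s} = 50\,\mathrm{min}
\end{aligned}
```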

However, we feel it is unlikely we can achieve this theoretical upper bound with our current architecture.

Limitations of existing architecture

Under the current system, we observe inconsistent and uneven utilization across nodes. For instance, sometimes a cluster will sustain write rates of > 120MB/s/node for some time during a restore. But at other times several nodes will be idle or nearly idle, while just one or a few nodes are busy.

This inconsistency stems from a variety of factors. The clearest of these include uneven work distribution, hardware differences, and noisy neighbors. These are in turn exacerbated by writing through the raft layer, where idle nodes must wait for busier nodes. The number of layers through which a restore flow must pass poses further challenges as well. These layers are not fully available to RESTORE; rather, they are shared with operations critical to cluster health (such as liveness). Each write layer consequently needs to throttle and back-pressure conservatively to avoid exceeding the capacity of the layer beneath it and destabilizing the cluster.

Potential improvements to existing architecture

It is possible that significant investment in tuning or rewriting the various layers that restored data must flow through today could improve its throughput. This would require broad effort: affected systems include the restore processors, the kv client and server, raft and replication, and storage. Moreover, prior attempts to increase restore's throughput have in fact destabilized clusters; examples include slow heartbeats causing membership thrashing and OOMs in the raft queue.

Any such improvement would need to be on the order of 10x, and could not compromise cluster stability. Work in this direction appears very expensive and very risky. It is also not clear that it would be sufficient for all use cases: some customers have RTO targets of <1hr, and a full return to operation includes steps beyond the formal RESTORE operation itself (cluster provisioning, backup bucket permissioning, etc.).

Preferred strategy: partial availability

We propose a new architecture: a RESTORE operation where a cluster can become available for SQL queries after some amount of "partial" download, even before every last byte has been copied. We know, based on the sizes of caches and their hit rates, that in most use cases a small fraction of the data suffices to serve the vast majority of requests. By prioritizing which data to download first, we could in theory be completely available to most requests with very little data copied, i.e. after very little time. We believe this model would provide a substantially superior disaster recovery experience to users, at less engineering risk and cost.
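As a purely illustrative calculation -- the working-set fraction here is an assumption chosen for illustration, not a measured value -- if the hot set were on the order of 1% of a 3TB node, the vast majority of requests could be served after copying only tens of gigabytes per node:

```math
0.01 \times 3\,\mathrm{TB} = 30\,\mathrm{GB}, \qquad \frac{30\,\mathrm{GB}}{50\,\mathrm{MB/s}} = 600\,\mathrm{s} = 10\,\mathrm{min}
```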

Prioritizing bytes

One possible way to be partially available after only partially copying data would be to prioritize specific object-level restores: downloading more critical tables before less critical tables, more critical databases before less critical ones, and so on, making each object available as it is restored. However, this poses implementation challenges for schemas that include relations between tables, and it also wastes much of the potential of this strategy, as a large table may contain only a small number of high-priority rows.

Instead, we propose making all data available while providing only partial performance. Rather than waiting until all bytes are copied into fast local storage, we will map in the remote backup files so that queries can read directly out of the backup. Such reads are expected to be slower than reads of files copied into the cluster, so initially such a schema has reduced performance and thus reduced capacity. However, based on our observed caching patterns, we expect the reduced performance to only minimally impact most workloads. This is analogous to populating a server cache on a fresh start.

In this approach, we would first "link" the remote backup data into the cluster via references directly to the remote files. Query execution can use these references to fetch the actual data on the fly as needed, while the RESTORE operation downloads the linked files in the background. This means the restored cluster or tables can be "online" from the perspective of the workload while they are being restored. Once enough of the required data is downloaded or cached, the workload's performance should reach a level that can be considered recovered. Given how much smaller this active set is expected to be than the full data size, we expect this approach could meet RTOs orders of magnitude more stringent than a restore that must copy everything to make anything available.

High-level design

Online restore builds on the low-level functionality being built in pebble to allow reading directly from files stored in external blob storage, as opposed to local block devices, as part of its broader "disaggregated storage" support. Note that while online restore builds on the same low-level support, its use is separate from and independent of whether a cluster has been configured to use disaggregated storage.

In an online restore, rather than opening every file and iterating its contents into batches that are then sent to the KV layer to be replicated and written, an initial "link" phase sends just the URLs and parameters of every file to the KV layer, which then replicates those links. Once all files are linked, the table, database, tenant, or cluster is switched to be online and available to queries, and the restore then proceeds to its second phase, downloading the files pointed to by each link in the background.
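To make the ordering of the two phases concrete, the following is a minimal, purely hypothetical Go sketch; none of the type or function names correspond to real CockroachDB or pebble APIs. It only illustrates linking file references first, switching the target online, and then downloading in the background.

```go
package main

import (
	"fmt"
	"time"
)

// backupFile describes one backup SST by reference rather than by content.
type backupFile struct {
	url  string // e.g. "s3://bucket/backup/1.sst" (illustrative)
	span string // key span covered by the file
}

// restoreTarget stands in for the table, database, tenant, or cluster being restored.
type restoreTarget struct {
	linked []backupFile
	online bool
}

// linkPhase records only the file references; in the real system this would be
// a replicated KV write carrying the URL and parameters of each file.
func (t *restoreTarget) linkPhase(files []backupFile) {
	t.linked = append(t.linked, files...)
	// Once every file is linked, the target can be switched online.
	t.online = true
}

// downloadPhase copies the linked files into local storage in the background,
// while the target is already online and serving (initially slower) reads.
func (t *restoreTarget) downloadPhase() {
	for _, f := range t.linked {
		fmt.Printf("downloading %s covering %s\n", f.url, f.span)
		time.Sleep(10 * time.Millisecond) // stand-in for the actual copy
	}
}

func main() {
	t := &restoreTarget{}
	t.linkPhase([]backupFile{
		{url: "s3://bucket/backup/1.sst", span: "/Table/100/1-/Table/100/2"},
		{url: "s3://bucket/backup/2.sst", span: "/Table/101/1-/Table/101/2"},
	})
	fmt.Println("online:", t.online) // queries are admitted from this point on
	t.downloadPhase()                // the restore job keeps running in the background
}
```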

When queries attempt to access data in files that have not yet been downloaded, pebble will transparently open the remote file and read the required data from it as needed, while also caching that data.
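Similarly, a minimal hypothetical sketch of this read-path behavior (again, not pebble's actual API): serve a block from a local cache when present, otherwise fetch it directly from the remote backup file and cache it so subsequent reads are fast.

```go
package main

import "fmt"

// blockID identifies a block of a backed-up SST by file and offset.
type blockID struct {
	file   string
	offset int64
}

// remoteReadPath caches blocks that have already been fetched from the backup.
type remoteReadPath struct {
	cache map[blockID][]byte
}

// fetchRemote stands in for an object-storage range read against the backup file.
func fetchRemote(id blockID) []byte {
	return []byte(fmt.Sprintf("block at offset %d of %s", id.offset, id.file))
}

// readBlock serves from the cache when possible and otherwise reads directly
// out of the backup, caching the result so the next read is fast.
func (r *remoteReadPath) readBlock(id blockID) []byte {
	if b, ok := r.cache[id]; ok {
		return b // fast path: block already cached locally
	}
	b := fetchRemote(id) // slow path: read straight from the remote backup file
	r.cache[id] = b
	return b
}

func main() {
	r := &remoteReadPath{cache: map[blockID][]byte{}}
	id := blockID{file: "s3://bucket/backup/1.sst", offset: 4096}
	fmt.Printf("%s\n", r.readBlock(id)) // remote fetch
	fmt.Printf("%s\n", r.readBlock(id)) // cache hit
}
```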

Detailed Design

TODO

Jira issue: CRDB-32740

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/disaster-recovery