etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

Point in time recovery in ETCD #16962

Open KosovGrigorii opened 11 months ago

KosovGrigorii commented 11 months ago

What would you like to be added?

I recently started implementing WAL-G functionality for etcd, and while doing so I noticed that most of the required pieces are already implemented here.

etcd already saves the WAL (state and entries) and snapshots to the underlying stable storage. The only thing missing is point-in-time recovery (PITR), i.e. the ability for a user to restore the storage to a specific point in time using the closest snapshot and the WAL.

Right now, etcd only supports restoring to the last entry in the WAL, starting from a snapshot.

Why is this needed?

PITR can be useful when we want to skip the last entries (for instance, to recover deleted data) and/or rewrite them.

KosovGrigorii commented 11 months ago

Is it possible to add timestamps to the WAL records, which are stored in the "%016x-%016x.wal" files?
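For reference, a small sketch of how that file name pattern can be produced and parsed in Go. The helper names and the meaning I give to the two hex fields (a sequence number and a starting raft index) are my assumptions for illustration; only the format string itself comes from the file names.

```go
package main

import "fmt"

// walName builds a file name from two 64-bit values using the
// "%016x-%016x.wal" pattern. Treating them as (sequence, index)
// is an assumption made for this sketch.
func walName(seq, index uint64) string {
	return fmt.Sprintf("%016x-%016x.wal", seq, index)
}

// parseWALName extracts the two hex fields from a file name
// that follows the same pattern.
func parseWALName(name string) (seq, index uint64, err error) {
	_, err = fmt.Sscanf(name, "%016x-%016x.wal", &seq, &index)
	return seq, index, err
}

func main() {
	name := walName(1, 42)
	fmt.Println(name) // prints: 0000000000000001-000000000000002a.wal

	seq, index, err := parseWALName(name)
	fmt.Println(seq, index, err) // prints: 1 42 <nil>
}
```

Note that nothing in the name carries wall-clock time, which is why a timestamp would have to live inside the records themselves.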

KosovGrigorii commented 11 months ago

@serathius could you please tell me whether that is possible?

tjungblu commented 10 months ago

Out of curiosity, can you detail a bit how wal-g intends to back up the snapshots and the WAL? I assume it doesn't just copy the files over every minute ;)

Adding a timestamp to the WAL record is technically easy; it would go into the proto definition: https://github.com/etcd-io/etcd/blob/main/server/storage/wal/walpb/record.proto#L17

Not sure we would backport this to the current 3.5 branch though. What are your expectations with actually using it? Can this wait for a 3.6 release?

As for the restore procedure, are you aware of the implications of revisions going back in time? Check out https://github.com/kubernetes/kubernetes/issues/118501

KosovGrigorii commented 10 months ago

Thank you for your answer. wal-g is a CLI tool for archival and restoration. It will basically use the methods already implemented in etcd to create snapshots and put them in the storage that the user specifies in the config. In other words, it is a wrapper that lets users work with different databases.

We want to give users the ability to restore their cluster to a certain point in time, meaning that all transactions applied after that time will be ignored. With the content currently stored in the records, it is only possible to restore a cluster to a specific transaction.

It can certainly wait for the 3.6 release :)

stale[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

Denchick commented 1 week ago

Hi, @tjungblu! I hope you're doing well! I wanted to follow up on this issue. We're currently working on the SPQR project, which is about PostgreSQL sharding. We store our metadata in etcd and already know how to restore shards to a specific point in time using WAL-G.

But there is no PITR support for etcd, so #17233 is very important to us. Could you please tell us how we can move forward here? Thank you so much!