apache / kvrocks

Apache Kvrocks is a distributed key-value NoSQL database that uses RocksDB as its storage engine and is compatible with the Redis protocol.
https://kvrocks.apache.org/
Apache License 2.0

Allow using S3 to back up the Kvrocks DB #1478

Open · git-hulk opened 1 year ago

git-hulk commented 1 year ago

Motivation

Most users want a backup of the DB dir, but we only support backups on the local file system, and that may cause trouble if not enough disk space was reserved. It would be better if we could put the backup on cloud storage like S3/GCS/...

Solution

No response

Are you willing to submit a PR?

torwig commented 1 year ago

@git-hulk Let me try to implement this feature.

git-hulk commented 1 year ago

@torwig Thanks a lot. For this issue, I am not sure if it's good to compress the db into a single object and then upload it.

torwig commented 1 year ago

@git-hulk Thank you for the tip. I'm going to think through the whole process and propose something like a "high-level design" and "possible implementation(s)" before actually starting to implement, so we can discuss all the key points.

git-hulk commented 1 year ago

🆒 Thanks

chrisxu333 commented 1 year ago

Hi @torwig are you still working on this issue? If not @git-hulk could I take it up?

torwig commented 1 year ago

@chrisxu333 Currently, I can't dedicate my time to this issue. If you'd like to take it, @git-hulk can reassign it to you.

mapleFU commented 1 year ago

Initializing S3/GCS etc. would be a bit tricky; maybe the opendal C SDK would help: https://github.com/apache/incubator-opendal . It would also work for testing on a local machine. Other C++ tools are welcome too. Since S3 credential configuration is a bit tricky, I think we'd better use a third-party library at first.

Also, the dependencies for an object-storage SDK can get complex, so we'd better make clear what the config would look like (a hypothetical sketch follows the list below). You can investigate how other systems do that:

  1. https://tikv.org/docs/6.5/concepts/explore-tikv-features/backup-restore-cn/
  2. https://www.cockroachlabs.com/docs/stable/backup
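
For illustration only, a hypothetical sketch of what such settings might look like in kvrocks.conf; none of these keys exist today, and the real design should come out of the investigation above:

```
# Hypothetical keys -- not implemented, for discussion only
backup-storage      s3
backup-s3-endpoint  https://s3.us-east-1.amazonaws.com
backup-s3-bucket    my-kvrocks-backups
backup-s3-prefix    cluster-a/
# Credentials would more likely come from environment variables or an
# instance role than from kvrocks.conf
```
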
git-hulk commented 1 year ago

To be honest, I haven't thought through whether this feature should live inside Kvrocks. Perhaps implementing a new dedicated backup tool, like the ClickHouse one, is a good idea.

Reference: https://github.com/Altinity/clickhouse-backup

mapleFU commented 1 year ago

🤔 ClickHouse can read from remote S3, so I think it's also able to upload or back up to S3.

However, TiKV only supports backup through the separate br tool (see: https://tikv.org/docs/6.5/concepts/explore-tikv-features/backup-restore-cn/ ). Maybe we can consider doing it the same way. That would also avoid any size amplification of our binary and contain the risk of an immature implementation.

git-hulk commented 1 year ago

@mapleFU Thanks for your great references!

asad-awadia commented 8 months ago

if it's good to compress the db into a single object and then upload it.

Why not?

Create, then compress the backup, and then upload the single file
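
A minimal sketch of that flow with standard tools, assuming the backup has already been written to /var/lib/kvrocks/backup; the paths and bucket name are just examples:

```shell
# Compress the backup directory into a single object (paths/bucket are examples)
tar -czf kvrocks-backup.tar.gz -C /var/lib/kvrocks backup
# Upload the single file with the AWS CLI
aws s3 cp kvrocks-backup.tar.gz s3://my-kvrocks-backups/kvrocks-backup.tar.gz
# Drop the local archive once the upload succeeds
rm kvrocks-backup.tar.gz
```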

kinoute commented 7 months ago

Encryption of the backup file(s) would be nice too. Right now we are planning to mount the PVC volume in our Kubernetes cluster in a cronjob, make an encrypted archive, and upload it to S3.

But yes, the fact that the backup is first generated on the same volume can be problematic (lack of space etc).

git-hulk commented 7 months ago

But yes, the fact that the backup is first generated on the same volume can be problematic (lack of space etc).

Kvrocks allows changing the backup dir via config set backup-dir. It now uses a RocksDB checkpoint as the backup, which uses hard links when copying files. Perhaps you can remove the backup after syncing it to S3?
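
A rough sketch of that workaround (port, paths, and bucket are placeholders, and it assumes bgsave is what triggers the checkpoint-based backup into backup-dir):

```shell
# Point the backup dir at a volume with enough free space
redis-cli -p 6666 config set backup-dir /mnt/spare/kvrocks-backup
# Create the checkpoint-based backup
redis-cli -p 6666 bgsave
# Sync it to S3, then drop the local copy
aws s3 sync /mnt/spare/kvrocks-backup s3://my-kvrocks-backups/latest/
rm -rf /mnt/spare/kvrocks-backup/*
```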

Xuanwo commented 7 months ago

Hi, I'm Xuanwo from the OpenDAL community. I've been watching the development of Kvrocks for some time and find this issue interesting.

As you may know, OpenDAL offers a unified data access layer, empowering users to seamlessly and efficiently retrieve data from diverse storage services. I feel OpenDAL would be a good fit for Kvrocks to implement backup to/from storage services like s3/gcs/azblob/...

Since the Kvrocks code base is mainly C++, there are two ways to integrate with OpenDAL:


Sorry for not reading the thread carefully. I found @mapleFU already mentioned opendal.

mapleFU commented 6 months ago

@Xuanwo I think performance is not the critical factor here, and we may not enable some of the advanced threading features. I think OpenDAL as a backend for a RocksDB Env would be a good way to solve both this issue and backup to HDFS.

Xuanwo commented 6 months ago

opendal as a backend of RocksDB Env

It looks like a good idea. I don't have much understanding of the RocksDB Env, so I don't know if it's possible with a simple wrapper.

My friend @leiysky told me that the RocksDB Env requires append support, which is not widely supported by object storage services (at least S3 doesn't). And even for services that do support append, appending many small chunks might not perform well. This could be an issue.

Note: OpenDAL itself does support append, but S3 doesn't.
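
To make the limitation concrete, here is a minimal, non-authoritative C++ sketch of what an Env-level integration would have to do for a store without append: buffer every Append() and upload the whole object on Close(). UploadObject is a made-up placeholder (not a Kvrocks, RocksDB, or OpenDAL API), and a real implementation would also need to handle large files, retries, and Sync() semantics:

```cpp
#include <string>
#include <utility>

#include <rocksdb/env.h>
#include <rocksdb/slice.h>
#include <rocksdb/status.h>

// Placeholder for the chosen storage client (opendal, AWS SDK, ...).
rocksdb::Status UploadObject(const std::string& key, const std::string& data) {
  // Stub: a real version would perform the S3/GCS PUT here.
  (void)key; (void)data;
  return rocksdb::Status::OK();
}

// A WritableFile that buffers all appends in memory and uploads the whole
// object on Close(), because S3 has no native append.
class BufferedObjectWritableFile : public rocksdb::WritableFile {
 public:
  explicit BufferedObjectWritableFile(std::string key) : key_(std::move(key)) {}

  rocksdb::Status Append(const rocksdb::Slice& data) override {
    buffer_.append(data.data(), data.size());  // every small append lands here
    return rocksdb::Status::OK();
  }
  rocksdb::Status Close() override { return UploadObject(key_, buffer_); }
  rocksdb::Status Flush() override { return rocksdb::Status::OK(); }
  rocksdb::Status Sync() override {
    // Partial data cannot be made durable on S3; only Close() makes it visible.
    return rocksdb::Status::OK();
  }

 private:
  std::string key_;
  std::string buffer_;  // the whole file stays in memory until Close()
};
```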

mapleFU commented 6 months ago

After some discussion: maybe designing some new syntax and using another thread/process to upload a backup from the local file system to HDFS/S3 is also an option (rough sketch below). This avoids the complex logic of interacting with rocksdb::Env and could be done separately.
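
A rough, non-authoritative sketch of that direction, assuming the checkpoint-based backup has already been written locally; UploadFile is again a placeholder for whatever storage client is eventually chosen:

```cpp
#include <filesystem>
#include <string>
#include <thread>

// Placeholder for the chosen storage client (opendal, AWS SDK, ...).
bool UploadFile(const std::string& local_path, const std::string& remote_key) {
  // Stub: a real version would stream the file to S3/HDFS here.
  (void)local_path; (void)remote_key;
  return true;
}

// Upload an existing local backup dir in a background thread, keeping all
// object-storage logic outside of rocksdb::Env.
std::thread StartBackupUpload(const std::string& backup_dir,
                              const std::string& remote_prefix) {
  return std::thread([backup_dir, remote_prefix] {
    namespace fs = std::filesystem;
    for (const auto& entry : fs::recursive_directory_iterator(backup_dir)) {
      if (!entry.is_regular_file()) continue;
      // Preserve the directory layout so the backup can be restored as-is.
      const std::string rel = fs::relative(entry.path(), backup_dir).string();
      UploadFile(entry.path().string(), remote_prefix + "/" + rel);
    }
  });
}
```

The caller would join (or detach and track) the returned thread and only delete the local backup once every upload has succeeded.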