apache / kvrocks

Apache Kvrocks is a distributed key value NoSQL database that uses RocksDB as storage engine and is compatible with Redis protocol.
https://kvrocks.apache.org/
Apache License 2.0
3.47k stars 450 forks

Add support of bulk load for the string like HBase bulkload #1301

Open git-hulk opened 1 year ago

git-hulk commented 1 year ago

Motivation

Many scenarios need to bulk-load large amounts of data regularly, and loading it through the regular command API can cause heavy load and latency spikes. It would be better to offer a way to mitigate this.

Solution

We can use RocksDB's SST ingestion to bulk-load the data, and support simple strings only.

See more discussion in https://github.com/apache/kvrocks/discussions/1628

zuston commented 1 year ago

Thanks for proposing this. +1 for this feature.

ColinChamber commented 1 year ago

I'm willing to submit a PR!

git-hulk commented 1 year ago

@ColinChamber Assigned.

liucyao1990 commented 1 year ago

@git-hulk @ColinChamber Thanks for this PR. Is there any progress? Looking forward to this bulk load feature.

ColinChamber commented 1 year ago

I haven't had enough time recently, so I've unassigned myself. I hope someone else can pick this up. @liucyao1990

git-hulk commented 1 year ago

Thanks @ColinChamber for your update.

jihuayu commented 1 year ago

@git-hulk For this feature, do we need to provide a command to load data, or a tool?

In my opinion, there are two steps here.

  1. Create SST files with the data.
  2. Ingest the SST files.

The second step requires stopping the world.

Do we need to support online bulk load? Will there be problems with stopping the world?
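
The two steps can be pictured with a toy, in-memory model. This is only a sketch and is not RocksDB or Kvrocks code: `build_sorted_run`, `ToyStore`, and `ingest` are hypothetical names. Step 1 produces an immutable run sorted by key, and step 2 attaches the whole run while holding a lock, which is the short window in which writers have to wait.

```python
import bisect
import threading


def build_sorted_run(pairs):
    """Step 1: turn (key, value) byte pairs into an immutable, key-sorted run."""
    latest = {}
    for key, value in pairs:
        latest[key] = value  # the last value seen for a key wins
    return tuple(sorted(latest.items()))


class ToyStore:
    """Hypothetical stand-in for a store that accepts pre-built sorted runs."""

    def __init__(self):
        self._lock = threading.Lock()
        self._memtable = {}  # regular writes land here
        self._runs = []      # sorted runs, oldest first

    def put(self, key, value):
        with self._lock:
            self._memtable[key] = value

    def ingest(self, run):
        """Step 2: attach a whole run at once; writers wait only for this block."""
        keys = [k for k, _ in run]
        if any(a >= b for a, b in zip(keys, keys[1:])):
            raise ValueError("run must be strictly ascending by key")
        with self._lock:
            # Simplification: always flush earlier writes into their own run
            # first, so that the ingested data is newer than them.
            if self._memtable:
                self._runs.append(tuple(sorted(self._memtable.items())))
                self._memtable = {}
            self._runs.append(run)

    def get(self, key):
        with self._lock:
            if key in self._memtable:
                return self._memtable[key]
            for run in reversed(self._runs):  # newest run first
                i = bisect.bisect_left(run, (key,))
                if i < len(run) and run[i][0] == key:
                    return run[i][1]
        return None
```

In this model the expensive work (sorting and deduplicating) happens outside the lock, and the blocking part is only the hand-over of a finished run, which is the reason an online bulk load can be acceptable even though it briefly blocks writes.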

git-hulk commented 1 year ago

In my opinion, there are two steps here: 1. Create SST files with the data. 2. Ingest the SST files.

@jihuayu Yes, you're right. And I think it's good to only support the string type first.

Do we need to support online bulk load? Will there be problems with stopping the world?

My initial thought is yes, we should support online bulk load, even though ingesting SSTs will block write operations.

For this feature, do we need to provide a command to load data, or a tool?

From my side, I would like to support loading local SSTs via a command and also provide a tool to generate SST files. For the tool's input file, we can require users to put their data in a specified format such as CSV.
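
As one possible shape for the tool's input handling, the sketch below reads two-column CSV (`key,value`) and returns records in the order an SST writer would need. RocksDB's `SstFileWriter` expects keys to be added in ascending comparator order, which is why the sort happens before any file is written. The function name `load_sorted_records`, the CSV layout, and the "last row wins" rule for repeated keys are assumptions for illustration, not an agreed format. A real tool would presumably also have to sort the keys as Kvrocks encodes them on disk, which this sketch does not attempt.

```python
import csv
import io


def load_sorted_records(csv_text, encoding="utf-8"):
    """Parse `key,value` rows and return (key, value) byte pairs sorted by key."""
    latest = {}
    for row_no, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"row {row_no}: expected 2 columns, got {len(row)}")
        key, value = row
        if not key:
            raise ValueError(f"row {row_no}: empty key")
        # A repeated key overwrites the earlier row (assumed policy).
        latest[key.encode(encoding)] = value.encode(encoding)
    # Python compares bytes lexicographically by unsigned byte value,
    # which matches a bytewise key comparator.
    return sorted(latest.items())


records = load_sorted_records("b,2\na,1\nb,3\n")
# records == [(b"a", b"1"), (b"b", b"3")]
```

Validating and sorting up front means malformed input is rejected before any SST file is produced, so a failed run leaves nothing half-written to clean up.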

jihuayu commented 1 year ago

@git-hulk Ok, I'm willing to submit a PR!

git-hulk commented 1 year ago

Thanks @jihuayu, assigned.

@zuston @liucyao1990 You're also welcome to provide more input on how you would use the bulk load.

liucyao1990 commented 1 year ago

@git-hulk @jihuayu Hi, here is the bulk load ingestion implementation of Pegasus. https://github.com/apache/incubator-pegasus/pulls?q=label%3Acomponent%2Fbulk_load+. FYI

git-hulk commented 1 year ago

@git-hulk @jihuayu Hi, here is the bulk load ingestion implementation of pegasus. https://github.com/apache/incubator-pegasus/pulls?q=label%3Acomponent%2Fbulk_load+. FYI

Cool, thanks for your input.

jihuayu commented 1 year ago

I will create the SST generation tool first. Since we have cluster and replication modes, SST ingestion may work differently in each. I think I can support ingestion in standalone mode first.

git-hulk commented 1 year ago

Yes, that's right. It's fine NOT to support replication for now.

JackyYangPassion commented 5 months ago

Are there any updates here?

jihuayu commented 5 months ago

@JackyYangPassion No. Do you want to have a try?

JackyYangPassion commented 5 months ago

@JackyYangPassion No. Do you want to have a try?

OK, I've been researching how to generate SST files recently.

I read the discussion in https://github.com/apache/kvrocks/discussions/1628 carefully.

Initially, will this feature support only the String type?

git-hulk commented 5 months ago

@JackyYangPassion Yes, we would like to support strings first since they are the simplest type. It would definitely be great if other data types could be covered as well.

jihuayu commented 5 months ago

@JackyYangPassion Thank you! Supporting strings is the first step in our plan. We want to start with a basic version that users can try, so that we can gather feedback on the functionality as early as possible. In later stages, we will support more types and features.