proposal: refactor and enhance the write batch extractor

git-hulk commented 3 weeks ago

Search before asking

[X] I had searched in the issues and found no similar issues.

Motivation

Currently, the write batch extractor plays a key role in cluster migration and sync tool, and we would also like to introduce it to our change stream mechanism.

But it now might have those (potential)issues:

missing enough test cases to
unable to extend due to it bindings to the RESP format
hard to maintenance due to implement all data types inside one function

This was also mentioned in issue #2583. To improve this situation, I would like to propose a new design for the write batch extractor.

Solution

Introduce an intermediate representation format for the parsed format like ChangeStreamEvent and it will contain the following fields:

struct ChangeStreamEvent {
        int16_t event_type;
    int16_t event;
    int16_t data_type;  
    std::string key;
        std::varint<std::string, double, std::vector<std::string>> payload;
};

event_type: required, the type of this event and it will affect how we interpret the payload. Can be one of VALUE | COMMADNS
event: required, it should be ADD|DELETE|SET if the event type is VALUE
data_type: required, should be one of [STRING|HASH|SET|ZSET|LIST|STREAM|BITMAP|JSON]
key: required, the key which was affected
payload: arguments for the event

For example, it will emit an event after receiving the command HSET my_hash f0 v0} :

{       
        .event_type = "VALUE"
    .event = "SET"
    .data_type = "HASH"
    .key = "my_hash"
    .payload = ["f0", "v0"]
}

Then, the cluster migration/sync tool/poll updates can generate their own format based on the change stream event.

TODOs

Add support of WriteBatchExtractor to allow to transform into ChangeStreamEvent
Add support RESP as the default output format for ChangeStreamEvent
Replace the WriteBatchExtractor output with the ChangeStreamEvent
Add test cases for all data types.

What do you think? please don’t hesitate to leave comments if you have any thoughts or suggestions.

Are you willing to submit a PR?

[X] I'm willing to submit a PR!

PragmaTwice commented 3 weeks ago

It seems related to the logical logging cc @mapleFU .

Could you elabrate the detailed design of the "event" field?

How to define a new event? e.g. for ZSET data type, could you list all event, for every write commands here?

Also the "subkey" is related to different encoding for each data type. If we store "subkey" here, it's too "low level" to be a recorder for structured semantics. Also could you list all "subkey"s for all ZSET commands?

PragmaTwice commented 3 weeks ago

IMHO we should have a quite clear mind about the layered (or, leveled) semantics of execution of Kvrocks:

High level: Redis commands
Middle level: An orthogonal, but structured operations.
Low level: RocksDB key-value read/writes

Today we talk about the middle level operations.

This middle layer needs to have a clear design to meet three goals:

Orthogonal: Redis commands are obviously not orthogonal, but the middle layer must be orthogonal and can be "lower" from each Redis command to the middle layer.
Structured: The middle layer operators must have clear semantics, retaining high-level data structure information, rather than fragmenting into rocksdb reads and writes.
Integrated: The middle layer must have complete context information to avoid irreproducibility due to execution order changes.

As you can see, this is a complex task. I hope we can have a clearer design: for example, for each redis data type, what is the detailed design of its middle-layer operator?

Then, we can talk about implementations.

mapleFU commented 3 weeks ago

By the way, would it better to making rdb have the same structure? Or they're totally different?

PragmaTwice commented 3 weeks ago

By the way, would it better to making rdb have the same structure? Or they're totally different?

rdb is for serialization of data, not for serialization of execution. so I don't think it's so suitable.

git-hulk commented 3 weeks ago

How to define a new event? e.g. for ZSET data type, could you list all event, for every write commands here?

It generally contains: ADD|DELETE|SET. For example, the ZPOP command should generate a DELETE event.

If we store "subkey" here, it's too "low level" to be a recorder for structured semantics. Also could you list all "subkey"s for all ZSET commands?

What I mean by subkey here depends on the data type. For the Hash type, which we can see in the example, its subkey is the field f0. And for the ZSet type, its subkey should be its member.

I don't quite understand the meaning of low level here. Could you elaborate a bit about what you think?

Structured: The middle layer operators must have clear semantics, retaining high-level data structure information, rather than fragmenting into rocksdb reads and writes.

For the change stream event is now mapping to what's command changed instead of rocksdb read/writes. So that's why we have key and subkey.

git-hulk commented 3 weeks ago

As you can see, this is a complex task. I hope we can have a clearer design: for example, for each redis data type, what is the detailed design of its middle-layer operator?

Yes, I agree we should take care of the design before starting the implementation. We can reach a consensus about the high-level proposal, and then go into the details if all of us feel good about the proposal.

PragmaTwice commented 3 weeks ago

It generally contains: ADD|DELETE|SET. For example, the ZPOP command should generate a DELETE event.

How about commands like ZINTERSTORE, ZMPOP, ZPOPMIN? I'm still not sure if the current design is general enough. From my view it can easily become hard to model some complex scenarios.

e.g. for ZADD, the value should be member and score, but here you use just one std::string value, how to encode? and we need to decode to retrieve the score. IMHO it sounds a weird design.

Also I think it's better to generalize it to logical logging instead of just for migrate/sync.

git-hulk commented 3 weeks ago

@PragmaTwice I have changed to value to the payload which is a union type.

PragmaTwice commented 3 weeks ago

My current point is, if we cannot use it to model all operations, e.g. ZREMRANGE..., we'd better to keep the current (redis commands) form to make the impl unified and let user easily to understand.

But we can continuely focus and figure out a better design.

git-hulk commented 3 weeks ago

Got your point. I agree that using the commands would make it easy to have a unified way to implement the change logs. My concern is that need to rewrite many commands like HINCR/HINCRY to HSET to simplify the downstream implementation.

Genuineh commented 3 weeks ago

I am not very familiar with the internal structure of kvrocks, nor am I very familiar with C++. However, I personally think that it is only necessary to connect to rocksdb itself. As for the need for RESP or which data structure in https://kvrocks.apache.org/community/data-structure-on-rocksdb needs to be connected may be related to the user's needs (simply put, when considering performance, one will be more inclined to directly connect to the data-structure-on-rocksdb or kv of rocksdb). Only the parsing method of business-class data structures needs to be provided for users to use. In addition, maybe kvrocks will have its own business protocol and data structures that redis does not have in the future. Directly connecting to the structure on rocksdb can avoid this risk. And even if one only wants to use it as redis, if this structure is fully compatible with redis, it can be freely converted through the RESP parser.

apache / kvrocks