Open Smityz opened 3 years ago
Redesign of the Pegasus scanner, to solve the scan timeout problem. In my opinion, the root cause of the problem is the way the data is sorted. RocksDB's data should use a customized Comparator that keeps keys sorted by userkey (hash_key, sort_key); then prefix filtering should be very fast.
Why did the comparator use the default BytewiseComparator at the beginning? Maybe Pegasus can now switch to the new customized comparator. To avoid data incompatibility, we can support two comparators (adding the new one), and new Pegasus clusters use the new comparator.
1. For suffix filters, we still have to scan all the data, so the cost is the same as before; maybe that filter is less important.
2. For prefix filters, we no longer need to scan all the data, and speed will increase because fewer keys are scanned.
Changing the comparator will be a pain, as none of the old data can be read any more. Should we introduce a table-level flag to indicate whether to use the customized comparator? We also need to test the performance impact of using the customized comparator.
First of all, we use the default BytewiseComparator because we designed the key schema based on it. We put the hashkey length ahead of the hashkey bytes in order to prevent key conflicts like:
hashkey = a, sortkey = xxx
hashkey = ax, sortkey = xx
With the default comparator, the two keys are seen as distinct:
01axxx
02axxx
So we chose this method, but didn't consider that one day we would need prefix filtering on the hashkey. So now the problem is: how can we upgrade our key schema version to support efficient hashkey prefix filtering, or find some other workaround without modifying the key schema (and also giving up support for hashkey sorting), like the solution that @Smityz came up with above.
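To make the conflict-prevention argument concrete, here is a minimal sketch of the length-prefixed encoding described above (the helper name `encode_key` and the 2-byte big-endian length are illustrative assumptions, not the actual Pegasus implementation):

```python
import struct

def encode_key(hashkey: bytes, sortkey: bytes) -> bytes:
    # Illustrative sketch: a 2-byte big-endian hashkey length precedes
    # the hashkey bytes, then the sortkey follows.
    return struct.pack(">H", len(hashkey)) + hashkey + sortkey

k1 = encode_key(b"a", b"xxx")   # hashkey="a",  sortkey="xxx"
k2 = encode_key(b"ax", b"xx")   # hashkey="ax", sortkey="xx"

# Without the length prefix both keys would flatten to b"axxx" and
# collide; with it, plain bytewise comparison keeps them distinct:
assert k1 != k2
assert k1 < k2  # b"\x00\x01axxx" < b"\x00\x02axxx"
```

The trade-off is exactly the one discussed in this thread: the length bytes come first in the comparison, so keys are grouped by hashkey *length* rather than by hashkey prefix, which defeats straightforward prefix seeking.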
So let's change the comparator and check the performance impact first?
If there are no compatibility issues, I think changing the comparator is feasible. Looking forward to your PR @shenxingwuying
- We can set a `HeartbeatCheck` during scanning, like HBase's StoreScanner: the Pegasus server sends heartbeat packets periodically to avoid timeouts, so the scan behaves like a stream.
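The heartbeat idea can be sketched as follows. This is a toy simulation, not HBase or Pegasus code: the function name `scan_with_heartbeat`, the message tuples, and the timings are all illustrative assumptions. A background thread emits periodic heartbeat messages while a slow scan is in progress, so the client's deadline keeps getting refreshed even when no row has matched yet:

```python
import threading
import time

def scan_with_heartbeat(rows, match, send, heartbeat_interval=0.02):
    # Illustrative sketch: stream heartbeats alongside a slow scan.
    done = threading.Event()

    def beat():
        # Fires every heartbeat_interval seconds until the scan ends.
        while not done.wait(heartbeat_interval):
            send(("heartbeat", None))  # keeps the connection alive

    t = threading.Thread(target=beat, daemon=True)
    t.start()
    try:
        for row in rows:
            time.sleep(0.03)  # simulate slow disk scanning
            if match(row):
                send(("row", row))
    finally:
        done.set()
        t.join()
    send(("done", None))

msgs = []
scan_with_heartbeat(["a", "b", "ab"], lambda r: r.startswith("ab"), msgs.append)
# msgs interleaves ("heartbeat", None) packets with the matched row.
```

The client treats any message, heartbeat or data, as proof the server is still working, so a sparse filter no longer trips the RPC timeout.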
@Apache9 @Smityz https://github.com/XiaoMi/pegasus-java-client/pull/156 and https://github.com/XiaoMi/pegasus-go-client/pull/86 have fixed the bug where the next retry fails after a timeout, so you can work around the problem before refactoring the scanner.
Proposal: Redesign of the Pegasus Scanner
Background
Pegasus provides three interfaces, `on_get_scanner`, `on_scan`, and `on_clear_scanner`, for clients to execute scanning tasks. If we want to scan a whole table, the client first calls `on_get_scanner` on each partition, and each partition returns a `context_id`: a random number generated by the server to record parameters such as `hash_key_filter_type` and `batch_size`, together with the context of the scanning task. The client then uses this `context_id` to call `on_scan` and completes the scan in each partition in turn. The server scans all of the table's data on disk and returns the matching values to the client in batches. When the task ends or any error happens, the client calls `on_clear_scanner` to clear its `context_id` on the server.
Problem Statement
In actual use, such a design causes some problems. If we execute this scanning task:

1. The server will scan all the data in the table and then return the keys matching the prefix pattern. We can speed this up by using the prefix-seeking features of RocksDB.
2. Although we have a batch size to limit the scan time, it does not work if the matching data is sparse. In the case above, we may need to scan almost the whole partition even though no row matches the prefix, so the scan easily times out.
Proposal
For problem 1

Pegasus' key schema is `[hashkey_len(2bytes)][hashkey][sortkey]`, so we can't directly use prefix seeking on the hashkey. But we can prefix-seek `[01][prefix_pattern]`, `[02][prefix_pattern]`, `[03][prefix_pattern]` ... `[65535][prefix_pattern]` in RocksDB.

For problem 2
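The per-length seek trick above can be sketched with a sorted key list standing in for a RocksDB iterator (the helpers `encode_key` and `prefix_scan` are illustrative names, and `bisect_left` plays the role of the iterator's `Seek`):

```python
import bisect
import struct

def encode_key(hashkey: bytes, sortkey: bytes = b"") -> bytes:
    # Sketch of the [hashkey_len(2 bytes)][hashkey][sortkey] schema.
    return struct.pack(">H", len(hashkey)) + hashkey + sortkey

def prefix_scan(sorted_keys, prefix: bytes, max_len: int = 8):
    """For each possible hashkey length >= len(prefix), seek to
    [len][prefix] and iterate while keys still carry that byte prefix."""
    out = []
    for n in range(len(prefix), max_len + 1):
        seek = struct.pack(">H", n) + prefix
        i = bisect.bisect_left(sorted_keys, seek)  # like Iterator::Seek
        while i < len(sorted_keys) and sorted_keys[i].startswith(seek):
            out.append(sorted_keys[i])
            i += 1
    return out

keys = sorted([
    encode_key(b"ab", b"s1"),
    encode_key(b"abc", b"s2"),
    encode_key(b"xy", b"s3"),
])
# Only regions holding hashkeys that start with "ab" are visited;
# the "xy" range is skipped entirely.
matches = prefix_scan(keys, b"ab")
```

In the worst case this issues one seek per possible hashkey length, but each seek reads only keys that actually match, instead of sweeping the whole partition.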
We can set a `HeartbeatCheck` during scanning, like HBase's StoreScanner: the Pegasus server sends heartbeat packets periodically to avoid timeouts, so the scan behaves like a stream. We can also change the way batch size is counted: from the number of matching values to the number of values already scanned.
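The proposed batch accounting change can be sketched as follows. Names here (`scan_batch`, the cursor convention) are illustrative, not the real Pegasus API: each `on_scan`-style call scans at most `batch_size` rows, whether or not they match, so a sparse filter can never pin the server past its timeout:

```python
def scan_batch(rows, start, batch_size, match):
    """Scan at most batch_size rows starting at `start`; return the
    matches found plus the cursor where the next call should resume."""
    matched = []
    scanned = 0
    pos = start
    while pos < len(rows) and scanned < batch_size:
        if match(rows[pos]):
            matched.append(rows[pos])
        scanned += 1  # count every scanned row, not just the matches
        pos += 1
    return matched, pos  # pos == len(rows) means the scan is complete

rows = ["ab1", "zz1", "zz2", "ab2", "zz3"]
is_ab = lambda r: r.startswith("ab")
batch1, cursor = scan_batch(rows, 0, 3, is_ab)   # returns after 3 rows
batch2, cursor = scan_batch(rows, cursor, 3, is_ab)  # resumes at row 3
```

Each call returns promptly even when few rows match; the client simply keeps calling with the returned cursor until the partition is exhausted.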