Open beef9999 opened 1 year ago
Inspiring work. I'm also doing similar research on RocksDB. I'm confused by the author's explanation of the poor performance on async write. The given reason is that coroutines are not good at CPU-bound work on a single machine. However, the switching cost of coroutines is small compared to thread context switching. So I think coroutines should at least perform as well as the original RocksDB in the worst case on a single machine.
I currently have some experimental data suggesting that improper use of io-uring can lead to performance degradation.
What is your opinion on this? If you find it inconvenient to communicate on GitHub, can we communicate via e-mail or QQ?
Thx for your consideration.
I am curious if this work could be done as a new "port" of RocksDB? Most of the changes appear to be straight-up replacement for things in the std namespace with ones implemented elsewhere. For example, could this change be implemented by changing the use of std::mutex (et al) with "port::mutex" (or similar)? And if so would this open up other possible implementations of co-routine (like folly)?
Curious if you can try running stress test in addition to db_bench?
- compile db_stress
- run `make crash_test`.
@riversand963 I found the RocksDB 6.x in 2019 didn't have the crash_test yet, and our experiment was based on this old version.
By the way, do you know if RocksDB has ever tried the folly coroutine?
Yes, RocksDB's async IO in MultiGet uses folly coroutines. https://github.com/facebook/rocksdb/wiki/RocksDB-Contribution-Guide#folly-integration
@riversand963 @mrambacher @siying @mdcallag
Hi folks, I did some investigations on folly coroutines, and found some issues:
- It's too heavy to use. When we try to build folly with `make build_folly`, tons of projects will be downloaded and compiled, for instance, 100+ MB of `boost` source code. The Meta internal folly project may have provided some good tools for RocksDB, but it should stay within header files. It's not worth adding the whole project just because we need coroutines.
- Only supports `MultiGet`. Basically this change swapped the sync `for` loop for an async batch submit and wait, when we need to query multiple keys. But there are more scenarios that need to introduce the coroutine technology. Even for the normal `Get` for a single key, we may still use coroutines to optimize locks and reduce contention.
- Not friendly to the legacy code. We have to adapt to the C++20 syntax, and add `co_await` and `co_return` to every piece of code.
- `io_uring` is not natively integrated. For now we need to enable both `USE_COROUTINES` and `WITH_LIBURING`, otherwise the coroutine would still be using `psync` calls, instead of async I/O. And the current `io_uring` implementation in `io_posix.cc` is quite verbose, lacking proper encapsulation.
So I wonder if there is any possibility that the RocksDB community is willing to accept a new coroutine lib, or at least an alternative. Our project, `PhotonLibOS`, is clean and small. No third-party dependencies, and `io_uring` is natively integrated into the I/O stack. The core coroutine lib would be only 3k ~ 4k lines (if the community thinks that it's better to use source code directly, rather than adding the whole project).
About the next work, my plan is to refer to the current `MultiGet` implementation, and work out an equivalent version with our lib, so that the libs' performance could be compared on this specific API. What do you think?
This looks like a good starting point.
There is some existing work on improving MultiGet that you may refer to: https://juejin.cn/post/7136110366546722847 https://juejin.cn/post/7134972530904793119
Background
RocksDB is a well-known embedded and persistent KV database in the industry. It has a log-structured storage engine and has been specially optimized for fast, low-latency storage devices. RocksDB is written in C++ and was open-sourced in 2013. Its code style is mature and stable, and the test coverage rate is high. The project also comes with a wealth of performance benchmark tools. It could be said that studying RocksDB and learning its engineering practice is always a hot topic for engineers who are interested in storage and low-level optimization.
RocksDB uses multiple threads to support concurrency. In certain circumstances, however, coroutines might be lighter and more efficient than threads. According to tests, when the system is busy, a thread switch could take as long as 30 μs, while a coroutine switch would only take a dozen or so nanoseconds.
`PhotonLibOS` (hereafter referred to as Photon) is a high-performance C++ coroutine library and IO engine open sourced by Alibaba. We have compared some dedicated programs we developed with Photon against industry candidates such as `fio` and `Nginx`, and the former achieved better performance results. Meanwhile, it happened that a certain business team inside Alibaba was using RocksDB, and their network + storage architecture had encountered some performance bottlenecks, so we began to investigate whether we could help them solve this issue by introducing the coroutine technology. BTW, this is Photon's first grafting attempt on a large-scale mature software.
Coroutine transformation
Let's look at the conclusion first: all the work went surprisingly smoothly. Without changing the main logic of RocksDB, by manually modifying 200 lines of code and then using a small script to scan the code and do automatic conversion, we were able to build and run RocksDB successfully.
According to the customer's requirements, we were using RocksDB 6.1.2 (released in 2019), which contains 3175 test cases. After transformation, the new coroutine version passed 3170 of them, a success rate of 99.87%. After preliminary analysis, the 5 failures were all caused by the fundamental difference between threads and coroutines; for instance, a test case explicitly assumed it was running in a thread environment and performed some extra checks. However, these failed cases do not affect the normal operation of RocksDB.
In terms of performance, we used the `db_bench` tool to measure KV OPS in four typical usages. The results show that the new coroutine version achieved performance similar to the original one. In some heavy-IO and high-concurrency circumstances, the former could even double the performance (explained later).
Coroutine library introduction
1. Concurrency Model
There are four commonly used concurrency models: `Multi-threaded`, `Async-callback`, `Stackful coroutine`, and `Stackless coroutine`. Photon is a stackful coroutine implementation.
As shown in the figure below, Photon's code does not use the name `coroutine` or `fiber` according to the traditional naming convention, but still calls it a `thread`. Multiple `threads` run on top of a `vcpu`, and the `vcpu` here refers to the well-known native OS thread. Each `vcpu` will only occupy one CPU core at a time. Even though a `vcpu` may shift among CPU cores, this is not perceptible to `threads`. `Threads` have their own mechanism to migrate across `vcpus`.
The reason for all this naming is that Photon has always regarded coroutines as a kind of lightweight thread. When designing the coroutine API, it also tries to be compatible with the `POSIX` standard and `C++ std` syntax. Unless told, developers can't even tell whether this is a multi-threaded program or a coroutine program. This is one of the key features that makes Photon unique among open-source coroutine libs.
Besides, since the stackful coroutine implementation does not depend on compiler features (such as `co_await` and `co_return` in C++20), and the switching point is encapsulated in the IO operation or event engine, it's less intrusive to the legacy code.
2. Async event engine
Each `vcpu` contains an asynchronous event engine. The so-called `event` may come from the following aspects:
Because it is necessary for Photon to determine the calling sequence of coroutines and the execution timing of IO, we regard it as not only a coroutine library, but also a high-performance event scheduler. It supports multiple asynchronous engines, such as `epoll, io_uring, kqueue`, etc. On Linux kernels 5.x and above, we recommend using the io_uring engine. The io_uring engine is able to perform batch submissions and reaps through a single syscall, and thus improves the overall performance of the system.
Moreover, the biggest difference between io_uring and epoll that ordinary users will notice is that the io_uring engine naturally supports asynchronous file IO. After lib encapsulation, users can easily write synchronous-looking code. Unlike `libaio`, no registration or callback is required, and memory doesn't need to be aligned either. Therefore, we did not encounter any trouble when using this interface to transform the original psync IO of RocksDB, but simply replaced the function names.
3. Synchronization, locks, and atomic operations
There are many ways to achieve synchronization in a concurrent system. In addition to the classic `mutex` and `semaphore` specified by POSIX, some language frameworks have proposed their own synchronization semantics. For example, Golang's `channel` embodies one of its philosophies, that is, "Don't communicate by sharing memory, but share memory by communicating". Photon's mutex and semaphore basically follow the design of POSIX, slightly modified for coroutines. We know that multi-threaded synchronization primitives generally rely on the `Futex` functionality provided by the kernel. The two major operations of Futex are FUTEX_WAKE and FUTEX_WAIT. Similarly, Photon's mutex is implemented much like a user-mode Futex, which uses the coroutine's `interrupt` and `sleep` mechanism, and manages waiting tasks through a linked list.
Regarding the use of atomic operations, threads and coroutines are basically the same. The only difference is that if developers can determine that a certain variable will only be used by coroutines on a single vcpu, there is no need to use atomic variables, because a single vcpu is itself thread-safe.
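To make the last point concrete, here is a minimal sketch of why a plain variable is safe when every task sharing it runs on one vcpu. It does not use Photon: `ToyVcpu` and `count_on_one_vcpu` are hypothetical names, and the toy scheduler only models cooperative tasks as step functions that run round-robin on a single OS thread, switching only when a step returns.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Toy stand-in for one vcpu (NOT Photon's scheduler): runs tasks round-robin
// on the calling OS thread. A task is a step function that returns true while
// it still has work left; returning from a step models a yield point.
class ToyVcpu {
public:
    void spawn(std::function<bool()> step) { tasks_.push_back(std::move(step)); }

    void run() {
        while (!tasks_.empty()) {
            for (std::size_t i = 0; i < tasks_.size();) {
                if (tasks_[i]()) {
                    ++i;  // task yielded, move on to the next one
                } else {
                    // task finished, drop it from the run queue
                    tasks_.erase(tasks_.begin() + static_cast<std::ptrdiff_t>(i));
                }
            }
        }
    }

private:
    std::vector<std::function<bool()>> tasks_;
};

// All tasks share `counter`, a plain long rather than std::atomic<long>.
// Tasks interleave only at yield points and never run in parallel,
// so no increment can be lost.
long count_on_one_vcpu(int n_tasks, int steps_per_task) {
    long counter = 0;
    ToyVcpu vcpu;
    for (int t = 0; t < n_tasks; ++t) {
        int remaining = steps_per_task;
        vcpu.spawn([&counter, remaining]() mutable {
            if (remaining == 0) return false;
            ++counter;
            --remaining;
            return remaining > 0;
        });
    }
    vcpu.run();
    return counter;
}
```

Calling `count_on_one_vcpu(64, 1000)` interleaves 64 tasks and still counts every increment. As soon as tasks may run on several vcpus, the counter would need an atomic or a lock again.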
4. Transformation steps
The following will introduce the details about how to rewrite RocksDB into coroutines:
1. First, replace all standard C++ elements such as threads and synchronization primitives with the equivalents of Photon's coroutine version. Here is a classic example of condition variable:
After transformation, the code turns into:
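The original before/after listings are not reproduced here. As a minimal sketch of the kind of change described, the block below writes a condition-variable handshake against a namespace alias, so that the whole transformation is visible as a one-line switch. The alias `conc` and the function `wait_for_value` are hypothetical. The commented-out Photon line assumes that Photon's headers are included, that its environment has been initialized, and that `photon::std` provides `lock_guard` and `unique_lock` alongside the types named in this article.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Original, multi-threaded version: the alias points at the standard library.
namespace conc = std;
// Coroutine version (assumption: requires Photon's headers and runtime init):
// namespace conc = photon::std;

// A producer sets a value and signals; the caller waits on the condition variable.
int wait_for_value() {
    conc::mutex mu;
    conc::condition_variable cv;
    bool ready = false;
    int value = 0;

    conc::thread producer([&] {
        {
            conc::lock_guard<conc::mutex> lock(mu);
            value = 42;
            ready = true;
        }
        cv.notify_one();
    });

    {
        conc::unique_lock<conc::mutex> lock(mu);
        // With std this blocks the OS thread; in a coroutine runtime the
        // equivalent wait would yield to the scheduler instead.
        cv.wait(lock, [&] { return ready; });
    }
    producer.join();
    return value;
}
```

The body of `wait_for_value` stays untouched between the two variants. Only the namespace that the types are taken from changes.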
As we can see here, the rule is very simple: just add the `photon::` prefix in front of `std`.
We believe that such simplicity will help flatten the learning curve for lib users and bring convenience when migrating legacy codebases. Digging into the code of photon::std::thread, we can find that it is actually a template class that supports passing in global functions, class member functions, lambdas, etc. Every time a new thread is created, a coroutine will be generated to run in the background. We know that RocksDB itself has a thread-pool for performing tasks such as `compaction` and `flush` in the background. After replacement, it naturally turns into a coroutine-pool.
In the coroutine code, the original `sleep_for` and `wait` functions will no longer block the calling thread, but will yield the CPU, and the scheduler will determine the next coroutine to run and do context switching.
2. The second step is to delete all thread-specific function calls, such as `pthread_setname_np`, which renames threads, or those syscalls that change the IO priority of the current thread.
3. Finally, replace the `thread_local` keyword with `photon::thread_local_ptr`. As we all know, C++11 introduced this new keyword to replace the `__thread` provided by the compiler, or the `specific_key` related functions provided by the pthread library. RocksDB relies heavily on thread local variables. It will look up the Version value stored in the current thread and do the comparison in every persistent IO. If the value is outdated, it will probably try to acquire locks or atomic variables to get the latest Version. Similarly, a Photon program also needs this kind of local cache, so that coroutines can keep a piece of exclusive data of their own.
Code example:
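The original listing is not preserved here. As a hedged sketch of the caching pattern described in step 3, the block below uses the standard `thread_local` keyword: each thread keeps a private copy of a version number and takes a slow path only when that copy is stale. All names (`g_latest_version`, `t_cached_version`, `current_version`, `demo_slow_path_hits`) are hypothetical, and RocksDB's real Version handling is more involved. In the coroutine port, the `thread_local` declaration is the line that would be rewritten in terms of `photon::thread_local_ptr`, whose exact signature is not shown in this article.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical shared state standing in for the latest published version.
std::atomic<std::uint64_t> g_latest_version{1};
// Counts how often the slow path (refreshing the local cache) is taken.
std::atomic<int> g_slow_path_hits{0};

// Per-thread cache. This is the declaration the transformation would replace
// with photon::thread_local_ptr (assumption: exact API not shown here).
thread_local std::uint64_t t_cached_version = 0;

// Compare the local copy with the shared value and refresh it only if stale.
std::uint64_t current_version() {
    const std::uint64_t latest = g_latest_version.load(std::memory_order_acquire);
    if (t_cached_version != latest) {
        // Slow path: a real implementation might take a lock here.
        ++g_slow_path_hits;
        t_cached_version = latest;
    }
    return t_cached_version;
}

// Returns how many slow-path refreshes the demo caused.
int demo_slow_path_hits() {
    const int before = g_slow_path_hits.load();
    t_cached_version = 0;   // force a miss on the calling thread
    current_version();      // miss: cache is refreshed
    current_version();      // hit: cache is up to date
    std::thread other([] { current_version(); });  // new thread, own cache: miss
    other.join();
    return g_slow_path_hits.load() - before;
}
```

The second call on the same thread does not touch the slow path, while a fresh thread starts with its own empty cache. A coroutine-local equivalent gives each coroutine the same kind of private copy.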
db_bench standalone test
In order to facilitate verification, we forked a RocksDB repo from github, and submitted a Pull Request, including the 200 lines of changes mentioned above.
Please refer to the photon-bench.md file for detailed steps. Note that the current implementation needs to explicitly specify the number of vcpus, and the default setting is 8. For the sake of fairness, the `taskset` command was used in the test, and the maximum number of cores for multi-threaded programs is also limited to 8. In terms of concurrency, the default value of `db_bench` is 64, and this value is the same for coroutines and threads.
The test machine is a high-end cloud VM, using the 6.x kernel and the gcc 8 compiler. The number of keys is 10 million. The page cache is cleaned (cold start). Test time is 1 minute, and the final data are as follows (unit: OPS/s).
When doing read or sync write, the performance of the coroutine version is basically the same as that of the original version. When doing async write, data doesn't need to be flushed immediately. Because RocksDB's LSM-based storage engine can efficiently convert random writes into sequential writes, the performance is tremendously optimized with the help of the page cache. So we guess that's one of the major reasons why the only performance decrease is observed here. In this scenario, the entire workload becomes CPU-bound, and that is what coroutines are not good at. They are designed for I/O and high concurrency.
In addition, another important reason is that we only performed syntax replacement without doing targeted software tuning. For example, the original version uses `asm volatile("pause")` to idle wait on a thread for the current CPU. Could it be done by switching to coroutine sleep in the new scenario? The original version contains a `core_local` module to accelerate per-core variable access; how should it be transformed properly? There are still some issues to be discussed.
Ace in the hole: coroutine-based network database
Seeing this, some people may ask: since the coroutine version of RocksDB doesn't seem to be very remarkable in standalone testing, why did this work ever need to be done? In fact, the greatest value of coroutines lies in unlocking the potential performance of a network database, especially when we have a lot of clients.
For a long time, the `epoll loop` has been the de facto standard for implementing a high-performance net server. Whether it's an async-callback solution like `Java netty` or `boost asio`, or a coroutine solution like `Golang`, the problem left to developers has always been how to achieve highly concurrent IO within a small number of threads. Indeed, RocksDB itself is very friendly to multi-threaded code, but after being embedded in a net server, we will have to utilize thread-pools to distribute and maintain client requests. One side is an `asynchronous multiplexing` system, and the other side is a `synchronous` system, and that's why the connector sometimes becomes the bottleneck.
On the other hand, because RocksDB has enabled `group commit` by default, multiple write requests will be combined into one. So the larger the concurrency, the better the performance will be. Coroutines can easily support millions of concurrent tasks, while threads would struggle to deal with serious contention at such a scale.
As per the requirement of our customer, we embedded RocksDB in an RPC server, reduced the KV size and the total number of keys, and increased the number of clients to 1000. The two test candidates are:
The results are as follows (unit: OPS/s)
In this test, in order to be more friendly to multithreading, we even removed the `taskset` limit, allowing the original program to use up to 64 physical CPU cores. However, as the number of threads increases, the bottleneck emerges. In contrast, the coroutine solution only uses 8 threads (vcpu), and has achieved twice the performance of the former one.
Summary
We successfully transformed a large-scale database software into coroutines by introducing the Photon library. It has proved the theoretical advantages of coroutines in heavy IO and high-concurrency circumstances.
It should be noted that since we are not experts in RocksDB, the transformation only stays at the language syntax level. We believe that some tuning methods should be leveraged to refine RocksDB's internal logic, and to make it more adaptable to the three-level model, i.e., coroutines, threads, and CPU cores. Anyway, the goal is to maximize the cache hit rate and reduce the probability of resource contention, so that we could squeeze out coroutine performance in those CPU-bound workloads as well.
Finally, the PhotonLibOS project is open sourced at https://github.com/alibaba/PhotonLibOS. If you are interested in C++ coroutines and high-performance IO, you are welcome to give it a try.