apache / brpc

brpc is an Industrial-grade RPC framework using C++ Language, which is often used in high performance system such as Search, Storage, Machine learning, Advertisement, Recommendation etc. "brpc" means "better RPC".
https://brpc.apache.org
Apache License 2.0
16.57k stars 3.98k forks source link

What are the differences between bthreads and goroutines? #45

Closed xuebinsu closed 6 years ago

xuebinsu commented 7 years ago

Since they both seem to be user-level M:N threads, I'm wondering how they differ in terms of functionalities and design principles. Thanks!

jamesge commented 7 years ago

Following table is from an un-opensourced document, comparing pros/cons of different methods to achieve concurrencies at 2014. The judges on goroutine may not be true now(2017), just for a reference. People who know status quo of goroutines are welcomed to correct me.

NOTE: we're assuming pthread impl. is 1:1

feature pthread callbacks goroutine bthread
gdb supported impossible supported supported  
local storage yes no no yes
exchange timeslice no no no yes // see note1
priority yes no no no
efficiency of concurrency low very high high // see note3 high // see note2
creation latency 100us 0.2us not measured, should be quick enough 0.2us
preemption full no simple no
easy-to-use ok bad good ok
max number < 1000 very high lower than callbacks lower than goroutines
scheduling latency 3-20us 3-20us 3us-10ms(gc) 3-20us(more busy more lower)

note1: exchange timeslice with the newly created thread to make it scheduled ASAP.

note2: parallelism is limited by #cpu-cores, can't be breakthrough by software, however concurrency can. The reason why bthread is more efficient at concurrency is butex, which makes blocking of bthread being separated from blocking of worker pthread. When a bthread blocks on butex, current worker pthread will continue to run other bthreads. For example, a server needs 200 pthreads(frequently blocking on mutexes/conditions) to maintain enough throughput, using bthread may lower number of worker pthreads to 8-24 (close to system load or #cpu-cores), however goroutine may still need ~200 workers(not true in 2017, see note3 below), because goroutines blocking on synchronizing primitives block underlying LWP directly.

note3: golang has a futex-like primitive now, https://github.com/golang/go/blob/b067ad939d9ba242c5c3bdd8a24220632311c6be/src/runtime/sema.go

xuebinsu commented 7 years ago

Thanks for your reply! The table does help to deepen my understanding of bthreads. But in my understanding, Go also has a scheduler to ensure that blocking one goroutine would not block the underlying kernel-level thread and it also uses work-stealing to take advantage of multi-core processors. (Please correct me if I'm wrong.)

jamesge commented 7 years ago

work-stealing (steal half of the queue as I remembered) and scheduler does not solve the concurrency issue. I had read source code of go at then and didn't find anything similar with butex. Without a butex-like synchronizing primitive, mutex or condition can only be implemented with futex-like primitives (at best), and goroutine waiting for a mutex/condition (think about the client waiting for response in a sychronous RPC) blocks the underlying LWP, being less efficient at concurrency.

xuebinsu commented 7 years ago

Thanks for your reply! In my understanding, Go does not rely on synchronization primitives to ensure that the execution would never be blocked. Instead, this is done by the Go scheduler, which switch the context to another goroutine that is runnable when the current goroutine is blocked, according to this article . To do that, Go hooks all blocking system calls and users can only call the hooked version instead.

jamesge commented 7 years ago

The underlying LWP is still blocked by syscall and the total number of active LWP is pretty limited. Since goroutine adds new workers on-demand, this issue is possibly not obvious to ordinary users. The only feasible method to make syscalls not block underlying LWP is to re-implement them, replacing the blocking primitives, which are mostly based on futexes.

xuebinsu commented 7 years ago

@jamesge

Thanks for your explanation! It solves many issues on currency that confused me since I tried to implement a N:1 user-level thread library as my undergraduate course project. By the way, is there any document on butex now? If not, will it be released and when?

chenzhangyi commented 7 years ago

@xuebinsu The implementation of butex is not complex. Semantics of apis are just like futex. Read the source if you want to know more details.

xuebinsu commented 7 years ago

@chenzhangyi Thanks for your reply. I'll check that.

ariverhorse commented 7 years ago

Great work! I have not got time to read the detail implementation, just curious is this bthread concept similar to Facebook's folly::fiber where user level thread switching are built on top of boost::context? In another word, is M:N thread model running M fibers on top of N pthread? If I have a serivce with callback that has a lot of IO bound work, I will need to call butex.wait() to yield CPU to other user thread of the same pthread explicitly, right?

wolf0403 commented 7 years ago

I'm not sure I understand this discussion correctly, especially the part wrt futex and blocking syscalls.

A syscall is "blocked" means when you call into the kernel (that's what "syscall" mean - transferring execution into the kernel), the kernel waits for the whole operation to complete before returning control back to user-space. There is no mutex / futex involved and absolutely nothing you can reimplement in user space to avoid this "blocking" behavior, expect if you can find another non-blocking version of the syscall to replace / mimic the blocking version. Golang does this for network read / write but not for other syscalls. If a goroutine calls a blocking syscall (i.e. reads disk file) Golang creates an OS thread (referred to as "LWP", I assume?) to park that goroutine and carries on with others. There is no mutex / futex calls involved in these cases. My understanding is that because brpc has a fixed number of threads, it does not creates OS-threads in this cases so that after a bthread calls a blocking syscall, there will be one less OS-thread to schedule "bthreads".

If the discussion is about "calling OS-provided mutex will lock whole OS-thread" then no, Go sync.Mutex deals with it's scheduler and does not directly calls down to OS provided mutex and locks the whole OS threads. The source is at [1] and it's relatively simple and well documented.

[1] https://golang.org/src/sync/mutex.go

jamesge commented 7 years ago

@wolf0403 "note3" in above post clarifies that go does have futex-like primitive(sema) and sync.Mutex which is based on the primitive does not block OS-thread. But at the time that bthread was designed, as I remembered, the mutex in go was implemented by os.futex directly. On the other hand, after extensive usages of brpc inside Baidu, we know that blocking OS-thread or not isn't a big deal generally, because:

wolf0403 commented 7 years ago

Well, there are different problems here, again - 1) user-program initiated waits (mutex / futex / critical section, whatever) and 2) blocked syscallls.

I agree that the 1) should be minimized by user program in all cases, and maybe for most of the workload you care about, RPC calls are the majority and any other blocking syscalls are the long tails. The difference is that Go program does not have chance to have called OS mutex and blocked machine thread and large number of Goroutines, while the "butex" is opt-in by user programs and is not yet complete.

The problem here is that for both 1) and 2), bthreads will have a much larger impact than the current implementation of Go, because you are reducing the total number of OS-threads, or CPU cores that are doing actual computation / scheduling user-level threads. This gets more series when the M:N (go-routine vs OS threads) ratio is high, and it tends to be high for Go programs (1 OS-thread per core, so 32 or 64, while >1,000s or maybe orders of magnitudes higher for Goroutines).

(FTR: for "current" I guess it means after Go1.1 in 2012 after the scheduler change I mentioned earlier - but sync package in 1.0 does not use OS futex [1] already.)

[1] https://github.com/golang/go/blob/release-branch.go1/src/pkg/sync/mutex.go

jamesge commented 7 years ago

Having 10000 goroutines does not mean that they're blocked by syscalls at the same time. To understand the situation better, you have to calculate "concurrency" which is the multiply of latency and throughput (qps), and which is comparable to number of OS-threads. There're a lot of programs that have a lot of user threads(>10000) are actually low at concurrency (several), as we saw in miscellaneous systems inside Baidu. The root cause, as we agreed, is that synchronous RPCs contribute most of the blocking. Left blocking just adds a small fraction to the "concurrency". In addition, butex is not related to "fixed number of OS-threads". Actually some variants of bthreads do support auto-adaptive number of OS-threads. The opensource version also can manually change number of workers at run-time.

wolf0403 commented 7 years ago

Having 10,000 goroutines on 16 core systems means... when a goroutine makes a blocking system call, it's blocking 1 goroutine while there are still 16 cores scheduling the rest 9,999 goroutines. My understanding of bthreads is that you will have only 15 cores scheduling the other bthreads - be that 9375 or 9999 (which you need to do a lot of work stealing).

Sure, as I've mentioned earlier it may not be important for your particular workload (either your user-threads are doing long waits on network I/O, or similar). I've handled large number of low traffic, long pooling requests with gevents and they are also fine for my particular case, but clearly gevent is nothing near Golang in completeness, robustness, etc. For the demanding workload which really need to push it, Golang clearly has done (not "potentially doing", but actually delivered) better jobs, but YMMV.

jamesge commented 7 years ago

“you will have only 15 cores scheduling the other bthreads - be that 9375 or 9999”. yes and it even has advantages: the work stealing and TLS-related stuff will be more efficient. OS-threads are not that easy to be blocked one-by-one and finally get exhausted as you expected. Of course it's easy to construct edge cases to block all OS-threads, but in real applications, as long as the blocking operations(except RPC) only block for a reasonable amount of time, the blocking problem is more like calculating an expectation: "how many OS-threads will be blocked in average?", which is just the "concurrency" calculated above (according to little's law).

Besides, bthread offers bthread_mutex_t and bthread_condition_t which are nearly drop-in replacements for pthread-equivalences. It's users' decisions to use them or not. Designing like this because: It's much more dangerous to override blocking operations in C++ than in Go, which has full control of what user can do.

goroutines also need to balance different factors in overriding a blocking operation. If a user thread wants to override read/write, the only choice under linux is to use a epoll. As a result, the overridden read/write need to pay the price for epoll_ctl, which is not very efficient in multi-threaded environment. Thus the designer faces a dilemma: the overridden read/write are more concurrent in calls that block for a long time, however they take more time in calls that just block for a while or not block. Designers of a new language surely have more freedom, but they still need to make a choice. A common method is to find out the time that read/write may take in average, and make decisions based on that. As you see, this is also a decision according to probabilities. Or designers make a compromise between performance and other factors (like simplicity of code), which is another issue.