Generation instructions:

```
$ roachprod test benesch-test-kv --binary ./cockroach kv_95 -c 1-64/64
$ roachprod web kv_95.cockroach*
```

For a smoother graph, turn the concurrency to 1-64/16 (the default).
I'm using normal roachprod VMs and have yet to repro the badness.
n1-highcpu-16 did the trick. Consider this reproduced.
@m-schneider and I have our first lead. All of the graphs on the adminUI and the CPU and heap profiles looked fine. Everything was either equal or actually larger on the cluster with concurrency before the cliff (scaled exactly to the delta in throughput). The blocking profiles also looked similar until we ignored blocking in Go select blocks. Then we saw this:
On the very left of the 700-concurrency flame graph we can see that 4.22% of the blocking is in `StmtBuf.curCmd`. The same method is responsible for only 0.14% of blocking in the 600-concurrency case. `StmtBuf.curCmd` is called by a SQL connection's `connExecutor` goroutine when it is ready for another statement. The method waits for the corresponding pgwire goroutine to populate the buffer with a new statement. I think this indicates some slowdown on the workload/gateway connection. It's possible that we're hitting a connection limit somewhere in the network stack that's causing some slowdown there. It's also possible that this is a slowdown in the kv workload itself. We're going to continue digging in there.
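For concreteness, here is a minimal sketch of that handoff (heavily simplified, with invented names; not the real `StmtBuf` code): one goroutine produces commands, the other blocks in `curCmd` until something is available.

```go
// Minimal sketch of the handoff described above (simplified, not the real
// StmtBuf implementation): the pgwire goroutine pushes commands into a shared
// buffer, and the connExecutor goroutine blocks in curCmd until one arrives.
package stmtbufsketch

import "sync"

type cmdBuf struct {
	mu   sync.Mutex
	cond *sync.Cond
	cmds []string // simplified; the real buffer holds parsed pgwire commands
}

func newCmdBuf() *cmdBuf {
	b := &cmdBuf{}
	b.cond = sync.NewCond(&b.mu)
	return b
}

// push is called by the pgwire goroutine after reading a statement off the wire.
func (b *cmdBuf) push(cmd string) {
	b.mu.Lock()
	b.cmds = append(b.cmds, cmd)
	b.mu.Unlock()
	b.cond.Signal()
}

// curCmd is called by the executor goroutine and blocks until a command is
// available; the time spent in cond.Wait here is what shows up in the
// blocking profile.
func (b *cmdBuf) curCmd() string {
	b.mu.Lock()
	defer b.mu.Unlock()
	for len(b.cmds) == 0 {
		b.cond.Wait()
	}
	cmd := b.cmds[0]
	b.cmds = b.cmds[1:]
	return cmd
}
```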
@m-schneider is also going to run the same experiment on AWS to see if she's able to reproduce the results.
After taking another look, this looks very reproducible on a 16-node n1-highcpu-16 cluster. After seeing the same behavior as @benesch, I ran a version of the test that incremented concurrency a bit more slowly and found that the drop-off is generally around a concurrency of 650.
Then I tried the same test with a 24-node cluster and also saw a very similar drop-off, though marginally later.
> I think this indicates some slowdown on the workload/gateway connection. It's possible that we're hitting a connection limit somewhere in the network stack that's causing some slowdown there.
@a-robinson you have a lot of knowledge about networking. I'm interested in how you'd recommend investigating something like this.
Is the worker machine overloaded or at capacity? Is there a concurrency issue between reading a statement from the wire and notifying the goroutine waiting in `StmtBuf.curCmd`? We could add some instrumentation to check what that latency is.
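One cheap way to check that latency would be instrumentation along these lines (a hypothetical sketch with invented names, not existing code): stamp each command when the pgwire goroutine pushes it and measure how long the executor goroutine takes to pick it up.

```go
// Hypothetical instrumentation sketch, not existing CockroachDB code: record
// when the pgwire goroutine makes a command available and how long the
// executor goroutine takes to observe it.
package stmtbufsketch

import (
	"sync"
	"time"
)

type timedCmd struct {
	cmd      string
	pushedAt time.Time
}

type instrumentedBuf struct {
	mu   sync.Mutex
	cond *sync.Cond
	cmds []timedCmd
	// In a real experiment these would feed a histogram or timeseries metric.
	totalWakeupLatency time.Duration
	wakeups            int
}

func newInstrumentedBuf() *instrumentedBuf {
	b := &instrumentedBuf{}
	b.cond = sync.NewCond(&b.mu)
	return b
}

func (b *instrumentedBuf) push(cmd string) {
	b.mu.Lock()
	b.cmds = append(b.cmds, timedCmd{cmd: cmd, pushedAt: time.Now()})
	b.mu.Unlock()
	b.cond.Signal()
}

func (b *instrumentedBuf) curCmd() string {
	b.mu.Lock()
	defer b.mu.Unlock()
	for len(b.cmds) == 0 {
		b.cond.Wait()
	}
	tc := b.cmds[0]
	b.cmds = b.cmds[1:]
	// Time from the command becoming available to the consumer observing it.
	b.totalWakeupLatency += time.Since(tc.pushedAt)
	b.wakeups++
	return tc.cmd
}
```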
I've used `iftop` in the past to look at network stats on the machine.
> Is the worker machine overloaded or at capacity?
No, at the higher concurrency the cpu utilization actually drops because of the reduced throughput.
> Is there a concurrency issue between reading a statement from the wire and notifying the goroutine waiting in `StmtBuf.curCmd`?
That's possible, although it would surprise me if such an issue reliably only showed up above a very specific concurrency threshold.
> I've used iftop in the past to look at network stats on the machine.
Thanks for the tip!
I spent a bit more time looking at CPU profiles at concurrency levels beneath this cliff and concurrency levels above it. One thing that jumped out was the increase in time spent in `runtime.schedule` -> `runtime.findrunnable`. Below the cliff this accounts for 5-6% of the profile; above it, that number doubles to around 11%. This is pretty significant, although I don't know exactly what to make of it right now. We expect the number of goroutines to grow proportionally with the number of SQL connections, but `runtime.findrunnable` doubling due to a ~20% increase in SQL connections seems awfully suspicious.
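As a side note on methodology (a sketch under assumptions, not necessarily how the profiles here were gathered): comparable fixed-length CPU profiles can be pulled from a node and then diffed with pprof's `-base` flag. This assumes the node exposes the standard `net/http/pprof` endpoints on its HTTP port; the address and filenames below are placeholders.

```go
// Sketch: fetch a 30s CPU profile so that two runs (one per concurrency
// level) can be compared with `go tool pprof -base cpu_400.prof cpu_4000.prof`.
// Assumes the node serves the standard net/http/pprof endpoints on its HTTP
// port; the address below is a placeholder.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func fetchCPUProfile(hostport, outFile string) error {
	url := fmt.Sprintf("http://%s/debug/pprof/profile?seconds=30", hostport)
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	out, err := os.Create(outFile)
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, resp.Body)
	return err
}

func main() {
	// Run once while the workload is at concurrency=400 (cpu_400.prof) and
	// once at concurrency=4000 (cpu_4000.prof).
	if err := fetchCPUProfile("localhost:8080", "cpu_400.prof"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```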
On a whim, I began tweaking `StmtBuf` to see if we were on to anything with the blocking in `StmtBuf.curCmd`. `StmtBuf` currently uses a slice that's shared between the producer (pgwire conn goroutine) and consumer (executor goroutine). It uses a condition variable to signal updates to the buffer and coordinate between the two goroutines. One theory I had was that this cond var may be sub-optimal in terms of quickly preempting the producer and scheduling the consumer whenever it is signalled. I ran an experiment where I switched the `StmtBuf` to use a buffered channel instead of the condition variable: https://github.com/nvanbenschoten/cockroach/commit/74085391b10d1e0e48ad9f2814ad59d8825bc1a4. I like the change as it both simplifies the code and comes off as more idiomatic, but unfortunately it didn't actually have any impact on performance. The blocking contribution due to `StmtBuf.curCmd`'s `Cond.Wait` call was simply replaced by roughly the same contribution from `StmtBuf.curCmd`'s new `runtime.selectgo` call. It's possible that making this an unbuffered channel would have an effect, but that wasn't as easy a change.
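Roughly the shape of that experiment, as a simplified sketch (not the linked commit itself): the producer sends on a channel, and the consumer receives in a `select`, which is why `runtime.selectgo` replaces `Cond.Wait` in the blocking profile.

```go
// Simplified sketch of the buffered-channel variant described above, not the
// actual commit: the pgwire goroutine sends commands on a channel and the
// executor goroutine selects on it.
package stmtbufsketch

type chanCmdBuf struct {
	cmds chan string
}

func newChanCmdBuf() *chanCmdBuf {
	// A buffered channel keeps the producer from blocking on every send; an
	// unbuffered channel would force a direct handoff instead.
	return &chanCmdBuf{cmds: make(chan string, 16)}
}

func (b *chanCmdBuf) push(cmd string) {
	b.cmds <- cmd
}

func (b *chanCmdBuf) curCmd(done <-chan struct{}) (string, bool) {
	select {
	case cmd := <-b.cmds:
		return cmd, true
	case <-done:
		// A second case like this (connection teardown or similar) is what
		// turns the receive into a select; the exact shape of the real
		// change differs.
		return "", false
	}
}
```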
I also tested calling `runtime.Gosched` immediately after signalling the condition variable, but that again had no effect.
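That variant looked roughly like this (a sketch of the idea, extending the `cmdBuf` sketch above and requiring `import "runtime"`; not the exact diff):

```go
// Sketch of the Gosched experiment: signal the waiting executor goroutine and
// then yield, hoping the scheduler runs the consumer immediately.
func (b *cmdBuf) pushAndYield(cmd string) {
	b.mu.Lock()
	b.cmds = append(b.cmds, cmd)
	b.mu.Unlock()
	b.cond.Signal()
	runtime.Gosched() // give up the producer's time slice right after signalling
}
```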
I'm interested to see the results of @m-schneider's investigation into whether this is new since the `connExecutor` refactor. That will give us some indication of whether this is a scheduling issue at the pgwire/connExecutor boundary or whether it might be a scheduling issue at the network/pgwire boundary.
One other interesting experiment would be short-circuiting everything. If you hacked up your cockroach binary to return “no results” for every external query instantly, you’d be able to zero in on whether this is a network problem, load generator problem, or actual Cockroach problem.
This seems somewhat related to https://github.com/golang/go/issues/18237 and https://groups.google.com/forum/#!topic/golang-nuts/6zKXeCoT2LM. My main takeaway from this so far is that having goroutine coordination on the hot path of SQL statement execution is unfortunate. I wonder how hard it would be to rip out the second goroutine for the sake of experimentation.
cc. @andreimatei.
> One other interesting experiment would be short-circuiting everything. If you hacked up your cockroach binary to return “no results” for every external query instantly, you’d be able to zero in on whether this is a network problem, load generator problem, or actual Cockroach problem.
That's an interesting idea which is worth exploring.
cc @andy-kimball as he was also looking at the impact of that goroutine bounce in the executor a while ago.
> One other interesting experiment would be short-circuiting everything. If you hacked up your cockroach binary to return “no results” for every external query instantly, you’d be able to zero in on whether this is a network problem, load generator problem, or actual Cockroach problem.

Short-circuiting everything can prevent the load generator from running (i.e. you need `CREATE TABLE` to succeed). In the past what I had to do was put in a simple textual recognition of the load generator query and only short-circuit execution of that query. It was super interesting to see how fast pgwire worked. Definitely worthwhile to do this again with the new conn executor infrastructure.
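A hypothetical sketch of that textual-recognition trick (not actual CockroachDB code; the query prefix and the `sendEmptyResult` hook are assumptions): match the load generator's read statement by its text near the top of statement execution and reply with an empty result, so everything below pgwire drops out of the measurement.

```go
// Hypothetical sketch, not actual CockroachDB code: short-circuit only the
// load generator's query so setup statements still execute normally.
package shortcircuit

import "strings"

func maybeShortCircuit(query string, sendEmptyResult func() error) (handled bool, err error) {
	// The prefix below is an assumption about the kv workload's read
	// statement and would need to match the real query text.
	if strings.HasPrefix(query, "SELECT k, v FROM kv WHERE k IN") {
		return true, sendEmptyResult()
	}
	// Everything else (CREATE TABLE, splits, etc.) falls through to normal
	// execution, so the load generator can still set itself up.
	return false, nil
}
```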
After running the test on older versions of cockroach, it looks like the cliff already existed as early as the Dec 18th alpha. However, even though overall throughput was lower in the October alpha, there doesn't seem to be an actual cliff there.
Will keep bisecting.
After git bisecting, the improvement in throughput can be attributed to https://github.com/cockroachdb/cockroach/pull/20232/commits/bcbde0291fa1df8ace01241ed081429db04cda6a. We're currently adding metrics to see if the new data structure could also be causing the cliff.
After instrumenting the timestamp cache, there doesn't seem to be any increase in latency coming from the skiplist implementation with increased concurrency.
I spent some time looking into this after observing poor throughput scaling in sysbench as concurrency grew.
I was able to reproduce the behavior observed here by spinning up a 16-node cluster with n1-highcpu-16 machines. Instead of using `roachprod test`, I ran `roachprod run <cluster>:17 -- ./workload run kv --init --read-percent=100 --write-seq=1000000 --splits=1000 --concurrency=<concurrency> --duration=5m {pgurl:1-16}` directly, and alternated between a concurrency of 400 and 4000. When concurrency was at 400, throughput hovered around 200k. When concurrency was at 4000, throughput dropped to 120k. Future experimentation here should use this approach, as it makes it much easier to switch back and forth between the good and bad performance regimes.
I started by looking at performance profiles. The CPU profile showed very little difference between the two runs. Neither did the heap profile. Blocking profiles showed extra blocking in `StmtBuf.curCmd`, as we observed above. Mutex profiles were inconsistent and I didn't get any real insight out of them.
I then turned to the Go execution tracer. This is where things got interesting. The first execution trace is with `--concurrency=400`. The second is with `--concurrency=4000`.
(Execution trace screenshots: concurrency=400 and concurrency=4000.)
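For reference, a trace like these can be collected with `runtime/trace` (a sketch, assuming it can be run from a test or debug hook while the workload is applying load); the standard `net/http/pprof` handler also serves `/debug/pprof/trace?seconds=N` if it is exposed.

```go
// Sketch of collecting a Go execution trace over a short window and then
// inspecting it with `go tool trace trace.out`.
package main

import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	// Trace a short window while the workload is running.
	time.Sleep(5 * time.Second)
	trace.Stop()
}
```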
There are a number of things that jump out from these traces. Out of the two, it's pretty clear that the qualities of the first are more conducive to high throughput. So what gives? Why the degraded processor utilization in the second trace?
I turned to the execution trace's built-in profiles to try to answer this. My first stop was the network blocking profile. This didn't provide too much insight. Both profiles showed 100% of network blocking in `netFD.Read`. There was one difference though. In the good profile (conc=400), 93% of this blocking was attributed to the pgwire connection while 7% was attributed to the inter-node gRPC connection. In the bad profile (conc=4000), 100% was attributed to the pgwire connection.
My next stop was the execution trace's scheduler latency profile. This was more interesting. The good profile attributed 66% of scheduler latency to Mutex unlocking. The two main callers of this were `Clock.Now` (37%) and `Clock.PhysicalNow` (20%, avoidable, see #30520). The rest of the latency was attributed to `runtime.netpolldeadlineimpl`'s call to `runtime.ready`. Things were completely different in the bad profile. Over 50% of the latency was attributed to `runtime.selectnbsend`, whose primary caller was `controlBuffer.executeAndPut`. If we focus in on this function, we see that its main caller (~70% of calls) is `http2Client.NewStream`, which is called exclusively by `roachpb/internalClient.Batch`. I went back to the good profile and found that `selectnbsend` is only attributed with 0.032% of scheduler latency.
The differences here are stark and I think it's safe to conclude that this is contributing to the reduced throughput with higher client concurrency. The question now is what to do about it. I still need to look more at this to understand why we only see the issue with a higher concurrency, but the first thing that comes to mind is that we create a separate gRPC stream for each `internalClient.Batch` call. This is something @a-robinson and @petermattis have considered in the past, so I'm curious if they have any insight here.
Another interesting thing to note is that we can see in `controlBuffer.executeAndPut` that it only sends on `controlBuffer.ch` if `consumerWaiting == true`. I think we might be hitting some inflection point where consumers start waiting on the buffer, which causes things to slow down even further. This would explain the cliff-like behavior we see in baseline throughput. That's just speculation though.
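Reduced to its core, the pattern being described looks roughly like this (a paraphrase of the idea, not gRPC's actual code): the producer only touches the wakeup channel once a consumer has parked itself, which is why `selectnbsend` only dominates once consumers start waiting.

```go
// Reduced sketch of the consumerWaiting pattern described above, not gRPC's
// actual controlBuffer code.
package cbsketch

import "sync"

type itemBuffer struct {
	mu              sync.Mutex
	items           []interface{}
	consumerWaiting bool
	ch              chan struct{} // capacity 1, used purely as a wakeup signal
}

func newItemBuffer() *itemBuffer {
	return &itemBuffer{ch: make(chan struct{}, 1)}
}

func (b *itemBuffer) put(it interface{}) {
	b.mu.Lock()
	b.items = append(b.items, it)
	wakeup := b.consumerWaiting
	b.consumerWaiting = false
	b.mu.Unlock()
	if wakeup {
		// Non-blocking send: this is the kind of call that shows up as
		// runtime.selectnbsend in the scheduler-latency profile.
		select {
		case b.ch <- struct{}{}:
		default:
		}
	}
}

func (b *itemBuffer) get() interface{} {
	for {
		b.mu.Lock()
		if len(b.items) > 0 {
			it := b.items[0]
			b.items = b.items[1:]
			b.mu.Unlock()
			return it
		}
		b.consumerWaiting = true
		b.mu.Unlock()
		<-b.ch // park until a producer wakes us, then re-check the buffer
	}
}
```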
I've included the two scheduler profiles here: scheduler_profiles.zip.
My next step is to learn how all this `controlBuffer` stuff works and how it interacts with gRPC's `loopyWriter`.
I believe @a-robinson investigated having a pool of streams for `internalClient.Batch` calls, but there was some difficulty in properly handling context cancellation or something like that. I think this is documented somewhere, though I can't find it right now.
The `loopyWriter` stuff is all new since the last time I looked at gRPC. The signaling in `controlBuffer.executeAndPut` and `controlBuffer.get` is interesting. They seem to be using `controlBuffer.ch` as a condition variable. I wonder if there would be a benefit to actually using a condition variable here. You re-did the benchmark recently showing a condition variable to be faster than channels, right?
> They seem to be using `controlBuffer.ch` as a condition variable. I wonder if there would be a benefit to actually using a condition variable here. You re-did the benchmark recently showing a condition variable to be faster than channels, right?
I looked into this and found that they can't easily use a cond variable there because they also want to wait on a context cancellation. Without adjusting other layers to catch the context cancellation and without https://github.com/golang/go/issues/16620, there's not an easy path to making the replacement.
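To make the constraint concrete, here's how a context-aware wait falls out of the channel (extending the `itemBuffer` sketch above; still a paraphrase, not gRPC's actual code): a `sync.Cond` cannot appear in a `select`, so waiting on both a wakeup and `ctx.Done()` more or less forces a channel.

```go
// Sketch of why the wakeup has to be a channel: the consumer needs to give up
// waiting when its context is canceled, and a sync.Cond offers no way to
// select on that.
package cbsketch

import "context"

func (b *itemBuffer) getCtx(ctx context.Context) (interface{}, error) {
	for {
		b.mu.Lock()
		if len(b.items) > 0 {
			it := b.items[0]
			b.items = b.items[1:]
			b.mu.Unlock()
			return it, nil
		}
		b.consumerWaiting = true
		b.mu.Unlock()
		select {
		case <-b.ch:
			// Woken by a producer; loop and re-check the buffer.
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
}
```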
But that doesn't matter now because... here's the fix: #30811.
To wrap this all up, it turns out that CPU utilization was low because the client was bottlenecking itself. There wasn't anything going wrong in `cockroach` itself; it just wasn't getting a lot of traffic.
I began suspecting the client after running Cockroach with `GODEBUG=schedtrace=1000` and comparing the scheduler traces at the two concurrency levels. For instance:
concurrency=400:

```
SCHED 58482ms: gomaxprocs=16 idleprocs=0 threads=47 spinningthreads=3 idlethreads=9 runqueue=1 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
```

concurrency=4000:

```
SCHED 21150ms: gomaxprocs=16 idleprocs=13 threads=43 spinningthreads=0 idlethreads=17 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
```
Combined with the execution traces above, it became apparent that goroutines weren't taking a particularly long time to be scheduled; there just weren't many goroutines to schedule. (Note `idleprocs=13` out of `gomaxprocs=16` and the empty run queues in the concurrency=4000 line: most processors had nothing to run.)
Good job. Lots of prime lunch and learn material, too.
@petermattis, from Slack:
(We expect the throughput to smoothly level off while the latency increases.)