cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

perf: kv falls off a cliff at ~12 nodes #26178

Closed benesch closed 6 years ago

benesch commented 6 years ago
[screenshot: throughput chart, 2018-03-28]

@petermattis, from Slack:

Yesterday I discovered that kv performance falls off a cliff somewhere between 12 and 16 nodes in a cluster. Doing some more testing overnight, I produced the chart above.

Running a similar test on a 6-node cluster produces a smooth throughput curve.

Similar increases in concurrency on smaller clusters don't show the same behavior, which implies this isn't a number-of-goroutines/conns problem. Possibly network.

(We expect the throughput to smoothly level off while the latency increases.)

benesch commented 6 years ago

Generation instructions:

$ roachprod test benesch-test-kv --binary ./cockroach kv_95 -c 1-64/64
$ roachprod web kv_95.cockroach*

For a smoother graph, turn the concurrency to 1-64/16 (the default).

I'm using normal roachprod VMs and have yet to repro the badness.

[screenshot: throughput graph, 2018-06-04]

benesch commented 6 years ago

n1-highcpu-16 did the trick. Consider this reproduced.

[screenshot: throughput graph, 2018-06-04]

nvanbenschoten commented 6 years ago

@m-schneider and I have our first lead. All of the graphs on the admin UI and the CPU and heap profiles looked fine. Everything was either equal or actually larger in the run at the concurrency below the cliff (scaling exactly with the delta in throughput). The blocking profiles also looked similar until we ignored blocking in Go select blocks. Then we saw this:

600 concurrency

[screenshot: blocking profile flame graph, 2018-08-28]

700 concurrency

[screenshot: blocking profile flame graph, 2018-08-28]

On the very left of the 700-concurrency flame graph we can see that 4.22% of the blocking is in StmtBuf.curCmd. This same method is responsible for only 0.14% of blocking in the 600-concurrency case. StmtBuf.curCmd is called by a SQL connection's connExecutor goroutine when it is ready for another statement. The method waits for the corresponding pgwire goroutine to populate the buffer with a new statement. I think this indicates some slowdown on the workload/gateway connection. It's possible that we're hitting a connection limit somewhere in the network stack that's causing some slowdown there. It's also possible that this is a slowdown in the kv workload itself. We're going to continue digging in there.
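For context, the hand-off described here looks roughly like the following standalone sketch (simplified, hypothetical names; the real StmtBuf lives in pkg/sql and handles many more cases):

package main

import (
    "fmt"
    "sync"
)

// cmdBuf is a simplified stand-in for the described StmtBuf: a slice shared
// between the pgwire goroutine (producer) and the connExecutor goroutine
// (consumer), coordinated with a condition variable.
type cmdBuf struct {
    mu   sync.Mutex
    cond *sync.Cond
    cmds []string
    next int
}

func newCmdBuf() *cmdBuf {
    b := &cmdBuf{}
    b.cond = sync.NewCond(&b.mu)
    return b
}

// push is called by the pgwire goroutine once it has read and parsed a
// statement off the wire.
func (b *cmdBuf) push(stmt string) {
    b.mu.Lock()
    b.cmds = append(b.cmds, stmt)
    b.mu.Unlock()
    b.cond.Signal()
}

// curCmd is called by the executor goroutine when it is ready for another
// statement; it blocks until the producer has populated the buffer. This wait
// is where the blocking profile above points.
func (b *cmdBuf) curCmd() string {
    b.mu.Lock()
    defer b.mu.Unlock()
    for b.next >= len(b.cmds) {
        b.cond.Wait()
    }
    stmt := b.cmds[b.next]
    b.next++
    return stmt
}

func main() {
    buf := newCmdBuf()
    go buf.push("SELECT 1")   // pgwire goroutine
    fmt.Println(buf.curCmd()) // executor goroutine
}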

@m-schneider is also going to run the same experiment on AWS to see if she's able to reproduce the results.

m-schneider commented 6 years ago

After taking another look, this looks very reproducible on a 16-node n1-highcpu-16 cluster. After seeing the same behavior as @benesch, I ran a version of the test that incremented concurrency a bit more slowly and found that the drop-off is generally around a concurrency of 650: [chart]

Then I tried the same test with a 24-node cluster and also saw a very similar drop-off, though marginally later: [chart]

nvanbenschoten commented 6 years ago

I think this indicates some slowdown on the workload/gateway connection. It's possible that we're hitting a connection limit somewhere in the network stack that's causing some slowdown there.

@a-robinson you have a lot of knowledge about networking. I'm interested in how you'd recommend investigating something like this.

petermattis commented 6 years ago

Is the worker machine overloaded or at capacity? Is there a concurrency issue between reading a statement from the wire and notifying the goroutine waiting in StmtBuf.curCmd? We could add some instrumentation to check what that latency is.
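For the latency question, one hypothetical way to measure it (none of these names are the real StmtBuf API; this is a standalone sketch) is to stamp each command when the pgwire goroutine enqueues it and record the delay observed when the executor picks it up:

package main

import (
    "fmt"
    "time"
)

// timedCmd pairs a statement with the time the producer enqueued it, so the
// consumer can report how long the hand-off took.
type timedCmd struct {
    stmt     string
    enqueued time.Time
}

func main() {
    buf := make(chan timedCmd, 16)

    // Producer: stands in for the pgwire goroutine pushing a parsed statement.
    go func() {
        buf <- timedCmd{stmt: "SELECT 1", enqueued: time.Now()}
    }()

    // Consumer: stands in for the executor goroutine waiting for the next command.
    cmd := <-buf
    fmt.Printf("%q waited %s between enqueue and pickup\n", cmd.stmt, time.Since(cmd.enqueued))
}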

I've used iftop in the past to look at network stats on the machine.

nvanbenschoten commented 6 years ago

Is the worker machine overloaded or at capacity?

No, at the higher concurrency the CPU utilization actually drops because of the reduced throughput.

Is there a concurrency issue between reading a statement from the wire and notifying the goroutine waiting in StmtBuf.curCmd?

That's possible, although it would surprise me if such an issue reliably only showed up above a very specific concurrency threshold.

I've used iftop in the past to look at network stats on the machine.

Thanks for the tip!

nvanbenschoten commented 6 years ago

I spent a bit more time looking at CPU profiles at concurrency levels beneath this cliff and at concurrency levels above it. One thing that jumped out was the increase in time spent in runtime.schedule->runtime.findrunnable. Below the cliff this accounts for 5-6% of the profile; above it, that number doubles to around 11%. This is pretty significant, although I don't know exactly what to make of it right now. We expect the number of goroutines to grow proportionally with the number of SQL connections, but runtime.findrunnable doubling due to a ~20% increase in SQL connections seems awfully suspicious.

On a whim, I began tweaking StmtBuf to see if we were on to anything with the blocking in StmtBuf.curCmd. StmtBuf currently uses a slice that's shared between the producer (pgwire conn goroutine) and consumer (executor goroutine). It uses a condition variable to signal updates to the buffer and coordinate between the two goroutines. One theory I had was that this cond var may be sub-optimal in terms of quickly preempting the producer and scheduling the consumer whenever it is signalled. I ran an experiment where I switched the StmtBuf to use a buffered channel instead of the condition variable: https://github.com/nvanbenschoten/cockroach/commit/74085391b10d1e0e48ad9f2814ad59d8825bc1a4. I like the change as it both simplifies the code and comes off as more idiomatic, but unfortunately it didn't actually have any impact on performance. The blocking contribution due to StmtBuf.curCmd's Cond.Wait call was simply replaced by roughly the same contribution from StmtBuf.curCmd's new runtime.selectgo call. It's possible that making this an unbuffered channel would have an effect, but that wasn't as easy of a change.
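The shape of that experiment, as a standalone sketch (the commit linked above is the actual change; the code below only illustrates the pattern, with hypothetical names):

package main

import "fmt"

// bufferedCmdBuf sketches the buffered-channel variant: the pgwire goroutine
// sends commands into the channel and the executor goroutine receives them,
// so the runtime handles the wakeup instead of an explicit sync.Cond.
type bufferedCmdBuf struct {
    cmds chan string
}

func newBufferedCmdBuf() *bufferedCmdBuf {
    // A small fixed capacity; the real experiment's sizing isn't shown here.
    return &bufferedCmdBuf{cmds: make(chan string, 16)}
}

func (b *bufferedCmdBuf) push(stmt string) {
    b.cmds <- stmt // blocks only if the buffer is full
}

func (b *bufferedCmdBuf) curCmd() string {
    // Blocking moves from Cond.Wait to a channel receive (in the real change,
    // a select, which is why runtime.selectgo shows up in the profile).
    return <-b.cmds
}

func main() {
    buf := newBufferedCmdBuf()
    go buf.push("SELECT 1")
    fmt.Println(buf.curCmd())
}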

I also tested with calling runtime.Gosched immediately after signalling the condition variable, but that again had no effect.

I'm interested to see the results of @m-schneider's investigation into whether this is new since the connExecutor refactor. This will give us some indication of whether this is a scheduling issue at the pgwire/connExecutor boundary or whether it might be a scheduling issue at the network/pgwire boundary.

benesch commented 6 years ago

One other interesting experiment would be short-circuiting everything. If you hacked up your cockroach binary to return “no results” for every external query instantly, you’d be able to zero in on whether this is a network problem, load generator problem, or actual Cockroach problem.


nvanbenschoten commented 6 years ago

This seems somewhat related to https://github.com/golang/go/issues/18237 and https://groups.google.com/forum/#!topic/golang-nuts/6zKXeCoT2LM. My main takeaway from this so far is that having goroutine coordination on the hot path of SQL statement execution is unfortunate. I wonder how hard it would be to rip out the second goroutine for the sake of experimentation.

cc. @andreimatei.

nvanbenschoten commented 6 years ago

One other interesting experiment would be short-circuiting everything. If you hacked up your cockroach binary to return “no results” for every external query instantly, you’d be able to zero in on whether this is a network problem, load generator problem, or actual Cockroach problem.

That's an interesting idea which is worth exploring.

andreimatei commented 6 years ago

cc @andy-kimball as he was also looking at the impact of that goroutine bounce in the executor a while ago.

petermattis commented 6 years ago

One other interesting experiment would be short-circuiting everything. If you hacked up your cockroach binary to return “no results” for every external query instantly, you’d be able to zero in on whether this is a network problem, load generator problem, or actual Cockroach problem.

Short-circuiting everything can prevent the load generator from running (i.e. you need CREATE TABLE to succeed). In the past, what I had to do was add simple textual recognition of the load generator's query and short-circuit execution of only that query. It was super interesting to see how fast pgwire worked. Definitely worthwhile to do this again with the new connExecutor infrastructure.
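A hedged sketch of what that hack could look like (the function name, call site, and query text below are all guesses for illustration, not the real connExecutor code or the exact kv workload statement):

package main

import (
    "fmt"
    "strings"
)

// maybeShortCircuit recognizes the load generator's query by its text and
// reports that it should be answered with a canned empty result before any
// real execution happens, while letting everything else (CREATE TABLE, etc.)
// run normally. In a real experiment this check would be spliced into the
// statement-execution path.
func maybeShortCircuit(stmt string) (handled bool) {
    // A guess at what the kv workload's read statement looks like; the exact
    // text would need to be copied from the workload being tested.
    if strings.HasPrefix(stmt, "SELECT k, v FROM kv WHERE k IN") {
        return true
    }
    return false
}

func main() {
    for _, stmt := range []string{
        "CREATE TABLE kv (k BIGINT PRIMARY KEY, v BYTES)",
        "SELECT k, v FROM kv WHERE k IN ($1)",
    } {
        if maybeShortCircuit(stmt) {
            fmt.Printf("short-circuited: %s\n", stmt)
        } else {
            fmt.Printf("executed normally: %s\n", stmt)
        }
    }
}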

m-schneider commented 6 years ago

After running the test on older versions of cockroach, it looks like the cliff already existed as early as the Dec 18th alpha: [chart]. However, even though overall throughput was lower in the October alpha, there doesn't seem to be an actual cliff there: [chart]

Will keep bisecting.

m-schneider commented 6 years ago

After git bisecting, the improvement in throughput can be attributed to https://github.com/cockroachdb/cockroach/pull/20232/commits/bcbde0291fa1df8ace01241ed081429db04cda6a. We're currently adding metrics to see if the new data structure could also be causing the cliff.

m-schneider commented 6 years ago

After instrumenting the time series cache, there doesn't seem to be any increase in latency coming from the skiplist implementation with increased concurrency:

[chart]

nvanbenschoten commented 6 years ago

I spent some time looking into this after observing poor throughput scaling in sysbench as concurrency grew.

I was able to reproduce the behavior observed here by spinning up a 16 node cluster with n1-highcpu-16 machines. Instead of using roachprod test, I ran roachprod run <cluster>:17 -- ./workload run kv --init --read-percent=100 --write-seq=1000000 --splits=1000 --concurrency=<concurrency> --duration=5m {pgurl:1-16} directly, and alternated between a concurrency of 400 and 4000. When concurrency was at 400, throughput hovered around 200k. When concurrency was at 4000, throughput dropped to 120k. Future experimentation here should use this approach, as it makes it much easier to switch back and forth between the good and bad performance regimes.

I started by looking at performance profiles. The CPU profile showed very little difference between the two runs. Neither did the heap profile. Blocking profiles showed extra blocking in StmtBuf.curCmd, as we observed above. Mutex profiles were inconsistent and I didn't get any real insight out of them.

I then turned to the Go execution tracer. This is where things got interesting. The first execution trace is with --concurrency=400. The second is with --concurrency=4000.

[screenshot: execution trace, 2018-09-30]

concurrency=400

[screenshot: execution trace, 2018-09-30]

concurrency=4000
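For reference, execution traces like the two above can be captured from any Go process that imports net/http/pprof; a minimal standalone sketch (the port is arbitrary, and this is not cockroach's actual wiring):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/, including /debug/pprof/trace
)

func main() {
    // With this listener running, a 5-second execution trace can be pulled
    // and inspected with the standard tooling, e.g.:
    //   curl -o trace.out 'http://localhost:6060/debug/pprof/trace?seconds=5'
    //   go tool trace trace.out
    log.Fatal(http.ListenAndServe("localhost:6060", nil))
}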

There are a number of things that jump out from these traces:

Out of these two traces, it's pretty clear that the qualities of the first are more conducive to high throughput. So what gives? Why the degraded processor utilization in the second trace?

I turned to the execution trace's built-in profiles to try to answer this. My first stop was the network blocking profile. This didn't provide too much insight. Both profiles showed 100% of network blocking in netFD.Read. There was one difference though. In the good profile (conc=400), 93% of this blocking was attributed to the pgwire connection while 7% was attributed to the inter-node gRPC connection. In the bad profile (conc=4000), 100% was attributed to the pgwire connection.

My next stop was the execution trace's scheduler latency profile. This was more interesting. The good profile attributed 66% of scheduler latency to Mutex unlocking. The two main callers of this were Clock.Now (37%) and Clock.PhysicalNow (20%, avoidable, see #30520). The rest of the latency was attributed to runtime.netpolldeadlineimpl's call to runtime.ready. Things were completely different in the bad profile. Over 50% of the latency was attributed to runtime.selectnbsend, whose primary caller was controlBuffer.executeAndPut. If we focus in on this function, we see that its main caller (~70% of calls) is http2Client.NewStream, which is called exclusively by roachpb/internalClient.Batch. I went back to the good profile and found that selectnbsend is attributed with only 0.032% of scheduler latency.

The differences here are stark and I think it's safe to conclude that this is contributing to the reduced throughput with higher client concurrency. The question now is what to do about it. I still need to look more at this to understand why we only see the issue with a higher concurrency, but the first thing that comes to mind is that we create a separate gRPC stream for each internalClient.Batch call. This is something @a-robinson and @petermattis have considered in the past, so I'm curious if they have any insight here.

Another interesting thing to note is that we can see in controlBuffer.executeAndPut that it only sends on controlBuffer.ch if consumerWaiting == true. I think we might be hitting some inflection point where consumers start waiting on the buffer, which causes things to slow down even further. This would explain the cliff-like behavior we see in baseline throughput. That's just speculation though.
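A simplified sketch of that signalling pattern (illustrative only, not grpc-go's actual controlBuffer), showing where the non-blocking send that surfaces as runtime.selectnbsend lives:

package main

import (
    "fmt"
    "sync"
)

// controlBuf sketches the pattern described above: items are appended under a
// mutex, and the producer only signals ch when a consumer has registered that
// it is waiting.
type controlBuf struct {
    mu              sync.Mutex
    items           []string
    consumerWaiting bool
    ch              chan struct{} // capacity 1, used as a wakeup signal
}

func newControlBuf() *controlBuf {
    return &controlBuf{ch: make(chan struct{}, 1)}
}

// put appends an item; the non-blocking send here is what compiles down to
// runtime.selectnbsend in the scheduler-latency profile.
func (c *controlBuf) put(it string) {
    c.mu.Lock()
    wakeUp := c.consumerWaiting
    c.consumerWaiting = false
    c.items = append(c.items, it)
    c.mu.Unlock()
    if wakeUp {
        select {
        case c.ch <- struct{}{}:
        default: // already signalled; don't block
        }
    }
}

// get pops an item, registering as a waiting consumer and blocking on ch when
// the buffer is empty.
func (c *controlBuf) get() string {
    for {
        c.mu.Lock()
        if len(c.items) > 0 {
            it := c.items[0]
            c.items = c.items[1:]
            c.mu.Unlock()
            return it
        }
        c.consumerWaiting = true
        c.mu.Unlock()
        <-c.ch
    }
}

func main() {
    cb := newControlBuf()
    go cb.put("frame for a new stream") // e.g. on behalf of http2Client.NewStream
    fmt.Println(cb.get())
}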

I've included the two scheduler profiles here: scheduler_profiles.zip.

nvanbenschoten commented 6 years ago

My next step is to learn how all this controlBuffer stuff works and how it interacts with gRPC's loopyWriter.

petermattis commented 6 years ago

I believe @a-robinson investigated having a pool of streams for internalClient.Batch calls, but there is some difficulty in properly handling context cancellation or something like that. I think this is documented somewhere, though I can't find it right now.

The loopyWriter stuff is all new since the last time I looked at gRPC. The signaling in controlBuffer.executeAndPut and controlBuffer.get is interesting. They seem to be using controlBuffer.ch as a condition variable. I wonder if there would be a benefit to actually using a condition variable here. You re-did the benchmark recently showing a condition variable to be faster than channels, right?

nvanbenschoten commented 6 years ago

They seem to be using controlBuffer.ch as a condition variable. I wonder if there would be a benefit to actually using a condition variable here. You re-did the benchmark recently showing a condition variable to be faster than channels, right?

I looked into this and found that they can't easily use a cond variable there because they also want to wait on a context cancellation. Without adjusting other layers to catch the context cancellation and without https://github.com/golang/go/issues/16620, there's not an easy path to making the replacement.
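A minimal illustration of the constraint (standalone, not gRPC code): a select can wait on the wakeup channel and on context cancellation at the same time, which a bare sync.Cond.Wait cannot do:

package main

import (
    "context"
    "fmt"
    "time"
)

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
    defer cancel()

    wakeup := make(chan struct{}, 1)

    select {
    case <-wakeup:
        fmt.Println("woken by producer")
    case <-ctx.Done():
        // With a sync.Cond there is no equivalent arm; the waiter would stay
        // blocked until some other goroutine called Signal or Broadcast.
        fmt.Println("gave up:", ctx.Err())
    }
}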

But that doesn't matter now because... here's the fix: #30811.

nvanbenschoten commented 6 years ago

To wrap this all up, it turns out that CPU utilization was low because the client was bottlenecking itself. There wasn't anything going wrong in cockroach itself; it just wasn't getting a lot of traffic.

I began suspecting the client after running Cockroach with GODEBUG=schedtrace=1000 and looking at the scheduler traces. For instance:

concurrency=400

SCHED 58482ms: gomaxprocs=16 idleprocs=0 threads=47 spinningthreads=3 idlethreads=9 runqueue=1 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

concurrency=4000

SCHED 21150ms: gomaxprocs=16 idleprocs=13 threads=43 spinningthreads=0 idlethreads=17 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

Combined with the execution traces above, it became apparent that goroutines weren't taking a particularly long time to be scheduled; there just weren't many goroutines to schedule (note idleprocs=0 at concurrency=400 versus idleprocs=13 at concurrency=4000).

tbg commented 6 years ago

Good job. Lots of prime lunch and learn material, too.