> or perhaps even being more aggressive in terms of streaming by doing that with more concurrency (it's currently done in serial).
I feel like this will be the biggest win in the long run.
However, zstd looks like a quick win because (I assume) it won't require a lot of changes.
We're looking to get parallel streaming in along with zstd. Both are quick wins, but gzip uses too much CPU, which could overload the workers if we only did parallel streaming.
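To make that concrete, here's a minimal sketch of what switching the streaming path from gzip to zstd could look like in Go, assuming a library like github.com/klauspost/compress/zstd; `compressStream` and its wiring are hypothetical, not Concourse's actual code:

```go
package streaming

import (
	"compress/gzip"
	"io"

	"github.com/klauspost/compress/zstd"
)

// compressStream wraps dst with the chosen compressor and copies src
// through it, so bytes are compressed as they stream rather than
// being buffered up front.
func compressStream(dst io.Writer, src io.Reader, useZstd bool) error {
	var w io.WriteCloser
	if useZstd {
		zw, err := zstd.NewWriter(dst)
		if err != nil {
			return err
		}
		w = zw
	} else {
		w = gzip.NewWriter(dst)
	}
	if _, err := io.Copy(w, src); err != nil {
		w.Close()
		return err
	}
	return w.Close() // flush any remaining compressed frames
}
```

zstd typically reaches a gzip-like compression ratio at a fraction of the CPU cost per byte, which is why it pairs well with more concurrency.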
The fact that RAID 0 doesn't seem to be helping is really strange to me. I don't know what "software raid" is, but having double the physical wires able to push data through should make plain writes faster. We should run the same test with RAID and measure the write speeds without Concourse; a sketch of such a measurement follows below. If we meet the GCP promise, it's highly likely that our streaming really is the big bottleneck.
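To ground that test, a minimal sketch of measuring raw sequential write throughput with no Concourse (and no network) in the path; the mount point and sizes here are illustrative:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// Write a few GiB of zeroes sequentially and time it, so the raw disk
// numbers can be compared against GCP's advertised rates directly.
func main() {
	const chunk = 4 << 20 // 4 MiB per write
	const total = 4 << 30 // 4 GiB overall
	buf := make([]byte, chunk)

	f, err := os.Create("/mnt/raid0/throughput-test") // illustrative mount point
	if err != nil {
		panic(err)
	}
	defer f.Close()

	start := time.Now()
	for written := int64(0); written < total; written += chunk {
		if _, err := f.Write(buf); err != nil {
			panic(err)
		}
	}
	if err := f.Sync(); err != nil { // flush to disk so the timing is honest
		panic(err)
	}
	elapsed := time.Since(start).Seconds()
	fmt.Printf("%.0f MB/s\n", float64(total)/elapsed/1e6)
}
```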
Edit: Another confusing aspect of this is that we don't do just-in-time compression; we compress everything at once and then push the data.
Where do you see that? 🤔 I thought it was all streaming. We shell out to `tar`, but that should stream too.
@vito ohhhh, so that's why we shell out; I thought it was to avoid writing to a temporary file. My other comment is invalid, then; it totally makes sense why we're seeing this bottleneck. It doesn't matter if we RAID or have a fast network if we can't push bytes fast enough.
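For reference, shelling out to tar really can stream end to end; a rough sketch of the idea, where `streamVolume` and the exact tar flags are illustrative rather than Concourse's actual invocation:

```go
package streaming

import (
	"net"
	"os/exec"
)

// streamVolume shells out to tar and points its stdout straight at the
// network connection: the gzipped archive is produced and sent as one
// continuous stream, never landing in a temporary file.
func streamVolume(conn net.Conn, volumePath string) error {
	cmd := exec.Command("tar", "-czf", "-", "-C", volumePath, ".")
	cmd.Stdout = conn // tar writes directly into the socket
	return cmd.Run()
}
```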
Update:
🎊
Are one or both of these likely to land in 5.4?
Hey,
I've been noticing that even though we recently got better utilization of our worker nodes when it comes to system CPU usage (see "Why our system CPU usage is so high?"), our throughput when streaming from one worker to another is still pretty disappointing.
ps.: those bits going from worker to worker are compressed (gzip); still, there's a lot of bandwidth left to be utilized
ps.: in the image above, the thin pink line at 400Mbps is a web instance piping data from one worker to another, thus 200 + 200 Mbps
In order to test the theoretical maximum throughput that we could get on GCP, I started experimenting with stressing some of the edges in terms of streaming performance.
Here are some results.
Machines used
In the following tests, the same machine type was utilized (2 instances) with varying disk configurations:
Streaming directly from one node to the other without any disk IO
In this scenario, all we need to do is make the receiving side discard all of the input it receives. We can do that by taking everything that comes in from the listening socket and writing it straight to /dev/null, as in the sketch below.
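A minimal sketch of such a discard-everything receiver in Go (the actual test could just as well use nc or iperf; the port is arbitrary):

```go
package main

import (
	"fmt"
	"io"
	"net"
)

// Accept a single connection and discard everything it sends, so the
// receiving side does zero disk IO and we measure the network path alone.
func main() {
	ln, err := net.Listen("tcp", ":9999") // arbitrary port
	if err != nil {
		panic(err)
	}
	conn, err := ln.Accept()
	if err != nil {
		panic(err)
	}
	n, err := io.Copy(io.Discard, conn) // the Go equivalent of > /dev/null
	if err != nil {
		panic(err)
	}
	fmt.Printf("discarded %d bytes\n", n)
}
```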
What can be observed in this case is that GCP indeed delivers what it promises, giving a consistent ~10Gbps of network throughput.
This is especially true given that we're able to get a consistent worker-to-worker bandwidth of 10Gbps without much effort:
Receiver:
Transmitter:
With the results above, we can establish that without any disk IO we can get the promised 10Gbps of network throughput.
Streaming from one node to another, writing to disk when receiving
Now, changing the receiver side to write to a disk that gets attached to the node, we're able to see how the story changes; the change to the receiver is sketched below.
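The only difference from the previous receiver is where the bytes land; the mount point below is illustrative:

```go
package main

import (
	"io"
	"net"
	"os"
)

// Same receiver as before, except the bytes now land on the attached
// persistent disk instead of /dev/null, so disk IO becomes part of
// the measurement.
func main() {
	ln, err := net.Listen("tcp", ":9999")
	if err != nil {
		panic(err)
	}
	conn, err := ln.Accept()
	if err != nil {
		panic(err)
	}
	f, err := os.Create("/mnt/disks/pd/received.dat") // illustrative mount point
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if _, err := io.Copy(f, conn); err != nil {
		panic(err)
	}
}
```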
1TB non-SSD persistent disk
GCP promise:
ps.: 120 MB/s ≈ 960 Mbps
Actual:
Looking at the results, it seems like we're hitting the bytes-per-second upper limit:
Now, we can clearly see:
1TB SSD persistent disk
GCP promise:
ps.: 480 MB/s ≈ 3.8 Gbps
Actual:
Now, instead of being throttled at the bytes upper limit, we're throttled at the IOPS limit, even though the bytes number ends up quite similar anyway.

Extra - Local SSDs
I was curious about how much perf we could get using Local SSDs.
Without any particular tuning applied to their filesystems, I went forward with three configs, all using the same machine type:
1 Local SSD (375GB)
3 Local SSDs (3 x 375GB) on Software RAID 0 EXT4
3 Local SSDs (3 x 375GB) on Software RAID 0 XFS
Next steps
It's interesting to see how the software-based RAID didn't help much, and that the default config for a 1TB pd-ssd was the setup that achieved the highest throughput.

Could that be a flaw in the way that I'm testing these setups? Very likely!
With a baseline for the best throughput we can get out of disk IO combined with network throughput, we can go ahead with some more testing: either adding alternate compression algorithms, or being more aggressive in streaming by doing it with more concurrency (it's currently done in serial). A rough sketch of the concurrent approach follows below.
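For illustration only, here's what fanning the transfers out could look like in Go; `streamAll` and `streamVolume` are hypothetical names, not Concourse's actual code:

```go
package streaming

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// streamVolume stands in for whatever opens a connection to the
// destination worker and streams one volume's archive through it.
func streamVolume(ctx context.Context, volume string) error {
	// ... dial, tar, copy ...
	return nil
}

// streamAll pushes each volume concurrently instead of one after
// another, so the whole operation is bounded by the slowest transfer
// rather than the sum of all of them.
func streamAll(ctx context.Context, volumes []string) error {
	g, ctx := errgroup.WithContext(ctx)
	for _, volume := range volumes {
		volume := volume // capture the loop variable (pre-Go 1.22)
		g.Go(func() error {
			return streamVolume(ctx, volume)
		})
	}
	return g.Wait()
}
```

In practice the fan-out would want a cap (for example errgroup's SetLimit) so that compression CPU doesn't overload the worker, which is exactly the gzip concern raised earlier in this thread.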
Thanks!