concourse / hush-house

Concourse k8s-based environment
https://hush-house.pivotal.io

hh: Improve our streaming throughput #49

Closed: cirocosta closed this issue 4 years ago

cirocosta commented 5 years ago

Hey,


I've been noticing that even though we recently got better utilization of our worker nodes when it comes to system CPU usage (see "Why our system CPU usage is so high?"), our throughput when streaming from one worker to another is still pretty disappointing.


regular workload

ps.: the bits going from worker to worker are compressed (gzip) - still, there's a lot of bandwidth left to be utilized

ps.: in the image above, the thin pink line at 400Mbps is a web instance piping data from one worker to another, thus 200 + 200 Mbps


In order to test the theoretical maximum throughput that we could get on GCP, I started experimenting with stressing some of the edges in terms of streaming performance.

Here are some results.



Machines used

In the following tests, the same machine type was utilized (2 instances) with varying disk configurations:


Streaming directly from one node to the other without any disk IO

In this scenario, all we need to do is make the receiving side discard all of the input it receives.

We can do that by taking everything that comes in from the listening socket and writing it straight to /dev/null.
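For illustration, here's a minimal sketch of that receiving side in Go; the port and the exact mechanics are assumptions for the sketch, not what was actually run to get these numbers:

```go
// receiver.go - accept a TCP connection and throw away everything read from
// it, reporting the effective network throughput. A sketch only; io.Discard
// plays the role of /dev/null here, and port 9999 is arbitrary.
package main

import (
	"fmt"
	"io"
	"log"
	"net"
	"time"
)

func main() {
	ln, err := net.Listen("tcp", ":9999")
	if err != nil {
		log.Fatal(err)
	}
	conn, err := ln.Accept()
	if err != nil {
		log.Fatal(err)
	}

	start := time.Now()
	n, err := io.Copy(io.Discard, conn) // discard every byte received
	if err != nil {
		log.Fatal(err)
	}

	secs := time.Since(start).Seconds()
	fmt.Printf("received %d bytes (%.2f Gbps)\n", n, float64(n)*8/1e9/secs)
}
```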

What can be observed in this case is that GCP indeed delivers what it promises: we're able to get a consistent worker-to-worker bandwidth of ~10Gbps without much effort:

Receiver:

Transmitter:

With the results above, we can then establish that without any disk IO, we can get the promised 10Gbps of network throughput.


Streaming from one node to another, writing to disk when receiving

Now, changing the receiver side to write to a disk that gets attached to the node, we're able to see how the story changes.
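The only change on the receiving side is where the bytes go: instead of discarding them, we sink them into a file on the attached disk. A sketch under the same assumptions as before; the mount path is hypothetical:

```go
// receiver-to-disk.go - same receiver, but sinking the stream into a file on
// the attached persistent disk instead of /dev/null. "/mnt/disks/pd/sink" is
// a hypothetical mount path, not the one used in these tests.
package main

import (
	"io"
	"log"
	"net"
	"os"
)

func main() {
	ln, err := net.Listen("tcp", ":9999")
	if err != nil {
		log.Fatal(err)
	}
	conn, err := ln.Accept()
	if err != nil {
		log.Fatal(err)
	}

	f, err := os.Create("/mnt/disks/pd/sink")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Throughput here is bounded by the disk's sustained write limits,
	// not only by the 10Gbps network.
	if _, err := io.Copy(f, conn); err != nil {
		log.Fatal(err)
	}
}
```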

1TB non-SSD persistent disk

GCP promise:

| Operation type | Read | Write |
| --- | --- | --- |
| Sustained random IOPS limit | 750 | 1,500 |
| Sustained throughput limit (MB/s) | 120 | 120 |

ps.: 120 MB/s ~ 960Mbps

Actual:

Looking at the results, it seems like we're hitting the throughput (bytes) upper limit:

Now, we can clearly see:

1TB SSD persistent disk

GCP Promise:

| Operation type | Read | Write |
| --- | --- | --- |
| Sustained random IOPS limit | 25,000 | 1,500 |
| Sustained throughput limit (MB/s) | 480 | 480 |

ps.: 480 MB/s ~ 3.8Gbps

Actual:

Now, instead of being throttled at the bytes upper limit, we're throttled at the IOPS limit, even though the byte throughput ends up being quite similar anyway.


Extra - Local SSDs

I was curious about how much perf we could get using Local SSDs.

Without any particular tuning applied to the filesystems on them, I went forward with three configs, all using the same machine type:

1 Local SSD (375GB)

3 Local SSDs (3 x 375GB) on Software RAID 0 EXT4

3 Local SSDs (3 x 375GB) on Software RAID 0 XFS

Next steps

It's interesting to see how the software-based RAID didn't help much, and that the default config for a 1TB pd-ssd was the configuration that achieved the highest throughput.

Could that be a flaw in the way that I'm testing these setups? Very likely!

With some baseline for the best throughput we can get in terms of disk IO combined with network throughput, we can go ahead with more testing: either adding alternate compression algorithms, or perhaps being more aggressive in streaming by doing it with more concurrency (it's currently done serially).
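As a rough sketch of the compression side of that (swapping gzip for zstd on the stream), something like the following could be used for a quick comparison; it assumes github.com/klauspost/compress and is purely illustrative, not the actual Concourse change:

```go
// Sketch of the compressor swap being discussed: stream through zstd instead
// of gzip. Assumes github.com/klauspost/compress/zstd; not Concourse code.
package main

import (
	"compress/gzip"
	"io"
	"log"
	"os"

	"github.com/klauspost/compress/zstd"
)

// compressTo wraps dst with the chosen compressor and streams src through it.
func compressTo(dst io.Writer, src io.Reader, useZstd bool) error {
	var w io.WriteCloser
	if useZstd {
		zw, err := zstd.NewWriter(dst) // typically far cheaper on CPU per byte than gzip
		if err != nil {
			return err
		}
		w = zw
	} else {
		w = gzip.NewWriter(dst)
	}
	if _, err := io.Copy(w, src); err != nil {
		return err
	}
	return w.Close() // flush the remaining compressed data
}

func main() {
	// Example: compress stdin to stdout with zstd, e.g. to compare against gzip.
	if err := compressTo(os.Stdout, os.Stdin, true); err != nil {
		log.Fatal(err)
	}
}
```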

Thanks!

jchesterpivotal commented 5 years ago

or perhaps being more aggressive in streaming by doing it with more concurrency (it's currently done serially).

I feel like this will be the biggest win in the long run.

However, zstd looks like a quick win because (I assume) it won't require a lot of changes.

kcmannem commented 5 years ago

We're looking to have parallel streaming in with zstd; both are quick wins, but gzip uses too much CPU, which could overload the workers if we just did parallel streaming.

kcmannem commented 5 years ago

The fact that raid0 doesn't seem to be helping is really strange to me. I don't know what "software raid" is, but the fact that you have double the physical wires able to push data through should make plain writes faster. We should do the same test with RAID and measure the write speeds without Concourse. If we meet the GCP promise, it's highly likely that our streaming really is the big bottleneck.

Edit: Another confusing aspect to this is that we don't do JIT compression; we compress all at once and then push the data.

vito commented 5 years ago

Edit: Another confusing aspect to this is that we don't do JIT compression; we compress all at once and then push the data.

Where do you see that? 🤔 I thought it was all streaming. We shell out to tar but that should stream too.
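For reference, the shape being described looks roughly like the sketch below (illustrative Go, not the actual Concourse code): tar's stdout is piped straight into the compressor and onto the connection, so nothing gets buffered to a temporary file along the way.

```go
// Sketch of the streaming path: shell out to tar, pipe its stdout through a
// gzip writer, and onto the connection, with no intermediate file.
// Illustrative only; not Concourse's actual implementation.
package streaming

import (
	"compress/gzip"
	"io"
	"net"
	"os/exec"
)

// StreamDir tars dir, gzips it on the fly, and writes it to conn.
func StreamDir(dir string, conn net.Conn) error {
	cmd := exec.Command("tar", "-c", "-C", dir, ".")
	out, err := cmd.StdoutPipe()
	if err != nil {
		return err
	}
	if err := cmd.Start(); err != nil {
		return err
	}

	gw := gzip.NewWriter(conn)
	if _, err := io.Copy(gw, out); err != nil {
		return err
	}
	if err := gw.Close(); err != nil {
		return err
	}
	return cmd.Wait()
}
```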

kcmannem commented 5 years ago

@vito ohhhh that’s why we shell out, I thought it was to avoid writing to a temporary file. My other comment is invalid; it totally makes sense why we’re seeing this bottleneck then. It doesn’t matter if we RAID or have a fast network if we can’t push bytes fast enough.

cirocosta commented 5 years ago

Update:

🎊

jchesterpivotal commented 5 years ago

Are one or both of these likely to land in 5.4?