> or perhaps even being more aggressive in terms of streaming by doing that with more concurrency (it's currently done in serial).
I feel like this will be the biggest win in the long run.
However, zstd looks like a quick win because (I assume) it won't require a lot of changes.
We're looking to get parallel streaming in along with zstd. Both are quick wins, but gzip uses too much CPU, which could overload the workers if we only did parallel streaming.
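To make that concrete, here's a minimal sketch of what switching the streaming path from gzip to zstd could look like in Go, assuming a library like github.com/klauspost/compress/zstd; `compressStream` and its wiring are hypothetical, not Concourse's actual code:

```go
package streaming

import (
	"compress/gzip"
	"io"

	"github.com/klauspost/compress/zstd"
)

// compressStream wraps dst with the chosen compressor and copies src
// through it, so bytes are compressed as they stream rather than
// being buffered up front.
func compressStream(dst io.Writer, src io.Reader, useZstd bool) error {
	var w io.WriteCloser
	if useZstd {
		zw, err := zstd.NewWriter(dst)
		if err != nil {
			return err
		}
		w = zw
	} else {
		w = gzip.NewWriter(dst)
	}
	if _, err := io.Copy(w, src); err != nil {
		w.Close()
		return err
	}
	return w.Close() // flush any remaining compressed frames
}
```

zstd typically reaches a gzip-like compression ratio at a fraction of the CPU cost per byte, which is why it pairs well with more concurrency.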
The fact that RAID 0 doesn't seem to be helping is really strange to me. I don't know what "software raid" is, but having double the physical wires able to push data through should make plain writes faster. We should run the same test with RAID and measure the write speeds without Concourse; a sketch of such a measurement follows below. If we meet the GCP promise, it's highly likely that our streaming really is the big bottleneck.
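To ground that test, a minimal sketch of measuring raw sequential write throughput with no Concourse (and no network) in the path; the mount point and sizes here are illustrative:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// Write a few GiB of zeroes sequentially and time it, so the raw disk
// numbers can be compared against GCP's advertised rates directly.
func main() {
	const chunk = 4 << 20 // 4 MiB per write
	const total = 4 << 30 // 4 GiB overall
	buf := make([]byte, chunk)

	f, err := os.Create("/mnt/raid0/throughput-test") // illustrative mount point
	if err != nil {
		panic(err)
	}
	defer f.Close()

	start := time.Now()
	for written := int64(0); written < total; written += chunk {
		if _, err := f.Write(buf); err != nil {
			panic(err)
		}
	}
	if err := f.Sync(); err != nil { // flush to disk so the timing is honest
		panic(err)
	}
	elapsed := time.Since(start).Seconds()
	fmt.Printf("%.0f MB/s\n", float64(total)/elapsed/1e6)
}
```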
Edit: Another confusing aspect of this is that we don't do just-in-time compression; we compress everything at once and then push the data.
Where do you see that? 🤔 I thought it was all streaming. We shell out to `tar`, but that should stream too.
@vito ohhhh, so that's why we shell out; I thought it was to avoid writing to a temporary file. My other comment is invalid, then; it totally makes sense why we're seeing this bottleneck. It doesn't matter if we RAID or have a fast network if we can't push bytes fast enough.
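For reference, shelling out to tar really can stream end to end; a rough sketch of the idea, where `streamVolume` and the exact tar flags are illustrative rather than Concourse's actual invocation:

```go
package streaming

import (
	"net"
	"os/exec"
)

// streamVolume shells out to tar and points its stdout straight at the
// network connection: the gzipped archive is produced and sent as one
// continuous stream, never landing in a temporary file.
func streamVolume(conn net.Conn, volumePath string) error {
	cmd := exec.Command("tar", "-czf", "-", "-C", volumePath, ".")
	cmd.Stdout = conn // tar writes directly into the socket
	return cmd.Run()
}
```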
Update:
🎊
Are one or both of these likely to land in 5.4?
Hey,
I've been noticing that even though we recently got better utilization of our worker nodes when it comes to system CPU usage (see "Why our system CPU usage is so high?"), our throughput when streaming from one worker to another is still pretty disappointing.
ps.: those bits going from worker to worker are compressed (gzip); still, there's a lot of bandwidth left to be utilized
ps.: in the image above, the thin pink line at 400Mbps is a web instance piping data from one worker to another, thus 200 + 200 Mbps
In order to test the theoretical maximum throughput that we could get on GCP, I started experimenting with stressing some of the edges in terms of streaming performance.
Here are some results.
Machines used
In the following tests, the same machine type was utilized (2 instances) with varying disk configurations:
Streaming directly from one node to the other without any disk IO
In this scenario, all we need to do is make the receiving side discard all of the input it receives. We can do that by taking everything that comes in from the listening socket and writing it straight to /dev/null, as in the sketch below.
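A minimal sketch of such a discard-everything receiver in Go (the actual test could just as well use nc or iperf; the port is arbitrary):

```go
package main

import (
	"fmt"
	"io"
	"net"
)

// Accept a single connection and discard everything it sends, so the
// receiving side does zero disk IO and we measure the network path alone.
func main() {
	ln, err := net.Listen("tcp", ":9999") // arbitrary port
	if err != nil {
		panic(err)
	}
	conn, err := ln.Accept()
	if err != nil {
		panic(err)
	}
	n, err := io.Copy(io.Discard, conn) // the Go equivalent of > /dev/null
	if err != nil {
		panic(err)
	}
	fmt.Printf("discarded %d bytes\n", n)
}
```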
What can be observed in this case is that GCP indeed delivers what it promises, giving a consistent ~10Gbps of network throughput.
This is especially true given that we're able to get a consistent worker-to-worker bandwidth of 10Gbps without much effort:
Receiver:
Transmitter:
With the results above, we can establish that without any disk IO we can get the promised 10Gbps of network throughput.
Streaming from one node to another, writing to disk when receiving
Now, changing the receiver side to write to a disk that gets attached to the node, we're able to see how the story changes; the change to the receiver is sketched below.
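The only difference from the previous receiver is where the bytes land; the mount point below is illustrative:

```go
package main

import (
	"io"
	"net"
	"os"
)

// Same receiver as before, except the bytes now land on the attached
// persistent disk instead of /dev/null, so disk IO becomes part of
// the measurement.
func main() {
	ln, err := net.Listen("tcp", ":9999")
	if err != nil {
		panic(err)
	}
	conn, err := ln.Accept()
	if err != nil {
		panic(err)
	}
	f, err := os.Create("/mnt/disks/pd/received.dat") // illustrative mount point
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if _, err := io.Copy(f, conn); err != nil {
		panic(err)
	}
}
```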
1TB non-SSD persistent disk
GCP promise:
ps.: 120 MB/s ≈ 960 Mbps
Actual:
Looking at the results, it seems like we're hitting the bytes-per-second upper limit:
Now, we can clearly see:
1TB SSD persistent disk
GCP promise:
ps.: 480 MB/s ≈ 3.8 Gbps
Actual:
Now, instead of being throttled at the bytes upper limit, we're throttled at the IOPS limit, even though the bytes number ends up quite similar anyway.

Extra - Local SSDs
I was curious about how much perf we could get using Local SSDs.
Without any particular tuning applied to their filesystems, I went forward with three configs, all using the same machine type:
1 Local SSD (375GB)
3 Local SSDs (3 x 375GB) on Software RAID 0 EXT4
3 Local SSDs (3 x 375GB) on Software RAID 0 XFS
Next steps
It's interesting to see how the software-based RAID didn't help much, and that the default config for a 1TB pd-ssd was the setup that achieved the highest throughput.

Could that be a flaw in the way that I'm testing these setups? Very likely!
With a baseline for the best throughput we can get out of disk IO combined with network throughput, we can go ahead with some more testing: either adding alternate compression algorithms, or being more aggressive in streaming by doing it with more concurrency (it's currently done in serial). A rough sketch of the concurrent approach follows below.
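For illustration only, here's what fanning the transfers out could look like in Go; `streamAll` and `streamVolume` are hypothetical names, not Concourse's actual code:

```go
package streaming

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// streamVolume stands in for whatever opens a connection to the
// destination worker and streams one volume's archive through it.
func streamVolume(ctx context.Context, volume string) error {
	// ... dial, tar, copy ...
	return nil
}

// streamAll pushes each volume concurrently instead of one after
// another, so the whole operation is bounded by the slowest transfer
// rather than the sum of all of them.
func streamAll(ctx context.Context, volumes []string) error {
	g, ctx := errgroup.WithContext(ctx)
	for _, volume := range volumes {
		volume := volume // capture the loop variable (pre-Go 1.22)
		g.Go(func() error {
			return streamVolume(ctx, volume)
		})
	}
	return g.Wait()
}
```

In practice the fan-out would want a cap (for example errgroup's SetLimit) so that compression CPU doesn't overload the worker, which is exactly the gzip concern raised earlier in this thread.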
Thanks!