JackKelly / light-speed-io

Read & decompress many chunks of files at high speed
MIT License

High performance cloud object storage (for reading chunked multi-dimensional arrays) #10

Open JackKelly opened 7 months ago

JackKelly commented 7 months ago

Some notes (in no particular order) about speeding up reading many chunks of data from cloud storage.

General info about cloud storage buckets

In general, cloud storage buckets are highly distributed storage systems. They can be capable of very high bandwidth. But - crucially - they also have very high latencies of 100-200 milliseconds for first-byte-out.

Contrast this with read latencies of locally attached storage: HDD: ~5 ms; SSD: ~80 µs; DDR5 RAM: 90 ns. In other words: cloud storage latencies are two orders of magnitude higher than a local HDD; and cloud storage latencies are four orders of magnitude higher than a local SSD!

How many IOPS can we get from cloud storage to a single machine?

AWS docs say:

Your applications can easily achieve thousands of transactions per second in request performance when uploading and retrieving storage from Amazon S3. Amazon S3 automatically scales to high request rates [although the scaling takes time]. For example, your application can achieve at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per partitioned Amazon S3 prefix. There are no limits to the number of prefixes in a bucket. You can increase your read or write performance by using parallelization. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per second...

Some data lake applications on Amazon S3 scan millions or billions of objects for queries that run over petabytes of data. These data lake applications achieve single-instance transfer rates that maximize the network interface use for their Amazon EC2 instance, which can be up to 100 Gb/s on a single instance. These applications then aggregate throughput across multiple instances to get multiple terabits per second.

(Also see this Stack Overflow discussion.)

So... it sounds like we could achieve arbitrarily high IOPS by using lots of prefixes.

How fast can a single VM go? A 100 Gbps NIC (call it roughly 10 GB/s of goodput, allowing for protocol overhead) could submit 20 million GET requests per second (assuming each GET request is 500 bytes on the wire). And the same NIC could receive 2 million 4 kB chunks per second (assuming ~20% overhead, so each response is 5 kB on the wire); or 2,000 x 4 MB chunks per second.

To hit 2 million IOPS at 5,500 GET requests per second per prefix, you'd need to evenly distribute your reads across at least 364 prefixes (2,000,000 / 5,500 ≈ 364).
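The arithmetic behind those round numbers can be sketched like this (all sizes are the assumptions stated above, including the rough 10-bits-per-byte goodput conversion, not measurements):

```python
# Back-of-envelope arithmetic for the figures above.
NIC_GOODPUT_BYTES_PER_SEC = 100e9 / 10   # 100 Gbps at a rough 10 bits/byte

GET_REQUEST_BYTES = 500                  # assumed size of one GET on the wire
gets_submitted = NIC_GOODPUT_BYTES_PER_SEC / GET_REQUEST_BYTES

SMALL_CHUNK_WIRE_BYTES = 5_000           # 4 kB chunk, ~20% of wire total is overhead
small_chunks_received = NIC_GOODPUT_BYTES_PER_SEC / SMALL_CHUNK_WIRE_BYTES

LARGE_CHUNK_WIRE_BYTES = 5_000_000       # 4 MB chunk, same overhead assumption
large_chunks_received = NIC_GOODPUT_BYTES_PER_SEC / LARGE_CHUNK_WIRE_BYTES

S3_GETS_PER_SEC_PER_PREFIX = 5_500       # from the AWS docs quoted above
prefixes_needed = -(-2_000_000 // S3_GETS_PER_SEC_PER_PREFIX)  # ceiling division

print(f"{gets_submitted:,.0f} GETs/s submitted")      # 20,000,000
print(f"{small_chunks_received:,.0f} 4 kB chunks/s")  # 2,000,000
print(f"{large_chunks_received:,.0f} 4 MB chunks/s")  # 2,000
print(f"{prefixes_needed} prefixes for 2M IOPS")      # 364
```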

2 million requests per second per machine is a lot! It wasn't that long ago that it was considered very impressive if a single HTTP server could handle 10,000 RPS. I'm not even sure if 2 million RPS is actually possible! Here's a benchmark of a proxy server handling 2 million RPS, on a single machine.

I'd assume that, in order for the operating system to handle anything close to 2 million requests per second, we'd have to submit our GET requests using io_uring.

Some "tricks" that LSIO could use to improve read bandwidth from cloud storage buckets to a single VM:

JackKelly commented 5 months ago

More info about high-speed IO from Google Cloud Storage:

For cloud buckets, re-implement the cloud storage API using io_uring. Although cloud storage buckets can be severely limited in terms of read IOPS. For example, the Google Cloud Storage docs state that "buckets have an initial IO capacity of... Approximately 5,000 object read requests per second, which includes listing objects, reading object data, and reading object metadata."

However, the documentation goes on to say: "A longer randomized prefix provides more effective auto-scaling when ramping to very high read and write rates. For example, a 1-character prefix using a random hex value provides effective auto-scaling from the initial 5,000/1,000 reads/writes per second up to roughly 80,000/16,000 reads/writes per second, because the prefix has 16 potential values." This suggests that a longer random prefix could increase read IOPS indefinitely. The "quotas" page says the "Maximum rate of object reads in a bucket" is "unlimited". Bandwidth is limited to 200 Gbps (about 20 GB/s). AWS S3 has similar behaviour (also see this Stack Overflow question).

This does raise a problem for Zarr sharding, though: you might actually want to take chunks which are nearby in n-dim space and spread them randomly across many shards, where each shard has a random prefix in its filename (although that forces you to make lots of requests; it may be better to keep nearby chunks close together and coalesce reads).
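As a sketch of the randomized-prefix idea (the key-naming scheme below is hypothetical, not anything GCS or Zarr prescribes): hash each object key and use the first hex character(s) of the digest as a prefix, so reads spread evenly across the bucket's auto-scaled key ranges while staying reproducible on read.

```python
import hashlib

def prefixed_key(key: str, prefix_len: int = 1) -> str:
    """Prepend a pseudo-random (but deterministic) hex prefix to an object
    key. A 1-character hex prefix has 16 possible values, which the GCS
    docs say scales the initial read capacity roughly 16x."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return f"{digest[:prefix_len]}/{key}"

# Hypothetical Zarr shard filenames, spread across up to 16 prefixes:
keys = [prefixed_key(f"my_array/shard_{i:04d}") for i in range(256)]
```

Because the prefix is a hash of the key itself, a reader can recompute the full object name without any lookup table.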

JackKelly commented 3 months ago

Another "trick" to improve performance: use HTTP/3 endpoints:

https://www.linkedin.com/pulse/http-10-vs-11-20-30-swadhin-pattnaik

GCS supports HTTP/3.

Also, on a different topic: when LSIO merges byte ranges, it should avoid hitting API rate limits on the number of GET requests.
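A minimal sketch of byte-range coalescing (the gap threshold is a made-up tuning knob, not an LSIO parameter): merge ranges whose gap is small enough that re-downloading the gap is cheaper than paying the latency and rate-limit cost of an extra GET.

```python
def coalesce_ranges(ranges, max_gap=1_000_000):
    """Merge (start, end) byte ranges whose gaps are <= max_gap bytes,
    trading a little wasted bandwidth for far fewer GET requests."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            last_start, last_end = merged[-1]
            merged[-1] = (last_start, max(last_end, end))
        else:
            merged.append((start, end))
    return merged

# Three nearby chunks plus one distant chunk -> two GETs instead of four:
reads = [(0, 4096), (4200, 8192), (9000, 12000), (10_000_000, 10_004_096)]
print(coalesce_ranges(reads, max_gap=64_000))
# [(0, 12000), (10000000, 10004096)]
```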

JackKelly commented 3 months ago

Although it sounds like HTTP/3 isn't always faster.

And, in object_store, using HTTP/2 with GCS is currently considerably slower than using HTTP/1.1 (that issue also has some interesting benchmarks showing 1.3 GB/s from GCS with 20 concurrent connections using HTTP/1.1). Although that might be due to an issue with hyper.

Related

JackKelly commented 3 months ago

One other thought: Z-Order will be very important for fast reads on cloud object storage.
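To illustrate why Z-order helps (a generic Morton-code sketch, not LSIO or Zarr code): interleaving the bits of the chunk coordinates gives a 1-D index under which chunks that are close in n-dim space mostly stay close in the object store's key space, so their byte ranges can be coalesced into fewer GETs.

```python
def morton_2d(x: int, y: int) -> int:
    """Interleave the bits of x and y into a Z-order (Morton) index."""
    z = 0
    for i in range(32):  # enough for 32-bit coordinates
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

# The four chunks of a 2x2 neighbourhood get consecutive indices 0..3:
print([morton_2d(x, y) for y in range(2) for x in range(2)])  # [0, 1, 2, 3]
```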

Also note that, when performing ranged GETs on GCS, the checksum is computed over the complete object, so it is not useful for verifying the integrity of a single chunk. We'll probably need a checksum for each chunk in the Zarr metadata.
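A sketch of what per-chunk verification could look like (CRC32 is just an illustrative stand-in here for whatever checksum the Zarr metadata ends up carrying):

```python
import zlib

def checksum_chunks(chunks: dict) -> dict:
    """Compute one CRC32 per chunk, so each ranged GET can be verified on
    its own instead of relying on the whole-object checksum."""
    return {key: zlib.crc32(data) for key, data in chunks.items()}

# Hypothetical manifest of per-chunk checksums:
manifest = checksum_chunks({"c/0/0": b"chunk-0-bytes", "c/0/1": b"chunk-1-bytes"})

def verify(key: str, data: bytes) -> bool:
    """Check the bytes returned by a ranged GET against the manifest."""
    return zlib.crc32(data) == manifest[key]

print(verify("c/0/0", b"chunk-0-bytes"))  # True
print(verify("c/0/0", b"corrupted!"))     # False
```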

JackKelly commented 2 months ago

Specific cloud VMs with high-performance networking (VM-to-VM bandwidth)

AWS

In summary: the AWS instances with the fastest network adaptors tend to be the "compute optimised" instances whose names end in n (for networking).

Google Cloud

See:

In summary: Some VMs support 100 Gbps. Some 200 Gbps. Some 1,000 Gbps!!

JackKelly commented 2 months ago

References

JackKelly commented 1 month ago

Papers to read

JackKelly commented 1 month ago

io_uring:

JackKelly commented 1 month ago

Also see: 2023: Google Posts Experimental Linux Code For "Device Memory TCP" - Network To/From Accelerator RAM. "If Devmem TCP takes flight, it will allow the accelerator/GPU/device memory to be exposed directly to the network for inbound/outbound traffic."

The zero-copy network rx io_uring folks are talking to the Google folks.

JackKelly commented 2 weeks ago

See this paper:

"Exploiting Cloud Object Storage for High-Performance Analytics", Dominik Durner, Viktor Leis, and Thomas Neumann, PVLDB 16 (11), 2023, 49th International Conference on Very Large Data Bases

(Hat tip to Nick Gates for telling me about this paper!)

Very interesting stuff. Lots of useful graphs. Including:

[image: benchmark graphs from the paper]

(note to self: I should aim for a similar graph for LSIO)

JackKelly commented 1 week ago

https://dynamical.org are doing a great job, publishing NWPs as Zarr. Maybe one of my areas of focus could be to provide high-performance read access to data on dynamical.org? e.g. to train ML models directly from dynamical.org's Zarrs?