coreweave / tensorizer

Module, Model, and Tensor Serialization/Deserialization
MIT License

Is there support for downloading via s5cmd? #133

Closed · vrdn-23 closed 4 months ago

vrdn-23 commented 4 months ago

First of all, thanks for the awesome project! I was just curious whether there is an option, or a plan, to support downloading the model files via s5cmd, which would let us obtain even higher network throughput?

Eta0 commented 4 months ago

Hello! Thanks for the thanks!

Currently, tensorizer uses curl for downloads. curl is a very mature and optimized HTTP client and generally performs faster than s5cmd, and especially so in the way that we use it.

s5cmd usually has two advantages:

  1. It can execute a batch of requests for many different objects at once (controlled with s5cmd --numworkers), and
  2. It can request an individual object concurrently in several segments (controlled with s5cmd cp --concurrency)
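For illustration, those two knobs look like this on the command line (the bucket and object names are placeholders, not anything from this project):

```shell
# Batch concurrency: up to 16 workers pulling many objects at once
s5cmd --numworkers 16 cp 's3://example-bucket/shards/*' ./shards/

# Per-object concurrency: one object fetched in 8 parallel segments
s5cmd cp --concurrency 8 s3://example-bucket/model.tensors ./
```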

For tensorizer's use case, we only need to request a single object during deserialization, since tensorizer module files are not sharded, so batch requests won't help. For concurrency in downloading an individual object, this is implemented natively in tensorizer since release v2.9.0 and controlled with the num_workers parameter for a TensorDeserializer.
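Using that parameter looks roughly like the following sketch. The URI, worker count, and the pre-existing `model` module are placeholders; check the tensorizer documentation for the exact signatures.

```python
# Sketch, assuming tensorizer >= 2.9.0 is installed; not runnable as-is
# without the package, object storage credentials, and a real model URI.
from tensorizer import TensorDeserializer

deserializer = TensorDeserializer(
    "s3://example-bucket/model.tensors",  # placeholder URI
    num_workers=8,  # number of concurrent readers for the single object
)
deserializer.load_into_module(model)  # `model` is an existing torch.nn.Module
deserializer.close()
```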

Single-object concurrency with the tensorizer format has more nuance than a general-purpose downloader like s5cmd can take advantage of. For a general-purpose utility like s5cmd, the only goal is to get a stream of bytes from a server onto the disk, so the best strategy is to cut a file into equally sized segments (e.g. 0% to 20%, 20% to 40%, ...) and download them all in parallel. In the tensorizer format, however, there are logical boundaries within a file corresponding to:
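The naive equal-split strategy can be sketched in a few lines (a toy illustration of the general approach, not tensorizer code):

```python
def equal_ranges(total_size: int, n: int) -> list[tuple[int, int]]:
    """Split [0, total_size) into n near-equal half-open byte ranges."""
    base, extra = divmod(total_size, n)
    ranges = []
    start = 0
    for i in range(n):
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges

# A 100-byte file split five ways cuts at every 20% mark,
# regardless of any structure inside the file:
print(equal_ranges(100, 5))  # [(0, 20), (20, 40), (40, 60), (60, 80), (80, 100)]
```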

  1. The file header,
  2. Metadata for the first tensor,
  3. Raw data of the first tensor,
  4. Metadata for the second tensor,
  5. Raw data for the second tensor, etc.

And cutting into the file at an arbitrary position (like the 20% mark) will most likely yield an unusable stream of data that starts partway through one of the tensors. Once all the downloads have completed, the file can be reassembled and read as a whole, but until then, the individual parts are almost useless.
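To make the boundary problem concrete, here is a toy layout in the same spirit (the sizes and record structure are invented for illustration; the real tensorizer format differs):

```python
# Toy layout: a file header followed by a (metadata, data) pair per tensor.
HEADER_SIZE = 64
META_SIZE = 32  # fixed-size metadata record per tensor, for simplicity
tensors = [("layer0.weight", 4096), ("layer0.bias", 512), ("layer1.weight", 4096)]

# Legal cut points: the end of the header, then the end of each tensor record.
boundaries = [0, HEADER_SIZE]
offset = HEADER_SIZE
for name, nbytes in tensors:
    offset += META_SIZE + nbytes
    boundaries.append(offset)

print(boundaries)            # [0, 64, 4192, 4736, 8864]
print(boundaries[-1] // 5)   # 1772: the "20% mark" lands mid-tensor,
                             # between boundaries 64 and 4192
```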

In contrast, tensorizer is designed for streaming processing, so that all bookkeeping operations (like checksum validation) and transfers to the GPU can happen immediately as each tensor arrives from the network, without buffering the entire file in CPU RAM or on the disk. Streaming combined with concurrent processing of the received data allows for better utilization of system resources, lower latency, and the ability to stream in models much larger than CPU RAM.

In our implementation, we choose an optimally balanced split of the file across $n$ concurrent readers that cuts exactly along logical tensor and header data boundaries using our own linear partitioning algorithm, and begin parallel network transfers on each segment. Starting from those boundaries, we can perform streaming processing in each of the readers simultaneously, without ever needing to reassemble the file, which gives us high throughput coming in from the network and high throughput in getting that data into a usable state on the GPU. By the time all concurrent streams finish downloading, the file is already fully processed. With s5cmd, we would not be able to track the positions and progress of separate segments of the file as they are downloading, much less make them usable as we can with curl.
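The idea of a boundary-aligned split can be sketched as a greedy linear partition over record sizes. This is a simplification for illustration (a true linear partitioning algorithm, as described above, balances the segments optimally rather than greedily), and the function name and sizes are invented:

```python
def partition_at_boundaries(sizes: list[int], n: int) -> list[list[int]]:
    """Group consecutive record sizes into at most n contiguous segments,
    cutting only between records, aiming for roughly equal byte counts."""
    target = sum(sizes) / n
    segments, current, acc = [], [], 0
    for size in sizes:
        # Close the current segment once it reaches the target size,
        # as long as fewer than n - 1 segments have been closed so far.
        if current and acc >= target and len(segments) < n - 1:
            segments.append(current)
            current, acc = [], 0
        current.append(size)
        acc += size
    segments.append(current)
    return segments

# Record sizes: header, then metadata+data per tensor (toy numbers).
# Each of the 2 readers starts exactly at a record boundary:
print(partition_at_boundaries([64, 4128, 544, 4128], 2))
# [[64, 4128, 544], [4128]]
```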

s5cmd provides benchmarks against s3cmd, s4cmd, aws-cli, and goofys, but not curl. Note that despite supporting .s3cfg configuration files for object storage credentials, we do not actually use s3cmd or any of those other utilities during downloads. The best comparison for raw network throughput that you would see from tensorizer is to run curl (or parallel invocations of curl on file ranges) against a presigned object storage URL and into /dev/null, which is very fast.
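A rough version of that comparison might look like the following (the presigned URL is a placeholder, and the elided query string must come from your own object store; `-r` asks curl for a byte range):

```shell
URL='https://example-bucket.s3.amazonaws.com/model.tensors?...'

# Whole-object throughput over a single connection:
curl -sS -o /dev/null -w '%{speed_download} bytes/s\n' "$URL"

# Four parallel range requests covering quarters of a 4 GiB object:
for r in 0-1073741823 1073741824-2147483647 \
         2147483648-3221225471 3221225472-4294967295; do
  curl -sS -r "$r" -o /dev/null "$URL" &
done
wait
```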

wbrown commented 4 months ago

@vrdn-23 Thank you for opening this issue! I hope this answers your question?

vrdn-23 commented 4 months ago

This does! Thanks a lot for the detailed explanation @Eta0 @wbrown! I'll close the issue now!