**Closed**: vrdn-23 closed this issue 4 months ago
Hello! Thanks for the thanks!
Currently, tensorizer uses `curl` for downloads. `curl` is a very mature and optimized HTTP client and generally performs faster than `s5cmd`, and especially so in the way that we use it.
`s5cmd` usually has two advantages:

- batched requests across many objects at once (controlled with `s5cmd --numworkers`), and
- concurrent downloads of segments within a single object (controlled with `s5cmd cp --concurrency`)

For tensorizer's use case, we only need to request a single object during deserialization, since tensorizer module files are not sharded, so batch requests won't help. For concurrency in downloading an individual object, this has been implemented natively in tensorizer since release v2.9.0 and is controlled with the `num_workers` parameter for a `TensorDeserializer`.
Single-object concurrency with the tensorizer format has more nuance than a general-purpose downloader like `s5cmd` can take advantage of. For a general-purpose utility like `s5cmd`, where the only goal is to get a stream of bytes from a server onto the disk, the best strategy is to cut a file into equally sized segments (e.g. 0% to 20%, 20% to 40%, ...) and download those all in parallel. In the tensorizer format, however, there are logical boundaries within a file corresponding to individual tensors and their headers, and cutting into the file at an arbitrary position (like the 20% mark) will most likely yield an unusable stream of data, starting from some random point partway through one of the tensors. Once all the downloads have completed, the file can be reassembled and read as a whole, but until then, the individual parts are almost useless.
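As an illustration (this is a generic sketch, not any particular tool's implementation), the equal-segment strategy amounts to something like this:

```python
def equal_ranges(total_size: int, n: int) -> list[tuple[int, int]]:
    """Split [0, total_size) into n near-equal half-open byte ranges,
    ignoring any logical boundaries inside the file."""
    base, rem = divmod(total_size, n)
    ranges = []
    start = 0
    for i in range(n):
        # The first `rem` segments absorb the remainder, one extra byte each.
        end = start + base + (1 if i < rem else 0)
        ranges.append((start, end))
        start = end
    return ranges
```

Each range would then be fetched in parallel (e.g. with HTTP `Range` requests) and reassembled on disk afterward, regardless of what the bytes at each cut point actually are.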
In contrast, tensorizer is designed for streaming processing, so that all bookkeeping operations (like checksum validation) and transfers to the GPU can happen immediately as each tensor arrives from the network, without buffering the entire file in CPU RAM or on the disk. Streaming combined with concurrent processing of the received data allows for better utilization of system resources, lower latency, and the ability to stream in models much larger than CPU RAM.
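In sketch form, streaming processing means doing the per-chunk bookkeeping as bytes arrive rather than after the whole file lands. This is a simplified stand-in (tensorizer's real pipeline also overlaps GPU transfers, which a checksum alone does not show):

```python
import hashlib

def stream_process(chunks):
    """Consume an iterable of byte chunks as they arrive from the network,
    updating a running checksum instead of buffering the whole file."""
    digest = hashlib.sha256()
    total = 0
    for chunk in chunks:
        digest.update(chunk)  # bookkeeping happens immediately, per chunk
        total += len(chunk)   # a real pipeline would also move tensor data onward here
    return total, digest.hexdigest()
```

Because nothing depends on having the complete file in memory, the same loop works for inputs much larger than CPU RAM.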
In our implementation, we choose an optimally balanced split of the file across $n$ concurrent readers that cuts exactly along logical tensor and header data boundaries using our own linear partitioning algorithm, and begin parallel network transfers on each segment. Starting from those boundaries, we can perform streaming processing in each of the readers simultaneously, without ever needing to reassemble the file, which gives us high throughput coming in from the network and high throughput in getting that data into a usable state on the GPU. By the time all concurrent streams finish downloading, the file is already fully processed. With `s5cmd`, we would not be able to track the positions and progress of separate segments of the file as they are downloading, much less make them usable as we can with `curl`.
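A greatly simplified greedy version of boundary-respecting partitioning looks like the sketch below. Tensorizer's actual linear partitioning algorithm is more sophisticated; the function name and greedy cut rule here are illustrative only:

```python
def partition_tensors(sizes: list[int], n: int) -> list[list[int]]:
    """Split an ordered list of tensor byte sizes into at most n contiguous
    segments, cutting only at tensor boundaries while aiming for balanced
    bytes per segment. Greedy sketch, not tensorizer's exact algorithm."""
    target = sum(sizes) / n  # ideal share of bytes per concurrent reader
    segments: list[list[int]] = [[]]
    acc = 0
    for size in sizes:
        # Open a new segment once the current one has its fair share,
        # as long as there are readers left to assign.
        if segments[-1] and acc >= target and len(segments) < n:
            segments.append([])
            acc = 0
        segments[-1].append(size)
        acc += size
    return segments
```

Every cut lands on a tensor boundary, so each reader can start parsing and streaming its segment immediately, with no reassembly step.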
`s5cmd` provides benchmarks against `s3cmd`, `s4cmd`, `aws-cli`, and `goofys`, but not `curl`. Note that despite supporting `.s3cfg` configuration files for object storage credentials, we do not actually use `s3cmd` or any of those other utilities during downloads. The best comparison for the raw network throughput that you would see from tensorizer is to run `curl` (or parallel invocations of `curl` on file ranges) against a presigned object storage URL and into `/dev/null`, which is very fast.
@vrdn-23 Thank you for opening this issue! I hope this answers your question?
This does! Thanks a lot for the detailed explanation @Eta0 @wbrown! I'll close the issue now!
First of all, thanks for the awesome project! I was just curious whether there is an option or a plan to support downloading the model files via s5cmd, which would allow us to obtain even higher network throughput?