Azure / azure-storage-cpplite

Lite version of C++ Client Library for Microsoft Azure Storage
MIT License

Parallel upload/download from/to stream #73

Open Jinming-Hu opened 4 years ago

Jinming-Hu commented 4 years ago

We added parallel upload/download from/to a buffer for blob_client in release 0.3.0. However, this may not work well if the buffer is larger than the usable physical memory, due to inefficient page swapping. If you need to upload/download from/to a file or stream, let us know by commenting on this issue so that we can evaluate and consider adding this feature in a future release.
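
For reference, a minimal sketch of the buffer-based parallel upload added in 0.3.0, with client setup as in the repo README; the exact parameter order of upload_block_blob_from_buffer is from memory, so verify it against blob_client.h before copying:

#include <memory>
#include <string>
#include <vector>

// Header paths assumed from the repo's samples.
#include "storage_credential.h"
#include "storage_account.h"
#include "blob/blob_client.h"

int main()
{
    using namespace azure::storage_lite;

    // Client setup as shown in the README.
    auto cred = std::make_shared<shared_key_credential>("ACCOUNT_NAME", "ACCOUNT_KEY");
    auto account = std::make_shared<storage_account>("ACCOUNT_NAME", cred, /* use_https */ true);
    auto client = std::make_shared<blob_client>(account, /* max_concurrency */ 16);

    // Upload a large in-memory buffer with chunk-level parallelism.
    std::vector<char> buffer(64 * 1024 * 1024);  // 64 MiB of data to upload
    auto outcome = client->upload_block_blob_from_buffer(
        "mycontainer", "myblob", buffer.data(),
        /* metadata */ {}, buffer.size(), /* parallelism */ 8).get();
    if (!outcome.success())
    {
        return 1;  // inspect outcome.error() for details
    }
    return 0;
}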

ghost commented 4 years ago

I was using download_blob_to_stream for each blob to download all blobs in a container, but I tried to make downloads faster by first listing all blobs with list_blobs_segmented and then using download_blob_to_buffer for each blob. Both methods seem to be equally fast, and the limitation seems to be the internet connection anyway. Is this expected?

Jinming-Hu commented 4 years ago

With the download_blob_to_stream API, you're downloading with only one thread. But with download_blob_to_buffer, if parallelism is properly configured, you're downloading with multiple threads.
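
For illustration, a rough sketch of the two calls, assuming a std::shared_ptr<blob_client> named client (constructed with max_concurrency) and a known blob_size; signatures are as I remember them from blob_client.h, and this fragment needs <fstream>, <vector>, and <cstdint>:

const uint64_t blob_size = /* size of the blob in bytes */ 120 * 1024;

// download_blob_to_stream: the whole blob goes through a single request/thread.
std::ofstream out("myblob.bin", std::ios::binary);
auto r1 = client->download_blob_to_stream("mycontainer", "myblob", 0, blob_size, out).get();

// download_blob_to_buffer: the blob is split into chunks and up to `parallelism`
// chunks are downloaded concurrently (also bounded by the client's max_concurrency).
std::vector<char> buffer(blob_size);
auto r2 = client->download_blob_to_buffer("mycontainer", "myblob", 0, blob_size, buffer.data(), /* parallelism */ 8).get();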

I don't know how you reached the conclusion that the limitation is the internet, and I can't tell whether you're correct without more details. Can you share the following information?

  1. What's the max_concurrency when constructing blob_client?
  2. What's the parallelism when invoking download_blob_to_buffer?
  3. What's the size of the blobs like?
  4. What speed did you get with those two download APIs?
  5. If you know, what's the bandwidth of your network?

ghost commented 4 years ago

  1. max_concurrency is set to 16
  2. parallelism varied between 1 and 16
  3. The blobs are about 120 kB each
  4. About 60 ms per blob for both APIs
  5. Bandwidth should theoretically max out at 1000 Mbit/s

So my theory is that the blobs are too small to benefit from any parallelism, which is why I see roughly the same download speed with both APIs. But my issue is that downloading all blobs from a container takes about 5 minutes when there are 5000 blobs of 120 kB each (5000 blobs at roughly 60 ms each adds up to about 5 minutes). I wouldn't expect it to take 5 minutes to download 600 MB on a 1000 Mbit/s connection. So I guess what I'm saying is that there should be a faster way to download an entire container, if we could avoid making a request for each blob.

Jinming-Hu commented 4 years ago

@dhollsten Yes, you're right. The blobs are too small to benefit from parallel downloading; download_blob_to_buffer only parallelizes at the chunk level. However, you can still do blob-level parallelism yourself. The code looks like this:

// List up to 5000 blobs in one call; parameters of list_blobs_segmented are
// (container, delimiter, continuation_token, prefix, max_result).
auto list_blobs_outcome = blob_client->list_blobs_segmented(container_name, "", "", "", 5000).get();
auto list_blobs_response = list_blobs_outcome.response();

std::vector<std::future<storage_outcome<void>>> futures;
std::vector<std::ofstream> streams;  // keep each stream alive until its download completes
streams.reserve(list_blobs_response.blobs.size());
for (const auto& blob : list_blobs_response.blobs)
{
    streams.emplace_back(blob.name, std::ios::binary);
    // download_blob_to_stream (or download_blob_to_buffer) returns a std::future immediately,
    // so all downloads run concurrently.
    auto f = blob_client->download_blob_to_stream(container_name, blob.name, 0, blob.content_length, streams.back());
    futures.emplace_back(std::move(f));
}

for (auto& f : futures)
{
    if (!f.get().success())
    {
        // error handling
    }
}

This way, you're downloading multiple blobs in parallel. Note that the effective parallelism is limited by the number of blobs in one list_blobs_response and by the max_concurrency of the blob_client.
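
If the container holds more blobs than a single listing returns, you can keep paging with the continuation token. A rough sketch (the next_marker field name is assumed from the headers, so verify it):

std::string continuation_token;
do
{
    auto outcome = blob_client->list_blobs_segmented(
        container_name, /* delimiter */ "", continuation_token, /* prefix */ "", /* max_result */ 5000).get();
    if (!outcome.success())
    {
        break;  // error handling
    }
    const auto& response = outcome.response();
    for (const auto& blob : response.blobs)
    {
        // kick off the per-blob downloads exactly as in the loop above
    }
    continuation_token = response.next_marker;  // continuation token for the next page (assumed field name)
} while (!continuation_token.empty());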

As for "if we could avoid making a request for each blob": there's no way to do this; each blob has to be fetched with its own request.

ghost commented 4 years ago

Thank you for the sample code, it works great! However, sometimes I instantly get an exception (resource unavailable), probably because of blob storage throttling, when parallelism is set higher than 1. I would have assumed that the retry policy would handle this?

// Retries any retryable failure up to 30 times, waiting a fixed 2 seconds between attempts.
class custom_retry_policy : public azure::storage_lite::retry_policy_base
{
public:
    azure::storage_lite::retry_info evaluate(const azure::storage_lite::retry_context & context) const override
    {
        const int max_retry_count = 30;
        if (context.numbers() <= max_retry_count && azure::storage_lite::retryable(context.result())) {
            return {true, std::chrono::seconds(2)};
        }
        return {false, std::chrono::seconds(0)};
    }
};

Jinming-Hu commented 4 years ago

@dhollsten Hi, to verify whether it's a throttling issue, you can disable retries by using no_retry_policy and check the error code and error message. If the status code is 500 or 503 and the error message is something like "server busy", then you're being throttled.
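
For example, something along these lines, reusing the download variables from earlier (the error fields follow the README's error-handling sample and may differ slightly in your version):

// With no_retry_policy in place, the first failure surfaces immediately and can be inspected.
auto outcome = blob_client->download_blob_to_buffer(
    container_name, blob_name, 0, blob_size, buffer.data(), /* parallelism */ 4).get();
if (!outcome.success())
{
    // A 500/503 status with a "server busy"-style message indicates throttling.
    std::cout << "error code: " << outcome.error().code
              << ", name: " << outcome.error().code_name << std::endl;
}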

Retrying is the right way to handle this; for example, you can implement an exponential backoff retry policy.
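
For instance, a minimal sketch of such a policy, built on the same interfaces as the custom_retry_policy above (retry_policy_base, retry_context::numbers()/result(), retryable()):

#include <algorithm>
#include <chrono>

// Backs off 1s, 2s, 4s, 8s, ... capped at 60s, so throttled requests give the service time to recover.
class exponential_backoff_retry_policy : public azure::storage_lite::retry_policy_base
{
public:
    azure::storage_lite::retry_info evaluate(const azure::storage_lite::retry_context & context) const override
    {
        const int max_retry_count = 10;
        if (context.numbers() <= max_retry_count && azure::storage_lite::retryable(context.result()))
        {
            auto delay = std::chrono::seconds(1) * (1 << std::min<int>(context.numbers(), 6));
            return {true, std::min<std::chrono::seconds>(delay, std::chrono::seconds(60))};
        }
        return {false, std::chrono::seconds(0)};
    }
};

The growing delay gives the service time to recover instead of immediately hitting it again while it's still throttling you.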