Open Jinming-Hu opened 4 years ago
I was using `download_blob_to_stream` for each blob to download all blobs in a container, then tried to make downloads faster by first listing all blobs with `list_blobs_segmented` and then using `download_blob_to_buffer` for each blob. Both methods seem equivalent in speed, and the limitation seems to be the internet anyway. Is this expected?
With the `download_blob_to_stream` API, you're downloading with only one thread. But with `download_blob_to_buffer`, if parallelism is properly configured, you're downloading with multiple threads.
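To illustrate what chunk-level parallelism means, here is a minimal stand-alone sketch; `fetch_range` is a hypothetical stand-in for one ranged GET against the service, not the library's actual implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for one ranged GET against the blob service.
std::string fetch_range(const std::string& remote, std::size_t offset, std::size_t len)
{
    return remote.substr(offset, len);
}

// Download `remote` into `buffer` with `parallelism` concurrent range reads,
// analogous to how chunk-level parallelism splits one blob into pieces.
void parallel_download(const std::string& remote, char* buffer, int parallelism)
{
    const std::size_t chunk = (remote.size() + parallelism - 1) / parallelism;
    std::vector<std::future<void>> futures;
    for (std::size_t off = 0; off < remote.size(); off += chunk)
    {
        const std::size_t len = std::min(chunk, remote.size() - off);
        futures.emplace_back(std::async(std::launch::async, [&remote, buffer, off, len] {
            const std::string part = fetch_range(remote, off, len);
            std::memcpy(buffer + off, part.data(), part.size());
        }));
    }
    for (auto& f : futures)
    {
        f.get(); // propagate any exception from the chunk
    }
}
```

With one thread you pay the per-range latency serially; with several, the ranges overlap, which only helps when the blob is large enough to split into meaningful chunks.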
I don't know how you reached the conclusion that the limitation is the internet, and I can't tell whether you're correct without detailed information. So can you share the following info?
- `max_concurrency` when constructing `blob_client`?
- `parallelism` when invoking `download_blob_to_buffer`?

`max_concurrency` is set to 16. `parallelism` varied between 1 and 16.

So my theory is that the blob sizes are too small to benefit from any parallelism, which is why I see roughly the same download speeds. But my issue is that downloading all blobs from a container takes about 5 minutes when I have 5000 blobs of 120 kB each. I wouldn't expect it to take 5 minutes to download 600 MB on a 1000 Mbit/s connection. So I guess what I'm saying is that there should be a faster way to download an entire container, if we could avoid making a request for each blob.
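As a sanity check on those numbers (treating the reported ~5 minutes as 300 s): the pure wire time for 600 MB at 1000 Mbit/s is under 5 s, so the run is dominated by per-request overhead of roughly 60 ms per blob. A quick sketch:

```cpp
#include <cassert>

// Back-of-the-envelope check: 5000 blobs * 120 kB on a 1000 Mbit/s link.
double ideal_seconds(double blobs, double blob_bytes, double link_bits_per_s)
{
    return blobs * blob_bytes * 8.0 / link_bits_per_s; // pure transfer time
}

double per_blob_overhead_ms(double observed_seconds, double blobs)
{
    return observed_seconds / blobs * 1000.0; // latency implied per request
}
```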
@dhollsten Yes, you're right. The size is too small to benefit from parallel downloading. `download_blob_to_buffer` does chunk-level parallelism. However, you can still do blob-level parallelism yourself. The code looks like this:
```cpp
// Parameters are sketched; fill them in to match list_blobs_segmented's full signature.
auto list_blobs_response = blob_client->list_blobs_segmented(
    container_name, /* delimiter */ "", /* continuation_token */ "", /* prefix */ "",
    /* max_results */ 5000).get().response();
std::vector<std::future<storage_outcome<void>>> futures;
for (const auto& blob : list_blobs_response.blobs)
{
    // download_blob_to_stream or download_blob_to_buffer, whichever fits your destination
    auto f = blob_client->download_blob_to_stream(/* ... */);
    futures.emplace_back(std::move(f));
}
for (auto& f : futures)
{
    if (!f.get().success())
    {
        // error handling
    }
}
```
This way, you're downloading multiple blobs in parallel. Note that the parallelism is limited by the number of blobs in one `list_blobs_response` and by the `max_concurrency` of the `blob_client`.
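If launching one future per blob causes too large a burst, the loop can also be bounded by hand. A minimal sketch, where `bool` stands in for `storage_outcome<void>` and `make_task(i)` is a hypothetical stand-in for one `download_blob_to_buffer` call:

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <future>
#include <vector>

// Launch at most `window` downloads at a time; wait for each batch to finish
// before starting the next, so the request burst stays bounded.
// Returns the number of failed tasks.
int run_in_batches(int total, int window, const std::function<std::future<bool>(int)>& make_task)
{
    int failures = 0;
    for (int start = 0; start < total; start += window)
    {
        std::vector<std::future<bool>> batch;
        const int end = std::min(start + window, total);
        for (int i = start; i < end; ++i)
        {
            batch.emplace_back(make_task(i)); // e.g. wrap blob_client->download_blob_to_buffer(...)
        }
        for (auto& f : batch)
        {
            if (!f.get())
            {
                ++failures; // error handling / retry would go here
            }
        }
    }
    return failures;
}
```

Trading a little throughput for a bounded window like this is one way to stay under service concurrency limits.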
> if we could avoid making a request for each blob.

There's no way to do this; each blob has to be downloaded with its own request.
Thank you for the sample code, it works great! However, sometimes I instantly get an exception (resource unavailable), probably because of blob storage throttling, when I have `parallelism` set higher than 1. I would have assumed that the retry policy should handle this?
```cpp
class custom_retry_policy : public azure::storage_lite::retry_policy_base
{
public:
    azure::storage_lite::retry_info evaluate(const azure::storage_lite::retry_context& context) const override
    {
        const int max_retry_count = 30;
        if (context.numbers() <= max_retry_count && azure::storage_lite::retryable(context.result()))
        {
            return {true, std::chrono::seconds(2)};
        }
        return {false, std::chrono::seconds(0)};
    }
};
```
@dhollsten Hi, to verify whether it's a throttling issue, you can disable retries with `no_retry_policy` and check the error code and error message. If the status code is 500 or 503 and the error message is something like "server busy", then you're being throttled. Retrying is a good way to resolve this; for example, you can implement an exponential-backoff retry policy.
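A sketch of the delay schedule such an exponential-backoff policy could use; the base and cap values here are illustrative assumptions, not anything the library prescribes:

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Exponential backoff: 1 s, 2 s, 4 s, ..., capped at 60 s.
// Production code would usually also add random jitter to the delay.
std::chrono::seconds backoff_delay(int attempt) // attempt is 1-based
{
    const std::chrono::seconds base(1);
    const std::chrono::seconds cap(60);
    const int shift = std::min(attempt - 1, 10); // clamp so the shift can't overflow
    return std::min(base * (1 << shift), cap);
}
```

A custom policy like the one above could then return `{true, backoff_delay(context.numbers())}` from `evaluate` while the result is retryable.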
We have added parallel upload/download from/to a buffer for `blob_client` in release 0.3.0. However, this may not function well when the buffer is larger than usable physical memory, due to inefficient page swapping. If you need upload/download from/to a file or stream, you can let us know by commenting on this issue, so that we can evaluate and consider adding the feature in a future release.