Azure / azure-storage-cpp

Microsoft Azure Storage Client Library for C++
http://azure.github.io/azure-storage-cpp
Apache License 2.0

how to improve performance of list_blobs_segmented? #329

Open yxiang92128 opened 4 years ago

yxiang92128 commented 4 years ago

@JinmingHu-MSFT

Is there a way to improve the performance of list_blobs_segmented by passing certain options, or by using an entirely different function that lists a container incrementally, 1000 objects per iteration? Currently the listing takes 3x longer than S3 with the same number of objects in the bucket. See the code snippet I currently have below:

    do
    {
        num_in_progress = 0;

        azure::storage::list_blob_item_segment result;

        // Azure supports a prefix filter as an argument, which is handy
        result = container.list_blobs_segmented(utility::string_t(prefix), true, azure::storage::blob_listing_details::none, max_return, token, azure::storage::blob_request_options(), operation_context());

        // remember token
        token = result.continuation_token();

        for (auto& item : result.results())
        {
          if (item.is_blob())
          {

             // convert Windows FILETIME (100-ns ticks since 1601)
             // to Unix epoch time in milliseconds
             long unsigned int input = item.as_blob().properties().last_modified().to_interval();
             long unsigned int linuxtime_millisecs = input / 10000 - epoch_offset; // epoch_offset: 1601-to-1970 difference, expressed in milliseconds

             num_in_progress++;
          }
          else
          {
             ucout << _XPLATSTR("Directory: ") << item.as_directory().uri().primary_uri().to_string() << std::endl;
          }
        }

        num += num_in_progress;

      // when max_return is 0 we keep looping until the
      // continuation token is empty and grab all items;
      // otherwise we keep the token and return whatever
      // this single list_blobs_segmented call produced
    } while (!token.empty() && max_return == 0);

Any ideas of potential improvement to the above code?

Thanks,

Yang

Jinming-Hu commented 4 years ago

Hi @yxiang92128 , I want to know how you tested the elapsed time. Did you measure the total end-to-end time or just local processing time excluding network round-trip time?

Because I think the network takes most of the e2e time, if the latency from your test client to the AWS server differs from the latency to the Azure server, the comparison doesn't mean much.

yxiang92128 commented 4 years ago

I measured the total time for the same number of objects in the list to come back. I am just wondering if I did something suboptimal in the above code. Thanks.

Jinming-Hu commented 4 years ago

@yxiang92128 I wouldn't consider that a valid test, because the network round-trip time takes most of the total time. If latency to one server is very low and to the other is very high, it's reasonable that you might see a several-fold difference in total time.

Can you also share the latency to both servers?

yxiang92128 commented 4 years ago

Yeah, I understand the network round-trip time varies between systems. I only wanted to confirm that, from my code's side, there is nothing I could do to improve the latency of the "list" operation.

thanks

Jinming-Hu commented 4 years ago

@yxiang92128 I think your code is fine. It's very concise and straightforward; I can't find anything that could be further optimized.

jamwhy commented 3 years ago

@JinmingHu-MSFT @yxiang92128

I found that a list_blobs_segmented call takes 10 to 20 seconds for 1000 items; I have tested this over a number of iterations. Listing the same directory with AzCopy takes about 1.5 seconds. Do you have any insight into this order-of-magnitude difference between list_blobs_segmented and AzCopy?

UPDATE: The problem is the XML parsing and other work in set_postprocess_response. The HTTP request itself returns 5000 items in about 5 seconds, but it takes another 85 seconds to execute set_postprocess_response in cloud_blob_container.cpp (around line 477).