yxiang92128 opened this issue 4 years ago
Hi @yxiang92128, I'd like to know how you measured the elapsed time. Did you measure the total end-to-end time, or just the local processing time excluding network round-trip time?
I'd expect the network to account for most of the end-to-end time. If the latency from your test client to the AWS server differs from the latency to the Azure server, the comparison isn't very meaningful.
I measured the total time for the same number of objects to come back from the listing. I am just wondering whether I did something suboptimal in the above code. Thanks.
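Roughly, the measurement looks like the sketch below: the timer wraps the whole listing loop, so network round trips and client-side processing are counted together (the connection string and container name here are placeholders, not the exact test code):

```cpp
// Minimal timing sketch: measures the total wall-clock time to list every blob
// in a container with the simple list_blobs_segmented overload.
// "<connection-string>" and "mycontainer" are placeholders.
#include <chrono>
#include <iostream>
#include <was/storage_account.h>
#include <was/blob.h>

int main()
{
    auto account = azure::storage::cloud_storage_account::parse(U("<connection-string>"));
    auto container = account.create_cloud_blob_client().get_container_reference(U("mycontainer"));

    auto start = std::chrono::steady_clock::now();

    azure::storage::continuation_token token;
    size_t count = 0;
    do
    {
        auto segment = container.list_blobs_segmented(token);
        count += segment.results().size();
        token = segment.continuation_token();
    } while (!token.empty());

    auto elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start).count();
    std::cout << count << " blobs listed in " << elapsed_ms << " ms" << std::endl;
    return 0;
}
```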
@yxiang92128 I wouldn't consider that a valid test, because the network round-trip time accounts for most of the total time. If the latency to one server is very low and the latency to the other is very high, it's entirely possible to see a severalfold difference in total time.
Can you also share the latency to both servers?
Yeah, I understand that the network round-trip time varies between systems. I only wanted to confirm that there is nothing in my code I could change to improve the latency of the "list" operation.
Thanks
@yxiang92128 I think your code is fine. It's concise and straightforward; I can't find anything that could be further optimized.
@JinmingHu-MSFT @yxiang92128
I found that a list_blobs_segmented call takes 10 to 20 seconds for 1000 items. I have tested this over a number of iterations. Listing the same directory with AzCopy takes about 1.5 seconds. Why is there such a huge difference? Do you have any insight into the order-of-magnitude gap between list_blobs_segmented and AzCopy?
UPDATE: The problem is the XML parsing and other work in set_postprocess_response. It takes about 5 seconds for 5000 items to be returned from the HTTP request, and another 85 seconds to execute set_postprocess_response in cloud_blob_container.cpp (around line 477).
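For anyone who wants to see the split without instrumenting the SDK, one rough approximation from user code is to timestamp when the response comes back via operation_context's response_received callback and compare it with when the call returns. This is only a sketch and only an approximation, since it assumes the callback fires once the service response has been received:

```cpp
// Rough, user-level approximation of how much of one list_blobs_segmented call is
// spent waiting for the HTTP response versus in client-side processing afterwards.
// Assumes operation_context's response_received callback fires when the response
// is received, so the split is approximate.
#include <chrono>
#include <iostream>
#include <was/blob.h>

void time_one_segment(azure::storage::cloud_blob_container& container)
{
    using clock = std::chrono::steady_clock;
    clock::time_point response_time;

    azure::storage::operation_context context;
    context.set_response_received(
        [&response_time](web::http::http_request&, const web::http::http_response&,
                         azure::storage::operation_context)
        {
            response_time = clock::now(); // service response has come back
        });

    azure::storage::blob_request_options options;
    azure::storage::continuation_token token;

    auto start = clock::now();
    auto segment = container.list_blobs_segmented(
        U(""), true, azure::storage::blob_listing_details::none, 5000, token, options, context);
    auto end = clock::now();

    auto ms = [](clock::duration d)
    {
        return std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
    };
    std::cout << "request/response: " << ms(response_time - start) << " ms, "
              << "client-side processing: " << ms(end - response_time) << " ms, "
              << "items: " << segment.results().size() << std::endl;
}
```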
@JinmingHu-MSFT
Is there a way to improve the performance of list_blobs_segmented by passing certain options, or by using an entirely different function to list a container incrementally, 1000 objects per iteration? Currently the listing takes 3X longer than S3 with the same number of objects in the bucket. See the code snippet I currently have below:
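The snippet follows the standard continuation-token pattern; a rough sketch of that pattern (with placeholder identifiers rather than the verbatim code) looks like this:

```cpp
// Sketch of the usual segmented-listing loop: 1000 blobs per round trip, driven
// by the continuation token. "container" is a placeholder cloud_blob_container.
azure::storage::blob_request_options options;
azure::storage::operation_context context;
azure::storage::continuation_token token;
do
{
    auto segment = container.list_blobs_segmented(
        U(""),                                      // prefix: list everything
        true,                                       // flat listing (no virtual directories)
        azure::storage::blob_listing_details::none, // no metadata/snapshots, keeps the XML small
        1000,                                       // max results per call
        token,
        options,
        context);

    for (const auto& item : segment.results())
    {
        if (item.is_blob())
        {
            // consume item.as_blob().name(), size, etc.
        }
    }

    token = segment.continuation_token();
} while (!token.empty());
```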
Any ideas for potential improvements to this approach?
Thanks,
Yang