Azure / azure-sdk-for-java

This repository is for active development of the Azure SDK for Java. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/java/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-java.

[BUG]azure-storage: High Lock Contention for concurrent Blob Uploads at scale #41798

Open saty101 opened 2 weeks ago

saty101 commented 2 weeks ago

Describe the bug We are using the Azure SDK for Java storage APIs to upload a large number of files (blobs) to Azure Blob Storage. To do this, we create multiple BlobClient instances from a single BlobContainerClient, which is cached and shared across threads. Each thread creates its own BlobClient for each blob it needs to upload. However, we have observed high lock contention and performance degradation as the number of uploads increases. During load tests in which the BlobContainerClient was created with a SAS token, we observed high contention at java.util.Hashtable.

image

When running the same load tests with connectionString authorization, we observed the contention at java.text.RuleBasedCollator instead.

image

These two images were taken while making 2 million calls to blob storage, using one single BlobContainerClient and creating 2 million separate BlobClients, one per uploaded blob.

Note: We cannot batch upload all the files to the blob storage service as our use case demands that we make separate requests to the blob service.

Exception or Stack Trace Provided in the JFR screenshots.

To Reproduce Create a 500-byte file of random characters and issue a separate blob request for each upload to a specific blob container, using either connectionString authorization or a SAS token. Create a separate BlobClient for each blob and add a random character to the blobName when making the request. Submit all these tasks to an ExecutorService in which at least 3.5k threads are making the blob calls.

Code Snippet

// Sample code that hammers the blob service with 2 million requests.
// containerName, connectionString, httpClient, numThreads and getRandomString() are defined elsewhere.
BlobContainerClient containerClient = new BlobContainerClientBuilder()
        .containerName(containerName)
        .connectionString(connectionString)
        .httpClient(httpClient) // a dedicated HttpClient created for high throughput
        .buildClient();
ExecutorService executorService = Executors.newFixedThreadPool(numThreads);
for (int i = 0; i < 2_000_000; i++) {
    final int index = i; // a lambda can only capture effectively final locals
    executorService.submit(() -> {
        String blobName = "blob-" + index + ".txt";
        BlobClient blobClient = containerClient.getBlobClient(blobName);

        String data = getRandomString();

        // overwrite = true: our use case demands that we overwrite if the blob name is the same by accident
        blobClient.upload(BinaryData.fromBytes(data.getBytes(StandardCharsets.UTF_8)), true);
    });
}
executorService.shutdown();
executorService.awaitTermination(1, TimeUnit.HOURS);

Expected behavior Intermittent fluctuations in request latency to the blob service are understandable, but contention at the SDK level greatly reduces the number of blobs we can upload per second. Azure documentation says that a storage account can handle 20k requests per second, but this contention appears to be a bottleneck that prevents us from reaching that threshold.

Setup:

Information Checklist Kindly make sure that you have added all the following information above and checked off the required fields; otherwise we will treat the issue as an incomplete report

github-actions[bot] commented 2 weeks ago

@ibrahimrabab @ibrandes @kyleknap @seanmcc-msft

github-actions[bot] commented 2 weeks ago

Thank you for your feedback. Tagging and routing to the team member best able to assist.

alzimmermsft commented 2 weeks ago

Thanks for reporting this, @saty101. I was able to validate this concern with a simpler reproduction.

For the SAS token-based usage, this can be reproduced using a thread pool of 5 thousand threads and 20 million new URL(String) calls. In my testing, Java Flight Recorder reported over a day's worth of lock contention.
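The reproduction code itself was not posted, so the following is only a minimal sketch of what such a test might look like. The class name, the method name, and the placeholder URL are hypothetical, and the thread and call counts in main are deliberately far smaller than the 5 thousand threads and 20 million calls mentioned above.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical harness: many threads constructing URLs from strings at once.
class UrlParseContention {

    // Runs `calls` new URL(String) constructions on a fixed pool of `threads`
    // threads and returns how many completed successfully.
    static long parseConcurrently(int threads, int calls) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong parsed = new AtomicLong();
        for (int i = 0; i < calls; i++) {
            final int index = i; // lambdas may only capture effectively final locals
            pool.submit(() -> {
                try {
                    // Placeholder URL (an assumption); only the constructor call matters here.
                    new URL("https://example.invalid/container/blob-" + index + ".txt?sig=placeholder");
                    parsed.incrementAndGet();
                } catch (MalformedURLException e) {
                    throw new IllegalStateException(e);
                }
            });
        }
        pool.shutdown();
        if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
            throw new IllegalStateException("tasks did not finish in time");
        }
        return parsed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // Scale these up and attach Java Flight Recorder to look for lock contention.
        System.out.println(parseConcurrently(64, 100_000));
    }
}
```

Running this under Java Flight Recorder with larger numbers should show whether the monitor wait time is concentrated in the URL constructor path; the exact figures will depend on the JDK version and hardware.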

For the connection string-based usage, this can be reproduced using a thread pool of 5 thousand threads, a static Collator.getInstance(Locale.ROOT), and 20 million calls to sort a list of strings. In my testing, Java Flight Recorder reported many days' worth of lock contention.
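Again, no code was posted for this case, so here is a minimal sketch under stated assumptions: the class name, the method name, and the list contents are hypothetical, and the counts in main are much smaller than those described above. The shared instance returned by Collator.getInstance is a RuleBasedCollator, whose compare(String, String) method is declared synchronized, so every comparison made during a sort takes the same lock.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical harness: many threads sorting with one shared Collator.
class CollatorSortContention {

    // A single static collator shared by every thread, as in the described reproduction.
    private static final Collator SHARED = Collator.getInstance(Locale.ROOT);

    // Sorts a small list `calls` times on a fixed pool of `threads` threads
    // and returns how many sorts completed.
    static long sortConcurrently(int threads, int calls) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong sorted = new AtomicLong();
        for (int i = 0; i < calls; i++) {
            pool.submit(() -> {
                // Each task gets its own list; only the collator is shared.
                List<String> values = new ArrayList<>(Arrays.asList("pear", "fig", "apple", "cherry"));
                values.sort(SHARED); // every comparison goes through the shared collator
                sorted.incrementAndGet();
            });
        }
        pool.shutdown();
        if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
            throw new IllegalStateException("tasks did not finish in time");
        }
        return sorted.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // Scale these up and attach Java Flight Recorder to look for lock contention.
        System.out.println(sortConcurrently(64, 100_000));
    }
}
```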

I'll look further into how the designs of these two code paths can be changed to reduce the amount of locking that happens.