[BUG]azure-storage: High Lock Contention for concurrent Blob Uploads at scale

saty101 commented 2 months ago

Describe the bug

We are using the Azure Java SDK storage APIs to upload a large number of files (blobs) to Azure Blob Storage. To do this, we create multiple BlobClient instances from a single BlobContainerClient, which is cached and shared across threads. Each thread creates its own BlobClient for each blob it needs to upload. However, we have observed high lock contention and performance degradation as the number of uploads increases. During load tests in which the BlobContainerClient was created using SAS tokens, we observed high contention at java.util.Hashtable.

When running the same load tests with connection string authorization, we observed the contention at java.text.RuleBasedCollator.

These two images (JFR screenshots) were taken while making 2 million calls to Blob Storage through one single BlobContainerClient, creating 2 million separate BlobClients that each upload a single blob.

Note: We cannot batch upload all the files to the blob storage service as our use case demands that we make separate requests to the blob service.

Exception or Stack Trace

Provided in the JFR screenshots.

To Reproduce

Create a file of 500 random characters (500 bytes) and upload it with separate blob requests to a specific blob container, using either connection string authorization or SAS tokens. Create a separate BlobClient for each blob and use a random character for the blob name when making the request. Submit all these tasks to an ExecutorService in which at least 3.5k threads are making the blob calls.

Code Snippet

// Sample code that hammers the blob service with 2 million requests
BlobContainerClient containerClient = new BlobContainerClientBuilder()
        .containerName(containerName)
        .connectionString(connectionString)
        .httpClient(httpClient) // a dedicated client created for high throughput
        .buildClient();
ExecutorService executorService = Executors.newFixedThreadPool(numThreads);
for (int i = 0; i < 2_000_000; i++) {
    final int index = i; // a lambda can only capture effectively final locals
    executorService.submit(() -> {
        String blobName = "blob-" + index + ".txt";
        BlobClient blobClient = containerClient.getBlobClient(blobName);

        String data = getRandomString();

        // our use case demands that we overwrite if the blob name is the same by accident
        blobClient.upload(BinaryData.fromBytes(data.getBytes(StandardCharsets.UTF_8)), true);
    });
}
executorService.shutdown();
executorService.awaitTermination(1, TimeUnit.HOURS);

Expected behavior

Intermittent fluctuations in request latency to the blob service are understandable, but contention at the SDK level greatly reduces the number of blobs uploaded per second. Azure documentation says that a storage account can handle 20k requests per second, but this contention seems to be a bottleneck that prevents us from reaching that threshold.

Setup:

OS: Linux
IDE: IntelliJ
Library/Libraries: com.azure:azure-storage-blob:12.25.3
Java version: JDK 17

Information Checklist

Kindly make sure that you have added all the following information above and checked off the required fields; otherwise we will treat the issue as an incomplete report.

[x] Bug Description Added
[x] Repro Steps Added
[x] Setup information Added

github-actions[bot] commented 2 months ago

@ibrahimrabab @ibrandes @kyleknap @seanmcc-msft

github-actions[bot] commented 2 months ago

Thank you for your feedback. Tagging and routing to the team member best able to assist.

alzimmermsft commented 2 months ago

Thanks for reporting this @saty101, I was able to validate this concern with a simpler reproduction.

For the SAS token-based usage, this can be reproduced using a thread pool of 5 thousand threads and 20 million new URL(String) calls. In my testing, Java Flight Recorder reported over a day's worth of lock contention.
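A minimal sketch of that SAS-side reproduction might look like the following. It is scaled down from the 5 thousand threads and 20 million calls mentioned above so that it finishes quickly; the class name, the placeholder blob URL, and the chosen sizes are assumptions for illustration. No network traffic is involved, since only the URL constructor is exercised. Run it under Java Flight Recorder to see where threads block.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical class name; new URL(String) is deprecated on recent JDKs but still available.
@SuppressWarnings("deprecation")
public class UrlContentionRepro {

    // Constructs `calls` URLs on a fixed pool of `threads` threads and
    // returns how many were parsed successfully.
    static long run(int threads, int calls) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong parsed = new AtomicLong();
        for (int i = 0; i < calls; i++) {
            final int index = i;
            pool.submit(() -> {
                try {
                    // Placeholder URL (assumption); only parsing happens here.
                    new URL("https://example.blob.core.windows.net/container/blob-"
                            + index + ".txt?sig=placeholder");
                    parsed.incrementAndGet();
                } catch (MalformedURLException e) {
                    throw new IllegalStateException(e);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.MINUTES)) {
                throw new IllegalStateException("URL tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return parsed.get();
    }

    public static void main(String[] args) {
        // Scaled-down sizes (assumption); raise them to approach the numbers in the thread.
        long parsed = run(64, 200_000);
        System.out.println("Parsed URLs: " + parsed);
    }
}
```

How much contention shows up will depend on the JDK version and the hardware, so the scaled-down sizes may need to be raised before the lock becomes visible in a recording.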

For the connection string-based usage, this can be reproduced using a thread pool of 5 thousand threads, a static Collator.getInstance(Locale.ROOT), and 20 million calls to sort a list of strings. In my testing, Java Flight Recorder reported many days' worth of lock contention.
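The connection string case can be sketched the same way: one static Collator shared by every task, each of which sorts a small list of strings. The class name, the header-like sample strings, and the scaled-down sizes are assumptions for illustration and are not taken from the SDK source.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical class name used only for this sketch.
public class CollatorContentionRepro {

    // A single collator instance shared by all threads, as described in the thread.
    private static final Collator SHARED = Collator.getInstance(Locale.ROOT);

    // Sample strings (assumption) standing in for whatever the caller sorts.
    static final List<String> SAMPLE =
            Arrays.asList("x-ms-version", "x-ms-date", "x-ms-blob-type");

    // Returns a sorted copy of the input using the shared collator.
    static List<String> sortWithSharedCollator(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(SHARED);
        return copy;
    }

    // Runs `calls` sort tasks on `threads` threads and returns how many completed.
    static long run(int threads, int calls) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong completed = new AtomicLong();
        for (int i = 0; i < calls; i++) {
            pool.submit(() -> {
                sortWithSharedCollator(SAMPLE);
                completed.incrementAndGet();
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.MINUTES)) {
                throw new IllegalStateException("sort tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return completed.get();
    }

    public static void main(String[] args) {
        // Scaled-down sizes (assumption); raise them to approach the numbers in the thread.
        System.out.println("Completed sorts: " + run(64, 200_000));
    }
}
```

Every comparison goes through the same collator object, so a Java Flight Recorder session over a larger run should show threads waiting on it, in line with the RuleBasedCollator contention reported above.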

I'll look further into how the designs of these two code paths can be changed to reduce the amount of locking.
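For illustration only, one common way to avoid sharing a single Collator is to give each thread its own instance through a ThreadLocal. This is a sketch of a general technique and is not necessarily the change that was made in the SDK; the class and method names are hypothetical.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; not part of the Azure SDK.
public class PerThreadCollator {

    // Each thread lazily obtains its own Collator, so comparisons made by
    // different threads do not synchronize on one shared object.
    private static final ThreadLocal<Collator> COLLATOR =
            ThreadLocal.withInitial(() -> Collator.getInstance(Locale.ROOT));

    // Returns a sorted copy of the input using the calling thread's collator.
    static List<String> sort(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(COLLATOR.get());
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(sort(List.of("x-ms-version", "x-ms-date", "x-ms-blob-type")));
    }
}
```

The trade-off is one Collator per thread held for the lifetime of that thread, which is usually acceptable for a bounded pool but is worth keeping in mind with thousands of threads.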

saty101 commented 1 month ago

Hey @alzimmermsft is there any update regarding this bug?

alzimmermsft commented 1 month ago

Hi @saty101, a change was made in azure-core 1.53.0 to circumvent the Hashtable lookup during URL creation in UrlBuilder. A change to reduce contention in StorageSharedKeyCredential is currently only available in the preview release 12.28.0-beta.1, which you can try out to check whether it resolves your issue until it is released as GA.

saty101 commented 1 month ago

Hey @alzimmermsft, I tested my use case with the preview release 12.28.0-beta.1 and it definitely showed better performance for large-scale PUT operations to Blob Storage. Just wondering when we can expect this in a GA release?

alzimmermsft commented 1 month ago

@saty101, there is a GA release, which should include this fix, planned for some time in the middle of November.

alzimmermsft commented 2 weeks ago

This should be available in azure-storage-blob 12.29.0.

Azure / azure-sdk-for-java

[BUG]azure-storage: High Lock Contention for concurrent Blob Uploads at scale #41798