Azure / azure-storage-java

Microsoft Azure Storage Library for Java
https://docs.microsoft.com/en-us/java/api/overview/azure/storage
MIT License

Content-MD5 is missing for block upload in Azure portal #495

Closed: Rameshkubendran closed this issue 4 years ago

Rameshkubendran commented 5 years ago

Which service (blob, file, queue, table) does this issue concern?

There is an issue in the blob service.

Which version of the SDK was used?

We are using Java 8 (azure-storage 7.0.0; see details below).

Please note that if your issue is with v11, we recommend customers either move back to v10 or move to v12 (currently in preview) if at all possible. Hopefully this resolves your issue, but if there is some reason why moving away from v11 is not possible at this time, please do continue to ask your question and we will do our best to support you. The README for this SDK has been updated to point to more information on why we have made this decision.

What problem was encountered?

Content-MD5 is missing in the Azure portal when we upload a big file to a blob in blocks/chunks. It looks like Azure does not set Content-MD5 by default for block uploads the way it does for single uploads.

Since Content-MD5 is not set, we get the exception "Blob has mismatch (integrity check failed), Expected value is m5hM3x8grCYBgNAue/RYnA==, retrieved CMWQgUAgrLKtUYC3VLD+hw==" when downloading/reading content from the blob, because we need to validate content integrity while downloading.
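
For reference, a minimal sketch of the kind of download-side check we perform (the method name and the use of DigestOutputStream are illustrative, not our exact code; CloudBlockBlob.download() and getContentMD5() are from the azure-storage v7 SDK):

```java
import com.microsoft.azure.storage.StorageException;
import com.microsoft.azure.storage.blob.CloudBlockBlob;

import java.io.IOException;
import java.io.OutputStream;
import java.security.DigestOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

public class BlobIntegrityCheck {

    // Download the blob while hashing the bytes, then compare the local
    // digest against the Content-MD5 stored in the blob properties.
    static void validateOnDownload(CloudBlockBlob blob, OutputStream destination)
            throws StorageException, IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (DigestOutputStream hashing = new DigestOutputStream(destination, md)) {
            blob.download(hashing); // download() also refreshes blob.getProperties()
        }
        String expected = blob.getProperties().getContentMD5(); // null when never set
        String actual = Base64.getEncoder().encodeToString(md.digest());
        if (expected == null || !expected.equals(actual)) {
            throw new IOException("Blob has mismatch (integrity check failed). Expected value is "
                    + expected + ", retrieved " + actual);
        }
    }
}
```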

Other details:
Version: azure-storage 7.0.0
Language: Java 8

Have you found a mitigation/solution?

Option 1: We generate the Content-MD5 for the entire file on our end and set it on the blob properties before uploading the file. It works as expected, but we get an out-of-memory issue when the file size is big.

MD5 code snippet:

```java
import com.microsoft.azure.storage.core.Base64;
import org.apache.commons.io.IOUtils;
import java.security.MessageDigest;

// blobContentInputStream is an InputStream over the whole file
byte[] blobContentBytes = IOUtils.toByteArray(blobContentInputStream);

// Generate the MD5 of the blob content.
MessageDigest md = MessageDigest.getInstance("MD5");
md.reset();
md.update(blobContentBytes);

// Encode the digest using Base64 encoding.
String base64EncodedMD5content = Base64.encode(md.digest());

// Set the MD5 on the blob properties (fileInBlob is a CloudBlockBlob).
fileInBlob.getProperties().setContentMD5(base64EncodedMD5content);
```

Option 2: To make Azure calculate and store the Content-MD5 internally, we tried enabling StoreBlobContentMD5 and UseTransactionalContentMD5 in BlobRequestOptions, but neither combination worked for us.

Approach 1: enable both options.

```java
BlobRequestOptions b = new BlobRequestOptions();
b.setStoreBlobContentMD5(true);
b.setUseTransactionalContentMD5(true);
// fileInBlob is a CloudBlockBlob
fileInBlob.uploadBlock(blockIdEncoded, contentInputStream, contentInputStream.available(), accessCondition, b, null);
```

Approach 2: keep StoreBlobContentMD5 but disable UseTransactionalContentMD5.

```java
BlobRequestOptions b = new BlobRequestOptions();
b.setStoreBlobContentMD5(true);
b.setUseTransactionalContentMD5(false);
// fileInBlob is a CloudBlockBlob
fileInBlob.uploadBlock(blockIdEncoded, contentInputStream, contentInputStream.available(), accessCondition, b, null);
```

Clarification: How can we make Azure calculate Content-MD5 internally while uploading a big file as blocks, the same as it does for a single upload? Note: We are fine with generating the MD5 for the whole file, since we validate the whole file while downloading rather than each block.

Thanks, Ramesh Kubendran

jaschrep-msft commented 5 years ago

If I understand correctly, you are trying to:

  1. Upload a large file in one shot to a block blob with the SDK using BlobOutputStream
  2. Update a single block in the large block blob using stageBlock()/commitBlockList()
  3. Download the entire updated blob with a check on the MD5 (this is what's failing)

Is this correct?

Rameshkubendran commented 5 years ago

I am not uploading the entire file in a single shot. Since it is a big file, I am uploading it in blocks/chunks. Let me ask my question in a different way...

Issue: the Content-MD5 value is missing from the Azure blob properties when we upload a file as blocks/chunks (not as a single upload/shot). So how can we resolve this issue?

Code reference:

```java
// Upload full 100 MB blocks while more than 100 MB of input remains.
while (contentInputStream.available() > 100 * 1024 * 1024) {
    blockIdEncoded = Base64.getEncoder().encodeToString(
            String.format("%05d", blockNum).getBytes(Charset.forName(ENCODING_TYPE)));
    fileInBlob.uploadBlock(blockIdEncoded, contentInputStream, 100 * 1024 * 1024, accessCondition, null, null);
    blockList.add(new BlockEntry(blockIdEncoded));
    blockNum++;
}

// Upload the final (partial) block.
blockIdEncoded = Base64.getEncoder().encodeToString(
        String.format("%05d", blockNum).getBytes(Charset.forName(ENCODING_TYPE)));
fileInBlob.uploadBlock(blockIdEncoded, contentInputStream, contentInputStream.available(), accessCondition, null, null);
blockList.add(new BlockEntry(blockIdEncoded));

// Commit the block list to finalize the blob.
fileInBlob.commitBlockList(blockList, accessCondition, null, null);
```

jaschrep-msft commented 4 years ago

Apologies for the delay.

Content-MD5 is only stored by the service; you cannot get it to calculate the MD5 for you.* Your option 1 was the correct approach: calculate the MD5 locally and set the property.

Regarding out-of-memory exceptions when the file is large: you do not need the entire file in memory to compute its MD5. The MessageDigest class was designed to consume arbitrary amounts of data incrementally. The javadocs include a code example that calls update() multiple times to produce a single MD5, and you can do so arbitrarily many times. (That sample also demonstrates the clone() functionality available on certain hash algorithms, but you do not need it for this use case.) See the sketch below.
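
For illustration, a minimal sketch of that incremental pattern (the buffer size and file path are placeholder choices, not requirements):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

public class StreamingMd5 {

    // Computes an MD5 over a stream in fixed-size chunks, so the whole
    // file never has to fit in memory at once.
    static String md5Base64(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[8 * 1024];
        int read;
        while ((read = in.read(buffer)) != -1) {
            md.update(buffer, 0, read); // update() may be called any number of times
        }
        return Base64.getEncoder().encodeToString(md.digest());
    }

    public static void main(String[] args) throws Exception {
        // "bigfile.bin" is a placeholder path for illustration.
        try (InputStream in = new FileInputStream("bigfile.bin")) {
            System.out.println(md5Base64(in));
        }
    }
}
```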

Does this resolve your issue?

*If your blob is beneath a certain size threshold, the service will allow this on single-shot uploads. I believe that threshold is in the tens of megabytes.
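
For completeness, a sketch of that single-shot path (setStoreBlobContentMD5(), setSingleBlobPutThresholdInBytes(), and upload() are azure-storage v7 APIs; the 32 MB value shown is the SDK's default single-put threshold, used here purely for illustration):

```java
import com.microsoft.azure.storage.blob.BlobRequestOptions;
import com.microsoft.azure.storage.blob.CloudBlockBlob;

import java.io.FileInputStream;

public class SingleShotUpload {

    // Uploads a small blob in one request so the SDK can compute and
    // store Content-MD5 for the whole blob.
    static void uploadSmallBlob(CloudBlockBlob blob, FileInputStream stream, long length)
            throws Exception {
        BlobRequestOptions opts = new BlobRequestOptions();
        // Ask the client to compute and store Content-MD5 for the blob.
        opts.setStoreBlobContentMD5(true);
        // Uploads at or below this threshold go as a single Put Blob request.
        opts.setSingleBlobPutThresholdInBytes(32 * 1024 * 1024);
        blob.upload(stream, length, null /* access condition */, opts, null /* op context */);
    }
}
```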

Rameshkubendran commented 4 years ago

Thank you. I am generating the MD5 explicitly (locally) and setting it on the blob properties. It's working as expected.

Regarding the OOM, I am using a DigestInputStream instead of dealing with the input stream directly. It's working fine.
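
A minimal sketch of that pattern, combined with the block-upload loop shown earlier (the method name, the 4 MB block size, and relying on available() for a FileInputStream source are illustrative assumptions; uploadBlock() and commitBlockList() are the same v7 calls used above, and commitBlockList() sends the blob properties, including the Content-MD5 set here):

```java
import com.microsoft.azure.storage.StorageException;
import com.microsoft.azure.storage.blob.BlockEntry;
import com.microsoft.azure.storage.blob.CloudBlockBlob;

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

public class BlockUploadWithMd5 {

    private static final int BLOCK_SIZE = 4 * 1024 * 1024; // illustrative block size

    // Uploads the file in blocks while a DigestInputStream hashes every byte
    // in passing, then stores the whole-file MD5 on the blob properties so
    // commitBlockList() persists it as Content-MD5.
    static void uploadWithWholeFileMd5(CloudBlockBlob fileInBlob, FileInputStream source)
            throws IOException, NoSuchAlgorithmException, StorageException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        List<BlockEntry> blockList = new ArrayList<>();
        int blockNum = 0;

        try (DigestInputStream in = new DigestInputStream(source, md)) {
            while (in.available() > 0) { // fine for a FileInputStream source
                int length = Math.min(BLOCK_SIZE, in.available());
                String blockId = Base64.getEncoder().encodeToString(
                        String.format("%05d", blockNum).getBytes(StandardCharsets.UTF_8));
                fileInBlob.uploadBlock(blockId, in, length, null, null, null);
                blockList.add(new BlockEntry(blockId));
                blockNum++;
            }
        }

        // The digest now covers the whole file; set it before committing.
        fileInBlob.getProperties().setContentMD5(
                Base64.getEncoder().encodeToString(md.digest()));
        fileInBlob.commitBlockList(blockList);
    }
}
```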

rickle-msft commented 4 years ago

I am going to close this issue as it seems all discussed issues have been resolved. @Rameshkubendran please feel free to comment here further or open another issue if you need further support.

rdp commented 2 years ago

OK, I put a write-up of what I believe is possible with Azure MD5 checking here, in case it is useful for followers: https://stackoverflow.com/a/69319211/32453