Azure / azure-sdk-for-net

This repository is for active development of the Azure SDK for .NET. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/dotnet/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-net.

Blob ContentHash appears not to be set/read for large files #17676

Closed NeilMacMullen closed 3 years ago

NeilMacMullen commented 3 years ago

Describe the bug Some files in my container appear to have an empty BlobItem.Properties.ContentHash. I assume the problem occurs on upload, but it's possible the issue is with reading files. Small files do not exhibit the problem; large files do. The threshold appears to be around 50K. It's possible the problem is actually a side-effect of specifying an InitialTransferLength and MaximumTransferSize of 64K for upload/download in the StorageTransferOptions structure.

Note there is a similar issue at https://github.com/Azure/azure-sdk-for-net/issues/14037 but this describes a different mechanism.

Expected behavior I would expect the MD5 of a blob always to be available (why would you ever NOT want it set?) Even if not set by the client on upload it should be calculated by the server when storing the blob.

Actual behavior (include Exception or Stack Trace)

When reading back the blob properties using BlobContainerClient.GetBlobsAsync, the returned BlobItem entries contain an empty array for ContentHash when the file is larger than some threshold (64K? MaximumTransferSize?)

To Reproduce

Blobs are stored using this code...

        public async Task StoreBlob(string containerName, string blobName, ImmutableArray<byte> bytes)
        {
            // Helper (not shown) that returns a BlockBlobClient for this container/blob.
            BlockBlobClient blobReference = GetBlobAsync(containerName, blobName);

            using var stream = new MemoryStream(bytes.ToArray());
            var options = new BlobUploadOptions
            {
                TransferOptions = TransferOptions
            };
            await blobReference.UploadAsync(stream, options).ConfigureAwait(false);
        }

Blobs are listed using this code ...

        public async Task<BlobItem[]> ListBlobs(string containerName)
        {
            // Helper (not shown) that returns a BlobContainerClient for this container.
            BlobContainerClient client = FindContainer(containerName);

            var blobs = new List<BlobItem>();

            await foreach (BlobItem blob in client.GetBlobsAsync().ConfigureAwait(false))
                blobs.Add(blob);

            // Note: the issue exhibits itself as empty Properties.ContentHash fields
            // for the BlobItems describing large blobs.

            return blobs.ToArray();
        }

where

        private static readonly StorageTransferOptions TransferOptions =
            new StorageTransferOptions
            {
                MaximumConcurrency = 20,
                InitialTransferLength = 0x10000,
                MaximumTransferSize = 0x10000
            };

Environment: Azure.Storage.Blobs 12.7.0, Windows 10, .NET Core 3.0

jsquire commented 3 years ago

Thank you for your feedback. Tagging and routing to the team best able to assist. Please be aware that due to the holidays, responses may be delayed.

ghost commented 3 years ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @xgithubtriage.

seanmcc-msft commented 3 years ago

Hi @NeilMacMullen, this is an artifact of how the SDK uploads different sizes of data.

There are two ways to upload blob data into the storage service: Put Blob and Put Block / Put Block List. When a blob is created with Put Blob, the service calculates the Content MD5, and the blob is essentially immutable. With Put Block and Put Block List, the service doesn't calculate the MD5: since new blocks can be added and existing blocks can be re-arranged, the service would have to stream the entire content of the blob each time it was modified in order to calculate it.

The SDK uses Put Blob to upload blobs smaller than 256 MB. If the blob data is larger than 256 MB, or if InitialTransferLength is specified, Put Block and Put Block List are used instead.
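
A quick way to see which path a given blob took is to read its properties back and check whether an MD5 is present. A minimal sketch (the container client and blob name here are illustrative, not taken from the code above):

    // Sketch: check whether the service has an MD5 stored for a given blob.
    // "container" is assumed to be an existing, authenticated BlobContainerClient;
    // the blob name is illustrative.
    BlobClient blob = container.GetBlobClient("large-file.bin");
    BlobProperties props = await blob.GetPropertiesAsync().ConfigureAwait(false);

    // Blobs committed with Put Blob carry a service-calculated MD5;
    // blobs committed with Put Block / Put Block List generally do not.
    bool hasMd5 = props.ContentHash != null && props.ContentHash.Length > 0;
    Console.WriteLine($"{blob.Name}: ContentHash present = {hasMd5}");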

This behavior is unlikely to change in the future. Please re-open if you have further questions.

-Sean

NeilMacMullen commented 3 years ago

@seanmcc-msft Thanks for the explanation - if I have understood correctly then the workaround is simply to leave InitialTransferLength at its default value?

I appreciate the SDK is a work in progress but the design here appears "less than optimal". As far as I'm concerned as a user, I've just called "Upload", and in some circumstances I get a hash code and in others I don't, even when using exactly the same client code! I can't see how anyone would think this lack of consistency is a defensible API design. (This also shows in the linked issue, where you either get or don't get a checksum depending on undocumented characteristics of the source stream.)

The SDK uses Put Blob to upload blobs smaller than 256 MB. If the blob data is larger than 256 MB, or if InitialTransferLength is specified, Put Block and Put Block List are used instead.

On the face of it this seems... not good. Based on some opaque and undiscoverable criteria (to the user), the SDK is not only deciding whether or not a checksum will be generated but also changing the semantics of the blob I'm storing. One minute I think I'm storing an immutable blob, and then, simply by setting a parameter in a structure that I thought was tuning the upload rate, I've made it modifiable in the future!

My strong suggestion would be to make the behaviour less flexible and more predictable. If the checksum can't be reliably supplied by the server then make it explicit in the API that the client has to supply it.

FWIW the reason that I care about the checksum is that we treat a container as a virtual file-store. We have code for synchronising subsets of that file-store to a local machine or another file-store. Being able to rely on the presence of a checksum is obviously a huge win when doing this since we can avoid transferring blobs which are already present in the target.
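
As a rough illustration of that sync pattern only (a sketch; `localFolder`, `remoteBlobs` as a List<BlobItem>, and `UploadFileAsync` are illustrative names, not code from this project):

    // Skip uploading local files whose MD5 already matches the ContentHash
    // recorded on the matching blob. Requires System.IO, System.Linq and
    // System.Security.Cryptography.
    foreach (var localPath in Directory.EnumerateFiles(localFolder))
    {
        byte[] localMd5;
        using (var md5 = MD5.Create())
        using (var file = File.OpenRead(localPath))
            localMd5 = md5.ComputeHash(file);

        var remote = remoteBlobs.FirstOrDefault(b => b.Name == Path.GetFileName(localPath));

        // Only meaningful when the blob actually has an MD5 recorded.
        bool upToDate = remote?.Properties.ContentHash != null &&
                        remote.Properties.ContentHash.SequenceEqual(localMd5);

        if (!upToDate)
            await UploadFileAsync(localPath); // hypothetical upload helper
    }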

seanmcc-msft commented 3 years ago

@NeilMacMullen, the workaround is to calculate the content MD5 of your file, set it in BlobUploadOptions.HttpHeaders.ContentHash, and then call BlockBlobClient.Upload(stream, options).
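
A minimal sketch of that workaround, reusing the names from the StoreBlob snippet above (requires System.Security.Cryptography):

    // Compute the MD5 up front and attach it as the blob's Content-MD5 property,
    // so it is populated regardless of whether Put Blob or Put Block List is used.
    byte[] contentMd5;
    using (var md5 = MD5.Create())
        contentMd5 = md5.ComputeHash(bytes.ToArray());

    using var stream = new MemoryStream(bytes.ToArray());
    var options = new BlobUploadOptions
    {
        TransferOptions = TransferOptions,
        HttpHeaders = new BlobHttpHeaders { ContentHash = contentMd5 }
    };
    await blobReference.UploadAsync(stream, options).ConfigureAwait(false);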

I appreciate the SDK is a work in progress but the design here appears "less than optimal". As far as I'm concerned as a user, I've just called "Upload", and in some circumstances I get a hash code and in others I don't, even when using exactly the same client code! I can't see how anyone would think this lack of consistency is a defensible API design.

Our primary concern is not the service-generated checksum; it is making sure that most customers' Upload requests don't time out. Many customers are on slower internet connections, and large Put Blob requests will fail. In addition, it is faster and more efficient to do a multi-part upload.

@tg-msft @kasobol-msft

NeilMacMullen commented 3 years ago

the workaround is to calculate the content MD5 of your file, set it in BlobUploadOptions.HttpHeaders.ContentHash, and then call BlockBlobClient.Upload(stream, options)

Thanks @seanmcc-msft - that's useful to know. 👍

Our primary concern is not the service-generated checksum....

Understood :-) My point though is that the fact that the current implementation usually generates the checksum for the user is a "bad thing (TM)", especially for users of the earlier Storage library who (like me) have assumed that the checksum is a standard part of a blob and always available. In my case I happened to hit the particular combination of using TransferOptions (for performance reasons) and generating some files larger than 64K (thus forcing segmented upload). The poster in the original issue got tripped up by using "the wrong kind of stream".

As I said, it would be far preferable IMO to either never set it automatically or always set it. If that's not feasible, then at least a prominent IntelliSense/documentation warning on the BlobItem.ContentHash property, stating that the user can't rely on it being set automatically, would seem to be called for.

bhavin07 commented 3 years ago

I have also experienced the same inconsistency: for smaller files I get a ContentHash, and for larger files I don't.

I am glad that I found this post.

I upvote @NeilMacMullen's suggestion to make this absolutely clear in the documentation for BlobItem.ContentHash.

bhavin07 commented 3 years ago

To add to my previous comment.

the workaround is to calculate the content MD5 of your file, set it in BlobUploadOptions.HttpHeaders.ContentHash, and then call BlockBlobClient.Upload(stream, options)

I tried the above for a 2.9 GB file. I intentionally supplied a reversed md5Checksum in BlobUploadOptions.HttpHeaders.ContentHash:

.
.
.
            Array.Reverse(mD5Checksum);

            // Upload with the (deliberately wrong) MD5 set on the blob's HTTP headers.
            using var fileStream = new FileStream(localFilePath, FileMode.Open, FileAccess.Read);
            var blobContentInfo = await blobClient.UploadAsync(fileStream, new BlobUploadOptions()
            {
                HttpHeaders = new BlobHttpHeaders() { ContentHash = mD5Checksum }
            });

            return (blobContentInfo.Value.VersionId, blobContentInfo.Value.ContentHash);

I was surprised to see two things: 1) the upload was successful despite the wrong md5Checksum being supplied, and 2) ContentHash was not returned in the response.

I am interested in ContentHash for upload integrity. If ContentHash is not reliable, what is the best way to check upload integrity?

@seankane-msft
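
For reference, one possible pattern (a sketch only, not SDK guidance; the metadata key and variable names are illustrative) is to compute the MD5 client-side, store it as blob metadata so it is always available regardless of upload path, and re-check it after a later download:

    // Requires System.Security.Cryptography; "blobClient" and "localFilePath" as in
    // the snippet above. For very large files, download to a temp file rather than
    // a MemoryStream.
    byte[] localHash;
    using (var md5 = MD5.Create())
    using (var file = File.OpenRead(localFilePath))
        localHash = md5.ComputeHash(file);

    var uploadOptions = new BlobUploadOptions
    {
        Metadata = new Dictionary<string, string> { ["md5"] = Convert.ToBase64String(localHash) }
    };

    using (var upload = File.OpenRead(localFilePath))
        await blobClient.UploadAsync(upload, uploadOptions);

    // Later: download, recompute the hash, and compare with the stored metadata.
    BlobProperties props = await blobClient.GetPropertiesAsync();
    using var downloaded = new MemoryStream();
    await blobClient.DownloadToAsync(downloaded);
    downloaded.Position = 0;
    byte[] roundTripHash;
    using (var md5 = MD5.Create())
        roundTripHash = md5.ComputeHash(downloaded);
    bool intact = props.Metadata["md5"] == Convert.ToBase64String(roundTripHash);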