Closed NeilMacMullen closed 3 years ago
Thank you for your feedback. Tagging and routing to the team best able to assist. Please be aware that due to the holidays, responses may be delayed.
| | |
|---|---|
| Author: | NeilMacMullen |
| Assignees: | - |
| Labels: | `Client`, `Service Attention`, `Storage`, `customer-reported`, `needs-team-attention`, `needs-triage`, `question` |
| Milestone: | - |
Hi @NeilMacMullen, this is an artifact of how the SDK updates different sizes of data.
There are two ways to upload blob data into the storage service: Put Blob, and Put Block / Put Block List. When a blob is created with Put Blob, the service calculates the Content MD5, and the blob is essentially immutable. With Put Block and Put Block List, the service doesn't calculate the MD5: since new blocks can be added and existing blocks can be re-arranged, the service would have to stream the entire content of the blob each time it was modified to recalculate it.
The SDK uses Put Blob to upload blobs smaller than 256 MB. If the blob data is larger than 256 MB, or InitialTransferLength is specified, Put Block and Put Block List are used instead.
This behavior is unlikely to change in the future. Please re-open if you have further questions.
-Sean
@seanmcc-msft Thanks for the explanation - if I have understood correctly then the workaround is simply to leave InitialTransferLength at its default value?
I appreciate the SDK is a work in progress but the design here appears "less than optimal". As far as I'm concerned as a user, I've just called "Upload" and in some circumstances I get a hashcode and in some cases I don't even when using exactly the same client code! I can't see how anyone would think this lack of consistency is a defensible API design. (This also shows in the linked issue where you either get or don't get a checksum depending on undocumented characteristics of the source stream.)
> The SDK uses Put Blob to upload blobs smaller than 256 MB. If the blob data is larger than 256 MB, or InitialTransferLength is specified, Put Block and Put Block List are used instead.
On the face of it this seems... not good. Based on criteria that are opaque and undiscoverable to the user, the SDK is not only deciding whether or not a checksum will be generated but also changing the semantics of the blob I'm storing. One minute I think I'm storing an immutable blob, and then, simply by setting a parameter in a structure that I thought was tuning upload rate, I've made it modifiable in the future!
My strong suggestion would be to make the behaviour less flexible and more predictable. If the checksum can't be reliably supplied by the server then make it explicit in the API that the client has to supply it.
FWIW the reason that I care about the checksum is that we treat a container as a virtual file-store. We have code for synchronising subsets of that file-store to a local machine or another file-store. Being able to rely on the presence of a checksum is obviously a huge win when doing this since we can avoid transferring blobs which are already present in the target.
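That sync pattern can be sketched roughly as follows. This is a hypothetical illustration, not code from this thread: the names `BlobSync`, `HashesMatch`, `container`, and `localRoot` are my own, and the key point is that an empty `ContentHash` (the behaviour this issue is about) forces a transfer because equality can't be proven.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Threading.Tasks;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

static class BlobSync
{
    // A stored hash only proves equality when it is actually present;
    // an empty or missing ContentHash must force a download.
    public static bool HashesMatch(byte[]? remoteHash, byte[] localHash) =>
        remoteHash is { Length: > 0 } && remoteHash.SequenceEqual(localHash);

    // Hypothetical sync pass: download only blobs whose stored Content-MD5
    // differs from (or is absent for) the local copy.
    public static async Task SyncContainerAsync(BlobContainerClient container, string localRoot)
    {
        await foreach (BlobItem item in container.GetBlobsAsync())
        {
            string localPath = Path.Combine(localRoot, item.Name);

            if (File.Exists(localPath))
            {
                using var md5 = MD5.Create();
                using var fs = File.OpenRead(localPath);
                if (HashesMatch(item.Properties.ContentHash, md5.ComputeHash(fs)))
                    continue; // hashes match: skip the transfer
            }

            string? dir = Path.GetDirectoryName(localPath);
            if (!string.IsNullOrEmpty(dir))
                Directory.CreateDirectory(dir);
            await container.GetBlobClient(item.Name).DownloadToAsync(localPath);
        }
    }
}
```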
@NeilMacMullen, the workaround is to calculate the content MD5 of your file, set it in `BlobUploadOptions.HttpHeaders.ContentHash`, and then call `BlockBlobClient.Upload(stream, options)`.
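A minimal sketch of that workaround, assuming a `BlockBlobClient` and a local file path (the helper names `Md5Upload`, `ComputeMd5`, and `UploadWithMd5Async` are mine, not from the SDK): compute the MD5 locally, attach it via `BlobUploadOptions.HttpHeaders.ContentHash`, then upload.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Threading.Tasks;
using Azure.Storage.Blobs.Models;
using Azure.Storage.Blobs.Specialized;

static class Md5Upload
{
    // Compute the MD5 digest of a stream without loading it all into memory.
    public static byte[] ComputeMd5(Stream stream)
    {
        using var md5 = MD5.Create();
        return md5.ComputeHash(stream);
    }

    // Upload a file and pin its Content-MD5 explicitly, so ContentHash is
    // populated even when the SDK chooses Put Block / Put Block List.
    public static async Task UploadWithMd5Async(BlockBlobClient blobClient, string localFilePath)
    {
        byte[] hash;
        using (var hashStream = File.OpenRead(localFilePath))
            hash = ComputeMd5(hashStream);

        var options = new BlobUploadOptions
        {
            HttpHeaders = new BlobHttpHeaders { ContentHash = hash }
        };

        using var uploadStream = File.OpenRead(localFilePath);
        await blobClient.UploadAsync(uploadStream, options);
    }
}
```

Note the file is read twice (once to hash, once to upload), which keeps memory flat for large files at the cost of a second pass over the disk.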
> I appreciate the SDK is a work in progress but the design here appears "less than optimal". As far as I'm concerned as a user, I've just called "Upload" and in some circumstances I get a hashcode and in some cases I don't even when using exactly the same client code! I can't see how anyone would think this lack of consistency is a defensible API design.
Our primary concern is not the service-generated checksum; it is making sure that most customers' Upload requests don't time out. Many customers are on slower internet connections, and large Put Blob requests will fail. In addition, a multi-part upload is faster and more efficient.
@tg-msft @kasobol-msft
> the workaround is to calculate the content MD5 of your file, set it in `BlobUploadOptions.HttpHeaders.ContentHash`, and then call `BlockBlobClient.Upload(stream, options)`
Thanks @seanmcc-msft - that's useful to know. 👍
> Our primary concern is not the service-generated checksum....
Understood :-) My point, though, is that the fact that the current implementation usually generates the checksum for the user is a "bad thing (TM)", especially for users of the earlier Storage library who (like me) have assumed that the checksum is a standard part of a blob and always available. In my case it took the particular combination of using TransferOptions (for performance reasons) and generating some files larger than 64K (thus forcing segmented upload). The poster in the original issue got tripped up by using "the wrong kind of stream".
As I said, it would be far preferable IMO to either never set it automatically or always set it. If that's not feasible, then at the very least a prominent IntelliSense/documentation note on the `BlobItem.ContentHash` property, warning that the user can't rely on it being set automatically, would seem to be called for.
I have also experienced the same inconsistency: for smaller files I get a `ContentHash`, and for larger files I don't. I am glad that I found this post.
I upvote @NeilMacMullen's suggestion to make this absolutely clear in the documentation for `BlobItem.ContentHash`.
To add to my previous comment.
> the workaround is to calculate the content MD5 of your file, set it in `BlobUploadOptions.HttpHeaders.ContentHash`, and then call `BlockBlobClient.Upload(stream, options)`
I tried the above for a 2.9 GB file. I intentionally supplied a reversed md5Checksum in `BlobUploadOptions.HttpHeaders.ContentHash`:

```csharp
// ...
Array.Reverse(mD5Checksum);
var blobContentInfo = await blobClient.UploadAsync(new FileStream(localFilePath, FileMode.Open, FileAccess.Read), new BlobUploadOptions()
{
    HttpHeaders = new BlobHttpHeaders() { ContentHash = mD5Checksum }
});
return (blobContentInfo.Value.VersionId, blobContentInfo.Value.ContentHash);
```
I was surprised to see two things: 1) the upload was successful despite the wrong md5Checksum being supplied, and 2) ContentHash was not returned in the response.
I am interested in ContentHash for the upload integrity. If ContentHash is not reliable, what is the best way to check the upload integrity?
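For what it's worth, observation 1 is consistent with the Put Blob / Put Block List split described earlier in this thread: on a multi-part upload the blob-level Content-MD5 header is, as far as I can tell, stored as a property rather than validated against the content, so a wrong value is accepted. One way to verify integrity that trusts neither the stored property nor the transport is a round-trip check: hash the local file, re-download the blob, and compare. A hypothetical sketch (the names `UploadVerifier`, `Md5Of`, `VerifyRoundTripAsync`, `blobClient`, and `localFilePath` are assumptions of mine):

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Threading.Tasks;
using Azure.Storage.Blobs;

static class UploadVerifier
{
    // MD5 digest of an arbitrary stream.
    public static byte[] Md5Of(Stream s)
    {
        using var md5 = MD5.Create();
        return md5.ComputeHash(s);
    }

    // End-to-end check: after uploading, download the blob again and compare
    // its MD5 with the local file's. Expensive for large blobs, but it
    // verifies what is actually stored rather than a self-reported property.
    public static async Task<bool> VerifyRoundTripAsync(BlobClient blobClient, string localFilePath)
    {
        byte[] localHash;
        using (var fs = File.OpenRead(localFilePath))
            localHash = Md5Of(fs);

        using var downloaded = new MemoryStream();
        await blobClient.DownloadToAsync(downloaded);
        downloaded.Position = 0;

        return Md5Of(downloaded).SequenceEqual(localHash);
    }
}
```

Buffering the download in a `MemoryStream` only makes sense for modestly sized blobs; for a 2.9 GB file you would download to a temp file instead.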
@seankane-msft
Describe the bug
Some files in my container appear to have an empty `BlobItem.Properties.ContentHash`. I assume the problem occurs on upload, but it's possible the issue is with reading files. Small files do not exhibit the problem; large files do. The threshold appears to be around 50K. It's possible the problem is actually a side-effect of specifying an InitialTransferLength and MaximumTransferSize of 64K for upload/download in the StorageTransferOptions structure.
Note there is a similar issue at https://github.com/Azure/azure-sdk-for-net/issues/14037 but this describes a different mechanism.
Expected behavior
I would expect the MD5 of a blob always to be available (why would you ever NOT want it set?). Even if not set by the client on upload, it should be calculated by the server when storing the blob.
Actual behavior (include Exception or Stack Trace)
When reading back the blob properties using `BlobContainer.GetBlobsAsync`, the returned `BlobItem` entries contain an empty array for `ContentHash` when the file is larger than some threshold (64K? MaximumTransferSize?).
To Reproduce
Blobs are stored using this code...
Blobs are listed using this code ...
where
Environment:
- Azure.Storage.Blobs 12.7.0
- Windows 10
- .NET Core 3.0