Open audouts opened 3 years ago
This is intended behavior. The ETag value for a composite object is not based on MD5. Check out https://cloud.google.com/storage/docs/composite-objects#metadata
I think you misunderstood my post. As I described, I replaced the file with a non-composite version and I verified that the ETag is using the MD5. Only gsutil shows the wrong result.
Look at the second version and you'll see there is no Component-Count.
Sorry for misunderstanding the issue.
To be clear, are you expecting the ETag to be the same as the MD5 hash? The ETag for all GCS objects will be different from their MD5 hash. You can query the GCS API directly and you will find that the ETag is different from the MD5 hash for the objects. https://cloud.google.com/storage/docs/json_api/v1/objects/get?apix=true
No problem.
The ETag is definitely the same as the MD5 hash. I verified that with cURL, which returned this:
date: Fri, 05 Feb 2021 01:40:16 GMT
cache-control: private, max-age=0
last-modified: Thu, 04 Feb 2021 00:28:54 GMT
etag: "81b366bfdde43cef22521dc69315ec90"
That is exactly what CloudBerry shows:
But gsutil still shows the wrong thing:
Update time: Thu, 04 Feb 2021 00:28:54 GMT
Hash (crc32c): 83ZSpQ==
Hash (md5): gbNmv93kPO8iUh3GkxXskA==
ETag: CIG2l8/8zu4CEAE=
And for verification, the Hash (md5): field above is base64-encoded binary but resolves to the hex value: 81b366bfdde43cef22521dc69315ec90. That's the same as the ETag shown by cURL and CloudBerry.
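For anyone who wants to repeat that check, here is a minimal Python sketch of the base64-to-hex conversion. The function name `b64_md5_to_hex` is made up for illustration; the input is the Hash (md5) value from the gsutil output above.

```python
import base64


def b64_md5_to_hex(b64_digest: str) -> str:
    """Decode a base64-encoded digest and return it as lowercase hex."""
    return base64.b64decode(b64_digest).hex()


# Hash (md5) value reported by gsutil in this thread
print(b64_md5_to_hex("gbNmv93kPO8iUh3GkxXskA=="))
# prints 81b366bfdde43cef22521dc69315ec90
```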
I'm pretty sure this is working as designed, though it's fair to ask whether the design makes sense:
I believe what you're experiencing is differences in how eTags are implemented within the XML and JSON APIs. The documentation is somewhat vague on the point, but says (albeit indirectly) that you should only expect eTags and MD5 to match when using the XML API (https://cloud.google.com/storage/docs/hashes-etags#_ETags). For the JSON API, which is likely the value being returned by gsutil, it appears to be a different value. FWIW, the RFC (https://tools.ietf.org/html/rfc7232#section-2.3) says eTags are "opaque validators", so it seems unfortunate that different APIs return different values, but not necessarily wrong.
You can check this theory by making a direct call via the JSON API using the API Explorer here: https://cloud.google.com/storage/docs/json_api/v1/objects/get. The eTag returned there, I suspect, will match the eTag you see in gsutil.
@starsandskies Thanks. I think you're right about using the JSON API having the same result. The example in API Explorer doesn't use a key/secret pair for authentication. Is there an example that does?
And what is that value that gsutil returns? What purpose does it have if I can't calculate it or somehow validate it?
It seems problematic to me that there are conflicting ETag values, depending on which URL I use to access the object. At the very least, gsutil should have an option to return the other value.
Let me preface this by saying that I am very shaky when it comes to auth stuff and the related terminology, so what I'm about to say could easily be either wrong or not what you're asking about: My understanding is that for the JSON API, you can only authenticate with OAuth 2.0 tokens, and you can only use RSA keys to create the tokens - Signatures won't work with the JSON API, and HMAC keys can't be used to generate OAuth tokens.
As for the purpose, my understanding (again, admittedly limited) is that its only real purpose is to track changes in the object data over time. The RFC reads: "An entity-tag is an opaque validator for differentiating between multiple representations of the same resource, regardless of whether those multiple representations are due to resource state changes over time, content negotiation resulting in multiple representations being valid at the same time, or both." I take that to mean you use the eTag to confirm that the object now is the same as the object was at some earlier point in time.
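To make the "opaque validator" idea concrete, here is a rough Python sketch of a conditional request: the client sends back whatever eTag it saw earlier and the server answers 304 Not Modified if nothing changed. This assumes the endpoint honors the standard If-None-Match header from the RFC; the function names and any URL you pass in are placeholders, not something taken from the GCS docs.

```python
import urllib.error
import urllib.request


def build_conditional_request(url: str, etag: str) -> urllib.request.Request:
    """Build a GET request that carries a previously seen ETag as a validator."""
    return urllib.request.Request(url, headers={"If-None-Match": etag})


def has_changed(url: str, etag: str) -> bool:
    """Return False if the server replies 304 Not Modified, True otherwise."""
    try:
        with urllib.request.urlopen(build_conditional_request(url, etag)):
            return True  # a 2xx reply means a different representation came back
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return False  # the ETag still matches, so nothing changed
        raise
```

Note that the client never has to know how the eTag was computed. It only stores the value and sends it back unchanged, which is why an arbitrary value works for this purpose.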
For data validation, I believe both XML and JSON report MD5 and CRC32c values (except for composite objects, which don't have MD5 values), though my knowledge (one last time) is not extensive when it comes to data validation. There's some documentation about it here, but it's middling in quality.
Last, but not least - I think there is a way to return the other value! Since the other value comes from going through the XML API, I believe you can set gsutil to preferentially make requests through it. You can find a short bit of info here to permanently set the preference, or you can use the -o top-level option to set it for a single command.
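As a sketch of the single-command route, the helper below builds the argument list for a gsutil call that overrides prefer_api through the -o top-level option, which takes values in section:name=value form. The helper name `gsutil_ls_command` and the gs:// URL are placeholders I made up; actually running the command assumes gsutil is installed and authenticated.

```python
from typing import List


def gsutil_ls_command(object_url: str, api: str = "xml") -> List[str]:
    """Return the argv for `gsutil ls -L` with prefer_api overridden for one run."""
    if api not in ("xml", "json"):
        raise ValueError("api must be 'xml' or 'json'")
    return ["gsutil", "-o", f"GSUtil:prefer_api={api}", "ls", "-L", object_url]


if __name__ == "__main__":
    # gs://example-bucket/example-object is a placeholder, not a real object
    cmd = gsutil_ls_command("gs://example-bucket/example-object")
    print(" ".join(cmd))
    # To run it for real: import subprocess; subprocess.run(cmd, check=True)
```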
Thanks for your help! I'll look into the options. Hopefully, I can get gsutil to use the XML interface.
I came to the same conclusion in my limited understanding - that the ETag is meant to track changes. To me, the use-case of a hash is obvious - it allows me to compare a local file (or other copy) and be sure that what I have matches the server version. That also seems like the most common change-tracking that people would need and so a hash as the ETag seems like a good method. An arbitrary value that I don't know how to reproduce seems far less useful but there certainly may be valid use-cases that I haven't considered.
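For the local-versus-server comparison described here, a minimal Python sketch might look like the following. It computes the MD5 of a local file and returns it both as hex (the form the XML API ETag takes for non-composite objects in this thread) and as base64 (the form gsutil prints under Hash (md5)). The function name `local_md5` is hypothetical.

```python
import base64
import hashlib
from typing import Tuple


def local_md5(path: str, chunk_size: int = 1024 * 1024) -> Tuple[str, str]:
    """Return the (hex, base64) MD5 digests of a local file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large files are not loaded into memory at once
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    raw = digest.digest()
    return raw.hex(), base64.b64encode(raw).decode("ascii")
```

The hex value can be compared with the ETag header after stripping its surrounding quotes, and the base64 value with the Hash (md5) line from gsutil ls -L. Neither comparison applies to composite objects, which have no MD5.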
Per the suggestion from @starsandskies, I tried changing my .boto configuration:
prefer_api = xml
This gave me the result that I wanted: the expected MD5 ETag.
That leaves me with two concerns, which I think are worth considering:
In the future, I might be able to use the CRC/MD5 header but it's very unhelpful to have it encoded.
gsutil version: 4.58
I have files that were uploaded in multipart mode. The result was "composite" files, which is not at all desirable. When I inspect the files using my normal client app (CloudBerry), I can see the component count (x-goog-component-count), an unknown ETag, etc. When I replace this file with a normal upload, the component count goes away, as expected, and I can see an MD5 hash for the ETag.
However, when viewing the files with gsutil ls -L or gsutil ls -le, I see a different ETag that is wrong. This ETag only shows up with gsutil. Getting the header when I download the file shows the MD5 and my client app shows the MD5.