NuGet / Insights

Gather insights about public NuGet.org package data

Use hex encoded file hash, not Base64 encoded #95

Closed zivkan closed 1 year ago

zivkan commented 1 year ago

The blobs in blob storage have a Content-MD5 header, which is great: I can quickly check whether my local copy is the same without downloading. However, the MD5 hash is base64 encoded, making it a different string than the output of md5sum and openssl md5 on UNIX-like operating systems, PowerShell's Get-FileHash -Algorithm MD5, or Windows' certutil.exe -hashfile <filename> md5.
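For reference, the conversion between the two representations is trivial in .NET (a minimal sketch, assuming .NET 5+ for Convert.ToHexString; the base64 value below is just an example):

```csharp
using System;

// An example Content-MD5 value as shown for a blob (base64-encoded MD5 bytes).
string contentMd5Base64 = "gHvm5gDmsKYjl0zQ/unziA==";

// Decode the base64 string back to the raw 16 MD5 bytes...
byte[] hashBytes = Convert.FromBase64String(contentMd5Base64);

// ...then re-encode as hex. Lowercase matches md5sum / openssl md5 output;
// Get-FileHash prints the same digits in uppercase.
string hex = Convert.ToHexString(hashBytes).ToLowerInvariant();

Console.WriteLine(hex); // 807be6e600e6b0a623974cd0fee9f388
```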

I don't know if you'd consider changing the value; it would be a breaking change, and clients querying blob storage directly would temporarily be left to figure out whether a given value is a base64 or hex encoded string.

At the very least, I'd like to raise awareness that many other tools output hex-encoded strings, and I'd like .NET developers to stop base64-encoding hash bytes (this is not the first app I've seen make this decision/"mistake", but every app I've seen that does this has been a .NET app).

joelverhagen commented 1 year ago

Could you specify which base64 CSV columns are most helpful as hex?

In the Certificates table we store the fingerprint as base64 and hex (both). In many other places we prefer base64.

Generally, I wanted to keep the data size in Kusto to a minimum because (internal detail) we don't own the clusters and I wanted to minimize our impact on the partner team as much as possible. Another justification in my mind was that the base64-to-hex conversion is supported in Kusto, so query writers could use hex if they wanted. But if you're processing the CSV blobs outside of Kusto, then it makes sense that the base64 is painful, since the base64-to-hex conversion needs to be handled by you.

For backward compatibility in the table schema, we'd probably want to keep the existing base64 fields and add hex selectively where it's most useful.

zivkan commented 1 year ago

I'm referring to when you look at a file in Azure Storage Explorer: it has a Content-MD5 header/property, which I assume is the MD5 hash of the raw bytes of the file, either gzipped or not (I haven't gotten far enough into my app to actually do the comparison yet). I was hoping to use it to ensure that my downloads are up to date and not corrupt.
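For what it's worth, a minimal sketch of that check with the Azure.Storage.Blobs SDK (the blob URL and file name are placeholders; assuming a .NET 5+ top-level program):

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using Azure.Storage.Blobs;

// Placeholder URL; any publicly readable CSV blob would work the same way.
var blobClient = new BlobClient(new Uri("https://example.blob.core.windows.net/container/data.csv.gz"));

// ContentHash is the raw MD5 bytes behind Content-MD5 -- no base64 or hex involved at this level.
byte[] remoteMd5 = (await blobClient.GetPropertiesAsync()).Value.ContentHash;

// Hash the local copy exactly as it sits on disk (still gzipped if that's how it was downloaded).
using var md5 = MD5.Create();
using var localFile = File.OpenRead("data.csv.gz");
byte[] localMd5 = md5.ComputeHash(localFile);

// Compare bytes directly; the base64-vs-hex question only appears once you stringify.
bool upToDate = remoteMd5 != null && remoteMd5.SequenceEqual(localMd5);
Console.WriteLine(upToDate ? "Local copy matches the blob" : "Blob changed or local copy is corrupt");
```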

joelverhagen commented 1 year ago

Ah, sorry, I misunderstood. The Content-MD5 response header in Azure Blob Storage is mostly outside of our control. It's a Blob Storage feature and not a user-provided hash. Well, technically the upload SDK calculates it and sends it along during blob upload, and this is hardcoded to be base64. It's an interesting question whether you could hack the SDK to send a hex hash in its place and whether the server would accept it. I don't know if the server confirms the hash correctness or enforces any schema. Maybe it's just a special user-provided spot for data!

Example in Blob REST docs: https://learn.microsoft.com/en-us/rest/api/storageservices/put-blob?tabs=azure-ad#sample-response

If we wanted to fulfill your request for the CSV blobs, we would need to either a) test hacking a hex MD5 hash (or even a hex SHA-256, for fun) into the blob upload request (Azure Blob SDK hacks) or b) stuff additional hash(es) into the user-provided blob metadata headers, like x-ms-meta-SHA512.

Note that the latter solution is implemented on NuGet.org for the SHA-512 hash, since we can't use SHA-256. But, you guessed it, it's base64 encoded there too.

http HEAD https://api.nuget.org/v3-flatcontainer/newtonsoft.json/9.0.1/newtonsoft.json.9.0.1.nupkg
HTTP/1.1 200 OK
Accept-Ranges: bytes
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: x-ms-request-id,Server,x-ms-version,Content-Length,Date,Transfer-Encoding
Age: 70694
Cache-Control: max-age=86400
Content-Length: 1613054
Content-MD5: gHvm5gDmsKYjl0zQ/unziA==
Content-Type: application/octet-stream
Date: Wed, 13 Sep 2023 14:57:59 GMT
Etag: 0x8D6323A24D3D246
Expires: Thu, 14 Sep 2023 14:57:59 GMT
Last-Modified: Mon, 15 Oct 2018 01:04:22 GMT
Server: ECAcc (cmh/38B5)
Strict-Transport-Security: max-age=31536000; includeSubDomains
X-CDN-Rewrite: Root path in dist
X-Cache: HIT
X-Content-Type-Options: nosniff
x-ms-blob-type: BlockBlob
x-ms-lease-status: unlocked
x-ms-meta-SHA512: 2okXpTRwUcgQb06put5LwwCjtgoFo74zkPksjcvOpnIjx7TagGW5IoBCAA4luZx1+tfiIhoNqoiI7Y7zwWGyKA==
x-ms-meta-da7b2905_0f3c_4262_921c_b1593d1336f1_ESRP_RequestId: 9602ac1a-54ce-4959-bc3a-f5e53c2cf7f8
x-ms-request-id: f1ad0a77-901e-0010-26ae-e5789f000000
x-ms-version: 2009-09-19

Including x-ms-meta-MD5-hex in the blob is very do-able.
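A rough sketch of that option with the Azure.Storage.Blobs SDK (the connection details and the MD5Hex metadata name are placeholders for illustration, not anything the project does today):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

// Placeholder connection details; the uploader would use its own.
var blobClient = new BlobClient("<connection-string>", "container", "data.csv.gz");

// Hash the exact bytes that are about to be uploaded.
using var md5 = MD5.Create();
using var content = File.OpenRead("data.csv.gz");
byte[] hash = md5.ComputeHash(content);
content.Position = 0;

// User-provided metadata comes back on every GET/HEAD as an x-ms-meta-* header,
// independent of whatever Content-MD5 the SDK or service ends up storing.
var options = new BlobUploadOptions
{
    Metadata = new Dictionary<string, string>
    {
        ["MD5Hex"] = Convert.ToHexString(hash).ToLowerInvariant(),
    },
};
await blobClient.UploadAsync(content, options);
```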

joelverhagen commented 1 year ago

Oh, it looks like Content-MD5 is more than a simple property. From the docs on the same page, about Content-MD5:

Optional. An MD5 hash of the blob content. This hash is used to verify the integrity of the blob during transport. When this header is specified, the storage service checks the hash that has arrived against the one that was sent. If the two hashes don't match, the operation fails with error code 400 (Bad Request).

When the header is omitted in version 2012-02-12 or later, Blob Storage generates an MD5 hash.

It doesn't mention hex or base64, so there's a possibility it accepts and retains a hex shape, but that would really surprise me. Seems like the x-ms-meta-* approach is better.

zivkan commented 1 year ago

Oh, my mistake. When I searched the Insights source code for the string "Content-MD5", it found some matches, so I assumed that Insights was setting it. Interesting (and personally disappointing to me) that yet another product is base64-encoding the hash rather than hex-encoding it.

However, I'm currently playing with the Azure SDK, and it appears that the value is a byte[], meaning this is actually an Azure Storage Explorer display issue, not an error in Azure Storage's API.

Anyway, all the files I've looked at in Azure Storage Explorer have a Content-MD5 value, so I expect that will be sufficient for my needs.
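For completeness, a tiny sketch of how that byte[] turns into either display string through the SDK (placeholder URL, assuming .NET 5+):

```csharp
using System;
using Azure.Storage.Blobs;

var blobClient = new BlobClient(new Uri("https://example.blob.core.windows.net/container/data.csv.gz"));

// ContentHash is a plain byte[]; base64 is only Storage Explorer's choice of rendering.
byte[] md5Bytes = (await blobClient.GetPropertiesAsync()).Value.ContentHash;

Console.WriteLine(Convert.ToBase64String(md5Bytes)); // what Storage Explorer shows
Console.WriteLine(Convert.ToHexString(md5Bytes));    // what md5sum-style tools show (uppercase here)
```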