NuGet / NuGetGallery

NuGet Gallery is a package repository that powers https://www.nuget.org. Use this repo for reporting NuGet.org issues.
Apache License 2.0

[Feature request] Make package hash more accessible through the V3 API #9433

Open pombredanne opened 1 year ago

pombredanne commented 1 year ago

NuGet Product Used

NuGet SDK

Product Version

6.0.400

Worked before?

No response

Impact

It's more difficult to complete my work

Repro Steps & Context

There is no option to collect the packageHash of a NuGet in the API V3 short of:

  1. Doing a registration call followed by catalog call (which is a slow thing) https://api.nuget.org/v3/registration5-gz-semver2/newtonsoft.json/12.0.3.json then https://api.nuget.org/v3/catalog0/data/2022.12.08.16.43.03/newtonsoft.json.12.0.3.json

  2. Calling the (deprecated) V2 XML api instead as in https://www.nuget.org/api/v2/Packages(Id='newtonsoft.json',Version='12.0.3'). This does provide the full details including packageHash in a single call.

As a result, the only available options to collect a hash (needed for an important .nupkg integrity verification) are either slow or deprecated.
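For reference, the two-step flow in option 1 above can be sketched in Python. This is a sketch, not official tooling; the field names follow the documented V3 registration and catalog schemas (in a registration leaf, "catalogEntry" is the URL of the catalog leaf, which carries "packageHash" and "packageHashAlgorithm"):

```python
import gzip
import json
from urllib.request import urlopen

REGISTRATION_LEAF = "https://api.nuget.org/v3/registration5-gz-semver2/{id}/{version}.json"

def fetch_json(url: str) -> dict:
    # The -gz- registration blobs are stored gzip-encoded; decompress if needed.
    with urlopen(url) as resp:
        data = resp.read()
    if data[:2] == b"\x1f\x8b":
        data = gzip.decompress(data)
    return json.loads(data)

def catalog_leaf_url(registration_leaf: dict) -> str:
    # In a registration *leaf*, "catalogEntry" is the catalog leaf URL.
    return registration_leaf["catalogEntry"]

def package_hash(package_id: str, version: str) -> tuple:
    # First round trip: registration leaf for the exact version.
    leaf = fetch_json(REGISTRATION_LEAF.format(id=package_id.lower(), version=version.lower()))
    # Second (slow) round trip: catalog leaf with the hash fields.
    entry = fetch_json(catalog_leaf_url(leaf))
    return entry["packageHash"], entry["packageHashAlgorithm"]
```

The extra catalog round trip is exactly the cost being complained about in this issue.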

Note that the same is true when using the SDK and making calls to the NuGet Client libraries.

Please add the package hash to the v3 API

Verbose Logs

No response

erdembayar commented 1 year ago

I'm transferring this to NuGetGallery; it looks like a protocol-related feature request.

joelverhagen commented 1 year ago

Hey @pombredanne, thanks for taking the time to open this. To be clear this is a feature request, not a bug. By design, the package metadata ("registration") URL you mentioned does not contain the hash. Let's leave this issue open to collect upvotes (which we use, in part, to prioritize work).

Generally speaking, the shape of the V3 API and therefore the shape of the NuGet client libraries (namely NuGet.Protocol) is informed by what the NuGet client needs to complete officially supported tooling scenarios. Up to this point, there has been no need to have the package hash, so that's why it was never added to the package metadata endpoint.

Could you help me understand your scenario? Why do you need the hash from the V3 API such that the additional HTTP request to the catalog is not tolerable?

In the meantime, I recommend continuing to use the catalog call. If you want to use an undocumented approach, you can perform an HTTP HEAD on the .nupkg URL and read the hash from a response header.

I don't recommend this approach since it's possible we will remove this header later (it's an implementation detail -- and if you break on it later we will offer no support), but if you want a small, single request approach right now you can choose to take that risk.
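The HEAD trick might look like the sketch below. The screenshot naming the exact header did not survive in this transcript, so `Content-MD5` (the standard Azure Blob Storage response header) is an assumption here; as warned above, it is an implementation detail that could disappear without notice:

```python
from urllib.request import Request, urlopen

# Flat-container (package content) URL; lowercase id and version per the V3 spec.
FLAT_CONTAINER = ("https://api.nuget.org/v3-flatcontainer/"
                  "{id}/{version}/{id}.{version}.nupkg")

def nupkg_url(package_id: str, version: str) -> str:
    return FLAT_CONTAINER.format(id=package_id.lower(), version=version.lower())

def head_hash(package_id: str, version: str):
    # Single HEAD request; no body is downloaded.
    req = Request(nupkg_url(package_id, version), method="HEAD")
    with urlopen(req) as resp:
        # Assumed header name (Content-MD5) -- unsupported, may be removed later.
        return resp.headers.get("Content-MD5")
```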

pombredanne commented 1 year ago

@joelverhagen Thanks for the reply! I have been perusing your code and insights elsewhere, so kudos to you overall. :heart:

The use case is to support https://github.com/nexB/nuget-inspector, a self-contained, standalone tool that can take modern and legacy project files and various artifacts and resolve a NuGet dependency tree without having any of the dotnet, MSBuild, or nuget clients installed, on any OS.

It is designed to be used in many other tools including ScanCode.io https://github.com/nexB/scancode.io/ and ORT https://github.com/oss-review-toolkit/ort/pull/6209#issuecomment-1448219091

As a standalone tool, it can be integrated in analysis pipelines where the analyzed code is not in a buildable state, may be legacy, or is otherwise unusual. Basically, it is designed to try hard to resolve NuGet dependencies in almost any case you can throw at it.

There we collect as much upstream metadata as possible to support verification, matching, and validation as part of tooling that supports open source supply chain management, intelligence, licensing, and security.

This is part of a larger suite of "inspectors" such as https://github.com/nexB/python-inspector and https://github.com/nexB/debian-inspector that have similar ecosystem-specific capabilities to resolve dependencies using the native library of an ecosystem.

My 2 cents wrt. the NuGet APIs, based on extensive experience building package data collection and integration systems across many ecosystems: I find them harder to deal with than other application package repository APIs such as those of PyPI, RubyGems, or npm, since they require multiple calls to get the data and have multiple, somewhat confusing endpoints, each doing mostly but not exactly the same thing.

I would typically have to do a single API call with most other APIs when I know a package identity (name/version or just name), and would get all metadata in return. I would also expect specialized APIs to support efficient and optimal dependency resolution with as little data returned as needed, but that data should include at least checksums to validate download integrity. In particular, there are cases where we can hit TOCTOU situations: we first do a lookup to get a dependency tree, later fetch the actual code packages (e.g., .nupkg files here), and want to validate that what we got is what we intended to fetch.
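The TOCTOU concern described above reduces to: record a hash at resolution time, verify it at fetch time. A minimal sketch, assuming the catalog's base64-encoded `packageHash` / `packageHashAlgorithm` fields:

```python
import base64
import hashlib

def verify_nupkg(data: bytes, expected_hash_b64: str, algorithm: str = "SHA512") -> bool:
    # Catalog leaves carry packageHash as a base64-encoded digest
    # (packageHashAlgorithm is typically SHA512 on nuget.org).
    digest = hashlib.new(algorithm.lower(), data).digest()
    return base64.b64encode(digest).decode("ascii") == expected_hash_b64
```

Without a cheap way to get the hash at resolution time, this verification step is what becomes slow or impossible.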

pombredanne commented 1 year ago

@joelverhagen re:

In the meantime, I recommend continuing to use the catalog call. If you want to use an undocumented approach, you can perform an HTTP HEAD on the .nupkg URL

Is this what the NuGet.Client uses?

joelverhagen commented 1 year ago

My 2 cents wrt. the NuGet APIs, based on extensive experience building package data collection and integration systems across many ecosystems: I find them harder to deal with than other application package repository APIs such as those of PyPI, RubyGems, or npm, since they require multiple calls to get the data and have multiple, somewhat confusing endpoints, each doing mostly but not exactly the same thing.

We've gotten this feedback before. I agree with you. The V3 API docs (https://learn.microsoft.com/en-us/nuget/api/overview) that I originally wrote were an effort to mitigate this problem somewhat. Previously neither the V2 nor the V3 APIs were documented.

If my team had more capacity or if there was a stronger need in our user base, I would like to make some of these enhancements so that NuGet.org's data is more accessible and more open. But the reality is that we're a relatively small team and our primary goal is to enable a specific set of NuGet-based features, mostly centering around .NET client tooling experiences in the .NET CLI and Visual Studio. Broadening our scope even a little bit will have upfront and maintenance costs that we can't afford right now.

Concerning your particular scenario: I think hashes or checksums aren't needed by the client today not because we don't care about the integrity of the content. Instead, our solutions to the threats I believe you're alluding to center more on package signing (integrity and authenticity), package source mapping (determinism), and a general hardening of NuGet.org's ingress and egress flows (in many ways I can't go into here).

I'll also note that the structure of the V3 protocol is really not optimized for ease of use (to put it bluntly). This isn't an intentional barrier put in place for newcomers, but rather a design decision to make scaling our service possible for a small team. Our architecture is a simple "compute-less" blob storage + CDN layout. This means our costs are very low (mostly bandwidth) and we're not worrying about application server scaling for our most critical scenarios. But it also means that adding a new property to, say, the registration JSON means recomputing millions of JSON blobs. That's one more hurdle to adding a new hash property to the JSON.

If you'd truly like an optimized view of NuGet.org, you can consider building your own projection using this guide: https://learn.microsoft.com/en-us/nuget/guides/api/query-for-all-published-packages

This is what I did in Insights and how our backend jobs populate the various official V3 endpoints.
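The projection guide linked above essentially walks the append-only catalog and keeps the newest leaf per package identity. A rough, offline sketch of that reduce step (fetching catalog pages is left out; the field names `nuget:id`, `nuget:version`, `commitTimeStamp`, and `@id` follow the documented catalog page schema):

```python
def latest_leaves(catalog_items: list) -> dict:
    """Reduce catalog page items to the newest leaf per (id, version)."""
    latest = {}
    for item in catalog_items:
        key = (item["nuget:id"].lower(), item["nuget:version"].lower())
        # commitTimeStamp values are ISO 8601 strings, so a string
        # comparison is a reasonable approximation of recency here.
        if key not in latest or item["commitTimeStamp"] > latest[key]["commitTimeStamp"]:
            latest[key] = item
    return latest
```

A projection like this is what lets you serve hash lookups in a single request from your own storage.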

Is this what the NuGet.Client uses?

Yes and no. NuGet Client uses that URL via the package content endpoint (https://learn.microsoft.com/en-us/nuget/api/package-base-address-resource#download-package-content-nupkg), however NuGet client doesn't send a HEAD request (it uses GET) and it doesn't use that hash response header (it just uses the response body).

pombredanne commented 1 year ago

Our architecture is simple "compute-less" blob storage + CDN layout.

Ah! thank you for this insight (and the others too :+1: ). It helps a lot to better grok the design choices.

So I wonder if the blob storage has a way to create aliases or "symlinks" of sorts that may not be too costly: if so, it may be possible to alias the latest catalog page of a given identity to something like "latest"?

e.g., "https://api.nuget.org/v3/catalog0/data/2022.12.08.16.43.03/newtonsoft.json.12.0.3.json" would become accessible through "https://api.nuget.org/v3/catalog0/data/latest/newtonsoft.json.12.0.3.json" too?

joelverhagen commented 1 year ago

As far as I know, there's no such aliasing in blob storage. But there is a fast blob copy operation which should be a similar amount of IO as updating a symlink.

Generally, we wouldn't want to extend the catalog for this purpose. The catalog is meant to be an append-only log, not a structure optimized for random access. Additionally, nothing about the catalog protocol requires that each event be about a package, so it's unclear what that symlink would do for non-package-related events. But this is getting very hypothetical.

I think the right fix here is to actually update the registration blobs. But as I mentioned before, it'll be hard to justify this work.

Having a totally new resource which simply points to the latest catalog leaf would also be reasonable.

Until we have the capacity and priority to implement this idea, anyone in the community is free to implement their own catalog reader that creates this pointer on their own blob storage or even a web service that does the redirect.
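Such a community-run pointer could be as small as a map from package identity to the latest catalog leaf, plus an HTTP redirect. A toy sketch (the `/catalog-latest/{id}/{version}` URL layout and the in-memory index are hypothetical; a catalog reader keeping the index current is not shown):

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical in-memory index, maintained by a catalog reader (not shown).
LATEST_LEAF = {
    "newtonsoft.json/12.0.3":
        "https://api.nuget.org/v3/catalog0/data/2022.12.08.16.43.03/newtonsoft.json.12.0.3.json",
}

def resolve(path: str):
    # Assumed path shape: /catalog-latest/{id}/{version}
    parts = path.strip("/").split("/")
    if len(parts) == 3 and parts[0] == "catalog-latest":
        return LATEST_LEAF.get(f"{parts[1].lower()}/{parts[2].lower()}")
    return None

class Redirector(BaseHTTPRequestHandler):
    def do_GET(self):
        target = resolve(self.path)
        if target is None:
            self.send_error(404)
            return
        # 302 to the current catalog leaf for that package identity.
        self.send_response(302)
        self.send_header("Location", target)
        self.end_headers()
```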

joelverhagen commented 1 year ago

Another thing I should add is that any enhancement to the catalog protocol is very unlikely to be picked up by another package source implementation such as Azure DevOps, since they don't implement the catalog at all. Enhancements to existing resources have the benefit of being easier for existing package source implementations to pick up.

pombredanne commented 1 year ago

@joelverhagen Thank you ++.