NuGet / NuGetGallery

NuGet Gallery is a package repository that powers https://www.nuget.org. Use this repo for reporting NuGet.org issues.

[Proposal] Add API endpoint to retrieve the list of files that are in each .nupkg archive #7751

Open dannyvv opened 4 years ago

dannyvv commented 4 years ago

Add API endpoint to retrieve the file list of a NuGet package

This is a proposal to expose the list of files (and their file sizes) in the NuGet server API.

Motivation

There are various use cases for having the list of files in a NuGet package; this proposal lists them in detail below. The immediate motivation is to optimize builds by avoiding downloading all packages before the build starts: the build engine acts as the NuGet client, downloads only the packages that are actually used by the build, and interleaves those downloads with the build itself.

This might benefit other clients as well: if the list of files is exposed, one can create an implementation of NuGet.Packaging.PackageReaderBase that doesn't need to download the whole .nupkg file. It just needs this proposed API and the already exposed .nuspec file.

Spec

This will add a new endpoint to the Package Content family of endpoints. The spec already defines downloading of the .nupkg file and the .nuspec file. There have also been extensions proposed on this API for icons and licenses.

Please correct me if I'm wrong on the extra ones...

So an extra entry here for a new JSON document with the following Package Content based URL:

GET {@id}/{LOWER_ID}/{LOWER_VERSION}/packageContents.json

This will return a JSON document which contains a property called packageEntries. The packageEntries element is a JSON array of JSON objects, each object representing a packageEntry.

| Name | Type | Required | Notes |
|------|------|----------|-------|
| packageEntries | array of objects | yes | Each entry is a 'packageEntry' |

The packageEntry leaf element is a JSON object with the following properties:

| Name | Type | Required | Notes |
|------|------|----------|-------|
| fullName | string | yes | The full relative path in the package. |
| length | integer | yes | The size of the file in bytes. |

The contents of fullName must use / as a path separator, to match the paths used by IPackageCoreReader.

The fullName must also match the final laid-out-on-disk format after extraction. NuGet uses custom path encoding in the zip files, and the paths here should be unencoded, i.e. the API should return lib/portable-net40+sl5+wp80+win8+wpa81/Newtonsoft.Json.dll rather than the encoded form (lib/portable-net40%2Bsl5%2Bwp80%2Bwin8%2Bwpa81/Newtonsoft.Json.dll) as it is stored in the zip file.
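As a small illustration (assuming the zip entry names are percent-encoded, as they are in packages on nuget.org today), a server or client could normalize an encoded entry name like this:

// Hypothetical helper: convert a percent-encoded zip entry name into the
// unencoded, on-disk layout form that the proposed API would return.
static string NormalizeEntryName(string zipEntryName)
{
    // e.g. "lib/portable-net40%2Bsl5%2Bwp80%2Bwin8%2Bwpa81/Newtonsoft.Json.dll"
    //   -> "lib/portable-net40+sl5+wp80+win8+wpa81/Newtonsoft.Json.dll"
    return System.Uri.UnescapeDataString(zipEntryName.Replace('\\', '/'));
}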

I chose to use the names currently exposed on nuget.org, as in this sample; I have no preference for any particular naming scheme here.

The zip archive stores extra fields, and the current nuget.org packageEntries field exposes these extra fields as well. They are not strictly necessary for the use cases below, but they might be useful. I have left them off for now to make it easier for implementers.


Sample request

GET https://api.nuget.org/v3-flatcontainer/newtonsoft.json/6.0.4/packageContents.json
{
  "count": 19,
  "packageEntries": [
    {
      "fullName": "_rels/.rels",
      "length": 500
    },
    {
      "fullName": "Newtonsoft.Json.nuspec",
      "length": 682
    },
    {
      "fullName": "lib/net20/Newtonsoft.Json.dll",
      "length": 493056
    },
    {
      "fullName": "lib/net20/Newtonsoft.Json.xml",
      "length": 516034
    },
    {
      "fullName": "lib/net35/Newtonsoft.Json.dll",
      "length": 430080
    },
    {
      "fullName": "lib/net35/Newtonsoft.Json.xml",
      "length": 459802
    },
    {
      "fullName": "lib/net40/Newtonsoft.Json.dll",
      "length": 493056
    },
    {
      "fullName": "lib/net40/Newtonsoft.Json.xml",
      "length": 478726
    },
    {
      "fullName": "lib/net45/Newtonsoft.Json.dll",
      "length": 502272
    },
    {
      "fullName": "lib/net45/Newtonsoft.Json.xml",
      "length": 478726
    },
    {
      "fullName": "lib/netcore45/Newtonsoft.Json.dll",
      "length": 446976
    },
    {
      "fullName": "lib/netcore45/Newtonsoft.Json.xml",
      "length": 448530
    },
    {
      "fullName": "lib/portable-net40+sl5+wp80+win8+wpa81/Newtonsoft.Json.dll",
      "length": 387072
    },
    {
      "fullName": "lib/portable-net40+sl5+wp80+win8+wpa81/Newtonsoft.Json.xml",
      "length": 425586
    },
    {
      "fullName": "lib/portable-net45+wp80+win8+wpa81/Newtonsoft.Json.dll",
      "length": 444928
    },
    {
      "fullName": "lib/portable-net45+wp80+win8+wpa81/Newtonsoft.Json.xml",
      "length": 448549
    },
    {
      "fullName": "tools/install.ps1",
      "length": 3229
    },
    {
      "fullName": "package/services/metadata/core-properties/87a0a4e28d50417ea282e20f81bc6477.psmdcp",
      "length": 735
    },
    {
      "fullName": "[Content_Types].xml",
      "length": 566
    },
    {
      "fullName": ".signature.p7s",
      "length": 9463
    }
  ]
}
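
For illustration only, a client for the proposed endpoint could look roughly like the following sketch (the URL template and response shape are the ones proposed above; nothing here exists on nuget.org today, and error handling is omitted):

// Sketch of a client for the *proposed* packageContents.json endpoint.
// The PackageEntry / PackageContents shapes mirror the tables above.
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

public class PackageEntry
{
    public string FullName { get; set; }
    public long Length { get; set; }
}

public class PackageContents
{
    public PackageEntry[] PackageEntries { get; set; }
}

public static class PackageContentsClient
{
    private static readonly HttpClient Http = new HttpClient();

    public static async Task<PackageContents> GetAsync(string packageContentBaseUrl, string id, string version)
    {
        // GET {@id}/{LOWER_ID}/{LOWER_VERSION}/packageContents.json
        var url = $"{packageContentBaseUrl}/{id.ToLowerInvariant()}/{version.ToLowerInvariant()}/packageContents.json";
        var json = await Http.GetStringAsync(url);
        var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };
        return JsonSerializer.Deserialize<PackageContents>(json, options);
    }
}

A build engine could then walk PackageEntries to decide whether the full .nupkg needs to be downloaded at all.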

Alternative way to expose the data:

Expose the packageContents leaf element on the [package details catalog items](https://docs.microsoft.com/en-us/nuget/api/catalog-resource#item-types). Since this is a new introduction, it would have to be an optional element.

This option hopefully should not add too much of a burden to existing service implementations, as the sibling property dependencyGroups is exposed here and is only available in the 'nuspec' file, which has to be extracted from the full zip file that is uploaded.

Use cases for this API

Build engines with fine grained dependency management

Build engines with fine-grained dependency management, like BuildXL and Bazel, and static-graph-based builds can benefit from having detailed file information for a NuGet package without downloading the package.

If, during graph construction, these build engines can download only the metadata of a package to obtain its semantics (for NuGet this is encoded in the .nuspec and the folder structure inside the .nupkg) without fully downloading the zip file, they can highly optimize the download and extraction of the consumed packages.

Frequently one doesn't build the entire tree, but instead passes a 'filter expression' to the build: building only certain projects and their downstream dependents and/or upstream dependencies, filtering by a particular aspect (codegen, compile, build, test, packaging, etc.), or targeting particular platforms or configurations. This allows the engine to skip downloading any packages that are not needed by the current build.

Since these engines work with fine-grained file dependencies, they can ensure the packages are downloaded 'just in time', when the dependents actually need them. For example, the first unit test usually starts somewhere halfway through the build, since it needs to wait until some of its code dependencies are compiled; the engine can delay downloading the packages needed to run the unit tests until there are resources available or until they are really needed to make build progress.

These engines have highly optimized schedulers that try to maximize machine utilization without overloading it. They are great at mixing CPU-heavy jobs with IO-heavy operations to reduce overall build times. Package restore is usually pretty IO-heavy, so the engine can interleave CPU-heavy tasks like C++ compilation with the downloading of the packages.

These engines also tend to work distributed, i.e. the build is spread over more than one computer (workers). Currently, if the restore has to happen before the build, the restore typically happens on every computer that is part of a distributed build; this can be up to 25 machines for large builds. That causes packages that are only used by one project to be downloaded on every worker machine, even though they are only consumed on the single machine where that one job runs. Having the engine control the download of the package allows the engine to download packages only on the machines that need them, and it can even optimize which jobs run on which machines.

Virtual file system package client

Virtual file system implementations are getting traction across various platforms.

Various dev experiences have been built on top of these virtual file systems. They vary from complete dev environments being virtualized (source, packages, intermediates, and outputs) to just certain components, such as only the source files; VfsForGit is one example.

One can envision a similar implementation for NuGet packages, where the client on restore would lay out virtual entry points for the expanded packages on disk without downloading the full archive. Only when one of the files of a given package is accessed would the virtual file system implementation download that particular .nupkg archive, extract it, and place it on disk. This would reduce the number of downloads.

Reverse file lookup helpers

One can envision a search tool that tries to find files in NuGet packages, for example to answer the question: which packages have System.Net.Http.dll embedded? (Hint: there are many.) Today that search operation would have to pull all .nupkg files from the server.

Workarounds

The workaround for not having this API is to partially download the zip file and extract the file list from there. This would be easy if the zip format put the file manifest at the start of the file, but it is stored at the end, so one has to use HTTP range requests. This is all doable, as MiniZip shows, but one has to either redo all the authentication and throttling logic that is implemented in NuGet.Packaging, or extend NuGet.Packaging to support this. One can also assume that there might be extra server load, since one needs to download more chunks of the zip file than strictly the file table, the server implementation might not be as efficient as when serving a single resource, and the range operations will likely bypass any caching layers in the HTTP stack.
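
To make that workaround concrete, here is a rough sketch of the range-request approach (my own illustration, not MiniZip's implementation). It assumes the server honors Range headers, the archive is not ZIP64, and the end-of-central-directory record sits within the last 64 KB of the file:

// Rough sketch: list entry names and uncompressed sizes of a remote .nupkg
// using HTTP range requests, without downloading the whole archive.
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

public static class RemoteZipLister
{
    private static readonly HttpClient Http = new HttpClient();

    public static async Task ListAsync(string nupkgUrl)
    {
        // 1. Find the total length of the archive.
        using var head = new HttpRequestMessage(HttpMethod.Head, nupkgUrl);
        var headResponse = await Http.SendAsync(head);
        long length = headResponse.Content.Headers.ContentLength
            ?? throw new InvalidOperationException("No Content-Length");

        // 2. Read the tail of the file and locate the end-of-central-directory record.
        long tailStart = Math.Max(0, length - 64 * 1024);
        byte[] tail = await GetRangeAsync(nupkgUrl, tailStart, length - 1);
        int eocd = LastIndexOfSignature(tail, 0x06054b50);

        int totalEntries = BitConverter.ToUInt16(tail, eocd + 10);   // total entry count
        uint cdSize = BitConverter.ToUInt32(tail, eocd + 12);        // central directory size
        uint cdOffset = BitConverter.ToUInt32(tail, eocd + 16);      // central directory offset

        // 3. Read the central directory and walk the file headers (fixed part is 46 bytes).
        byte[] cd = await GetRangeAsync(nupkgUrl, cdOffset, cdOffset + cdSize - 1);
        int pos = 0;
        for (int i = 0; i < totalEntries; i++)
        {
            uint uncompressedSize = BitConverter.ToUInt32(cd, pos + 24);
            int nameLength = BitConverter.ToUInt16(cd, pos + 28);
            int extraLength = BitConverter.ToUInt16(cd, pos + 30);
            int commentLength = BitConverter.ToUInt16(cd, pos + 32);
            string name = Encoding.UTF8.GetString(cd, pos + 46, nameLength);
            Console.WriteLine($"{name}: {uncompressedSize} bytes");
            pos += 46 + nameLength + extraLength + commentLength;
        }
    }

    private static async Task<byte[]> GetRangeAsync(string url, long from, long to)
    {
        using var request = new HttpRequestMessage(HttpMethod.Get, url);
        request.Headers.Range = new RangeHeaderValue(from, to);
        var response = await Http.SendAsync(request);
        return await response.Content.ReadAsByteArrayAsync();
    }

    private static int LastIndexOfSignature(byte[] buffer, uint signature)
    {
        for (int i = buffer.Length - 4; i >= 0; i--)
        {
            if (BitConverter.ToUInt32(buffer, i) == signature) return i;
        }
        throw new InvalidOperationException("End-of-central-directory record not found");
    }
}

A real implementation would also need to handle ZIP64 archives, retries, and authentication, which is exactly the logic this proposal would let clients skip.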

Potential Future extensions

Include file hash

Note: this is not a request, and I realize this can be expensive. Merely hypothesizing for future use cases

In the future, each file in the returned file list could also carry an optional content hash (algorithm to be determined):

| Name | Type | Required | Notes |
|------|------|----------|-------|
| fileHash | string | no | The hash of the file, encoded using standard base64 |
| fileHashAlgorithm | string | no | |

Potentially the fileHashAlgorithm can be a property of packageContents to not have to replicate it so many times.

This can help build engines with reliable cache implementations perform cache lookups without having to download the .nupkg, allowing them to check whether the results are already in the cache. For example, if a build step consumes 'system.xyz.dll' (with hash 'hXYZ') from a NuGet package together with the source file 'a.cs' (with hash 'hACS'), a build engine with a cache can check whether the local cache (or the remote shared cache) already contains the output file 'a.dll', further reducing NuGet downloads.
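
As an illustration of what the server (or a client with an extracted package) would compute, assuming SHA-256 were the chosen algorithm:

// Hypothetical: produce the proposed fileHash / fileHashAlgorithm values for
// one extracted package entry, assuming SHA-256 is the chosen algorithm.
using System;
using System.IO;
using System.Security.Cryptography;

static (string fileHash, string fileHashAlgorithm) HashEntry(string extractedFilePath)
{
    using var sha256 = SHA256.Create();
    using var stream = File.OpenRead(extractedFilePath);
    byte[] hash = sha256.ComputeHash(stream);
    // Base64-encode the raw hash bytes, matching the 'standard base64' note above.
    return (Convert.ToBase64String(hash), "SHA256");
}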

Individual file download

Note: this is not a request, and I realize this can be expensive. Merely hypothesizing for future use cases

Often only a few files are needed from a package. Build engines (or NuGet clients) could decide to optimize their workflow by downloading either individual files or the whole archive.

| Name | Type | Required | Notes |
|------|------|----------|-------|
| downloadUrl | string | no | The URL where this file can be downloaded from |

joelverhagen commented 4 years ago

Thanks for the detailed write-up, Danny! Tagging @jeffkl since he has good context on both sides.

Point-by-point

One can also assume that there might be extra server load, since one needs to download more chunks of the zip file than strictly the file table, the server implementation might not be as efficient as when serving a single resource, and the range operations will likely bypass any caching layers in the HTTP stack.

From what I can tell our CDN handles range requests so it should be improved there. However client-side caching -- you're right it very well may fall over depending on the implementation.

Build engines with fine grained dependency management

It sounds like this is the scenario you are particularly interested in. Could you help me understand how the build would determine whether a package is used if it knows the list of files? In particular, is BuildXL aware of TFM compatibility and applicability of assets?

Could you provide a couple examples of when a package would or wouldn't be needed for a build?

One that comes to mind is if a package is used as a transitive dependency of an .exe you are building and the assemblies in that package would only be needed for runtime, not compile time. Is this what you are thinking?

Include file hash Individual file download

These two are more difficult requests since they require extracting all files of the ZIP. The file download URL is also tricky since this would greatly increase our storage consumption.

General feedback

In general, it sounds like discovering the list of files in a package could have an algorithm like this:

  1. If the package is already local, use that .nupkg
  2. Check if the source supports packageContents.json
    • This could be indicated by something like PackageBaseAddress/3.1.0 in the service index
    • If so, download the listing file
  3. Check if the source supports range requests on the .nupkg with a HEAD
    • If so, do a trick like MiniZip or a generic seekable HTTP stream implementation
  4. Download the .nupkg
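
Sketched in code (every resource check and helper below is hypothetical; only step 4 matches what clients do today):

// Hypothetical sketch of the fallback chain above; none of these helpers exist
// in NuGet.Client today, and the packageContents.json resource is only proposed.
public static async Task<IReadOnlyList<PackageEntry>> GetFileListAsync(
    PackageSource source, string id, string version)
{
    // 1. If the package is already local, read the file list from the local .nupkg.
    if (TryGetLocalNupkg(id, version, out string localPath))
        return ReadEntriesFromNupkg(localPath);

    // 2. If the service index advertises the new resource, download the listing file.
    if (await source.SupportsPackageContentsAsync())
        return await source.GetPackageContentsAsync(id, version);

    // 3. If a HEAD request shows the source accepts range requests on the .nupkg,
    //    read the zip central directory MiniZip-style.
    if (await source.SupportsRangeRequestsAsync(id, version))
        return await ReadEntriesViaRangeRequestsAsync(source, id, version);

    // 4. Fall back to downloading the whole .nupkg.
    string downloadedPath = await source.DownloadNupkgAsync(id, version);
    return ReadEntriesFromNupkg(downloadedPath);
}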

This would need to be tested for performance given varying sources, geos, and package sizes. For example a 50kb package should maybe just be fully downloaded.

Another thing to note is that V3 feeds implement new protocols slowly over time so any client that wants to utilize such a resource should have fallback behavior anyway.

dannyvv commented 4 years ago

Build engines with fine grained dependency management

It sounds like this is the scenario you are particularly interested in. Could you help me understand how the build would determine whether a package is used if it knows the list of files? In particular, is BuildXL aware of TFM compatibility and applicability of assets?

Correct. BuildXL uses DScript as the build specification. DScript has a notion of qualifiers, an extensible system for tfm, configuration, platform, rid, or whatever the target language requires. It performs type-safety checks to ensure that you don't refer to 'x64' code from 'x86' code in the DScript language. Our NuGet integration emits DScript, using the PackageReaderBase class from NuGet.Packaging to extract the semantics of which files need to be copied for runtime, which files are needed for compile time, which analyzers, which content, etc. (I'm actually in the process of revamping this by adding better support for the latest NuGet features.)

Could you provide a couple examples of when a package would or wouldn't be needed for a build?

One that comes to mind is if a package is used as a transitive dependency of an .exe you are building and the assemblies in that package would only be needed for runtime, not compile time. Is this what you are thinking?

This is indeed one example. We have many 'tools' NuGet packages like Microsoft.Net.Compilers, as well as internally packaged versions of many tools: MsVsc compilers, the Windows Sdk, PowerShell.Core, etc.

The build engine has two phases: a scheduling phase and an execution phase. In the scheduling phase we evaluate DScript, understand NuGet, parse Ninja files, understand MSBuild projects, etc. Out of this comes a big graph with file-based dependencies. Each node in the graph is a process execution with command-line args, environment variables, etc., as well as the files it will read and is expected to write. So for a managed unit test project, ResGen.exe, Csc.exe, each file copy, and the xunit invocation are all separate nodes in the graph.

A build graph can be constructed from multiple qualifiers as well, i.e. you can mix x64, x86, net451, netcoreapp30, etc. all in a single graph. Any processes that don't require a platform (i.e. codegen, documentation, or AnyCPU) will actually share nodes between the qualifiers in the same graph, so there is no duplicate work.

So qualifiers are one way to filter the build graph down to what you want to build, but one can also ask the build engine to simply build a single output file, a single project, etc. In those cases the whole graph will be constructed, and then we inspect the graph. For instance, if I ask to produce /f:output='out\bin\debug\win-x64\BuildXL.Utilities.dll', only that file will be produced and only the tools needed to produce that file will be run. Since this project doesn't depend on any native code, we don't have to pull in the MsVsc NuGet package, nor the Windows Sdk, nor XUnit for the unit test projects, etc.

So the typical 'clone, build a component' dev loop of the BuildXL selfhost usually doesn't need all 639 NuGet packages: about 1.3 GB of .nupkg files, roughly 43k files and about 8 GB once extracted.

Include file hash Individual file download

These two are more difficult requests since they require extracting all files of the ZIP. The file download URL is also tricky since this would greatly increase our storage consumption.

Totally understood. Hence, under 'potential future extensions', I added a note to make it clearer for other readers.

General feedback

In general, it sounds like discovering the list of files in a package could have an algorithm like this:

  1. If package is already local, use that .nupkg
  2. Check if the source supports packageContents.json

    • This could be indicated by something like PackageBaseAddress/3.1.0 in the service index
    • If so, download the listing file
  3. Check if the source supports range requests on the .nupkg with a HEAD

    • If so, do a trick like MiniZip or a generic seekable HTTP stream implementation
  4. Download the .nupkg

This would need to be tested for performance given varying sources, geos, and package sizes. For example a 50kb package should maybe just be fully downloaded.

Another thing to note is that V3 feeds implement new protocols slowly over time so any client that wants to utilize such a resource should have fallback behavior anyway.

Totally understood. The implementation for now will only be in the BuildXL repo, where our selfhost is currently the only customer. We only rely on nuget.org (for the public build) and azuredevops (for the internal build) for now, so I'll have to handle the fallback, but I can be aggressive (i.e. all or nothing) in my client implementation.

skofman1 commented 4 years ago

@dannyvv , thanks for the proposal! I will add this to our backlog. Please note that the team is busy working on other features, so if this is time sensitive, let's talk about contributing to nuget.org.

dannyvv commented 4 years ago

@dannyvv , thanks for the proposal! I will add this to our backlog. Please note that the team is busy working on other features, so if this is time sensitive, let's talk about contributing to nuget.org.

@skofman1: This is not time sensitive at all. I had offline discussions with @joelverhagen and he suggested recording the feature request. I can accomplish my goals per the 'workaround' section in the proposal.

Note: I'd be happy to contribute if the proposal is approved, and to contribute it to AzureDevops as well.

anonhostpi commented 10 months ago

Any update on this?

I've considered using the C# implementation of PartialZip, but nuget apparently doesn't accept byte ranges in http requests.

Exposing the .nupkg's central directory record would allow for parsing the .nupkg's file records without ever actually unpacking it.

anonhostpi commented 10 months ago

Also, if you include the full path in the listed entries, it should allow build/import utilities (like my Import-Package for PowerShell) to determine if there are any runtime-id-specific (OS+arch-specific) files in the remote .nupkg.

It would also help avoid downloading unnecessary packages with _._ files.

joelverhagen commented 10 months ago

@anonhostpi, you can try my side project MiniZip which uses range requests against NuGet's V3 API. Range requests are not a documented part of the V3 protocol (i.e. NuGet client does not use them) but they are currently supported via our Azure blob storage CDN origin.

https://github.com/joelverhagen/MiniZip

Although I am part of the NuGet team, this MiniZip project is not a NuGet project (it's my own side project). It's served me and others well for NuGet package analysis, however.

anonhostpi commented 10 months ago

I'll take a look at that. It's quite possible that PartialZip was just dated.

loic-sharma commented 10 months ago

In case it helps anyone, I created https://github.com/loic-sharma/NuGet.Assembly/ which extracts assemblies from NuGet packages into a content-addressable store. In 2019 this could process all of nuget.org in a few hours. The Try .NET team and I used this to experiment with READMEs that contained runnable C# scripts that depend on NuGet packages. One could use this to create your own API of packages' contents.

anonhostpi commented 10 months ago

@loic-sharma

In case it helps anyone, I created https://github.com/loic-sharma/NuGet.Assembly/ which extracts assemblies from NuGet packages into a content-addressable store. In 2019 this could process all of nuget.org in a few hours. The Try .NET team and I used this to experiment with READMEs that contained runnable C# scripts that depend on NuGet packages. One could use this to create your own API of packages' contents.

Not bad. It appears that its main mechanism is asynchronous extraction. Does it do partial extraction, full extraction, or both?

anonhostpi commented 10 months ago

@joelverhagen

@anonhostpi, you can try my side project MiniZip which uses range requests against NuGet's V3 API. Range requests are not a documented part of the V3 protocol (i.e. NuGet client does not use them) but they are currently supported via our Azure blob storage CDN origin.

https://github.com/joelverhagen/MiniZip

Although I am part of the NuGet team, this MiniZip project is not a NuGet project (it's my own side project). It's served me and others well for NuGet package analysis, however.

Wow. I am very impressed with how thorough this library is. Since it is independently developed, I have to ask: how well do you expect to be able to keep this library maintained?

anonhostpi commented 10 months ago

@joelverhagen

I love it. I transpiled your README example to PowerShell (+Import-Package module):

# Import-Module Import-Package
Import-Package Knapcode.MiniZip # installs and loads Knapcode.MiniZip from PackageManagement

$url =  "https://api.nuget.org/v3-flatcontainer/newtonsoft.json/10.0.3/newtonsoft.json.10.0.3.nupkg"

$client = [System.Net.Http.HttpClient]::new()
$provider = [Knapcode.MiniZip.HttpZipProvider]::new($client)

$reader = $provider.GetReaderAsync( $url )
$reader = $reader.GetAwaiter().GetResult()

$dir_records = $reader.ReadAsync().GetAwaiter().GetResult()
$dir_records.Entries | % { [Knapcode.MiniZip.ZipEntryExtensions]::GetName($_) }

Very well done sir!

joelverhagen commented 10 months ago

Since it is independently developed, I have to ask: how well do you expect to be able to keep this library maintained?

@anonhostpi, I don't have any specific plans to stop maintaining it but both HTTP and ZIP are very stable so maintenance in my mind is pretty basic (adding features as I need them, accepting/reviewing PRs, and fixing bugs that bother me, etc.)

Also, I'm assuming that it is designed to work with any remote zip (just primarily focused on NuGet's API)?

Yes. I've tried it with Firefox extensions also and it worked fine.

Let's try to reduce noise on this thread (which is very specific to NuGet V3 and a new endpoint) and talk about MiniZip on that repo. I don't want to dilute the original proposal which is independent from a client library like MiniZip. Feel free to open an issue on that repo with follow-up questions and we can chat all you want.