Open dannyvv opened 4 years ago
Thanks for the detailed write-up, Danny! Tagging @jeffkl since he has good context on both sides.
One can also assume that there might be extra server load, as one needs to download multiple chunks of the zip file just to read the file table; the server implementation might not be as efficient as it is for downloading a single resource, and the range operations will likely bypass any caching layers in the HTTP stack.
From what I can tell our CDN handles range requests, so the server-load concern should be mitigated there. However, client-side caching -- you're right, it very well may fall over depending on the implementation.
Build engines with fine grained dependency management
It sounds like this is the scenario you are particularly interested in. Could you help me understand how the build would determine whether a package is used if it knows the list of files? In particular, is BuildXL aware of TFM compatibility and applicability of assets?
Could you provide a couple examples of when a package would or wouldn't be needed for a build?
One that comes to mind is if a package is used as a transitive dependency of an .exe you are building and the assemblies in that package would only be needed for runtime, not compile time. Is this what you are thinking?
Include file hash / Individual file download
These two are more difficult requests since they require extracting all files of the ZIP. The file download URL is also tricky since this would greatly increase our storage consumption.
General feedback
In general, it sounds like discovering the list of files in a package could have an algorithm like this:
- If package is already local, use that .nupkg
- Check if the package has `packageContents.json` support
  - This could be indicated by something like `PackageBaseAddress/3.1.0` in the service index
  - If so, download the listing file
- Check if the source supports range requests on the .nupkg with a HEAD
  - If so, do a trick like MiniZip or a generic seekable HTTP stream implementation
- Download the .nupkg

This would need to be tested for performance given varying sources, geos, and package sizes. For example a 50kb package should maybe just be fully downloaded.
Another thing to note is that V3 feeds implement new protocols slowly over time so any client that wants to utilize such a resource should have fallback behavior anyway.
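A rough sketch of that fallback chain in C# (an illustration only: the listing resource is hypothetical, so that step stays a comment; the range-request step uses the Knapcode.MiniZip types that come up later in this thread):

```csharp
// Sketch of the fallback chain above. The proposed listing resource does not
// exist yet; the range-request path uses Knapcode.MiniZip as one possible trick.
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using Knapcode.MiniZip;

static class FileListDiscovery
{
    public static async Task<IReadOnlyList<string>> GetFileListAsync(
        HttpClient http, string nupkgUrl, string localNupkgPath)
    {
        // 1. If the package is already local, just read the local zip.
        if (File.Exists(localNupkgPath))
        {
            using var local = ZipFile.OpenRead(localNupkgPath);
            return local.Entries.Select(e => e.FullName).ToList();
        }

        // 2. (Hypothetical) If the source advertises the proposed listing
        //    resource in its service index, download the listing document here.

        // 3. If the source answers range requests, read only the central directory.
        using (var head = new HttpRequestMessage(HttpMethod.Head, nupkgUrl))
        {
            var response = await http.SendAsync(head);
            if (response.Headers.AcceptRanges.Contains("bytes"))
            {
                var provider = new HttpZipProvider(http);
                var reader = await provider.GetReaderAsync(new Uri(nupkgUrl));
                var directory = await reader.ReadAsync();
                return directory.Entries
                    .Select(e => ZipEntryExtensions.GetName(e))
                    .ToList();
            }
        }

        // 4. Otherwise fall back to downloading the whole .nupkg.
        var bytes = await http.GetByteArrayAsync(nupkgUrl);
        using var zip = new ZipArchive(new MemoryStream(bytes), ZipArchiveMode.Read);
        return zip.Entries.Select(e => e.FullName).ToList();
    }
}
```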
Build engines with fine grained dependency management
It sounds like this is the scenario you are particularly interested in. Could you help me understand how the build would determine whether a package is used if it knows the list of files? In particular, is BuildXL aware of TFM compatibility and applicability of assets?
Correct. BuildXL uses DScript as the build specification. DScript has a notion of qualifiers. This is an extensible system for tfm, configuration, platform, rid, or whatever the target language requires. It performs type safety checks to ensure that you don't refer to 'x64' code from 'x86' code in the DScript language. Our NuGet integration emits DScript, using the PackageReaderBase class from NuGet.Packaging to extract the semantics of which files need to be copied for runtime, which files are needed for compile time, which analyzers, which content, etc. (I'm actually in the process of revamping this by adding better support for the latest nuget features.)
Could you provide a couple examples of when a package would or wouldn't be needed for a build?
One that comes to mind is if a package is used as a transitive dependency of an .exe you are building and the assemblies in that package would only be needed for runtime, not compile time. Is this what you are thinking?
This is indeed one example. We have many 'tools' nuget packages like Microsoft.Net.Compilers, as well as internally packaged versions of many tools: the MSVC compilers, the Windows SDK, PowerShell.Core, etc.
The build engine has 2 phases: a scheduling phase and an execution phase. In the scheduling phase we evaluate DScript, understand nuget, parse ninja files, understand msbuild projects, etc. Out of this comes a big graph with file based dependencies. Each node in the graph is a process execution with command line args, environment variables, etc., as well as the files it will read and is expected to write. So for a managed unittest project, ResGen.exe, Csc.exe, each file copy, and the xunit invocation are all separate nodes in the graph.
A build graph can be constructed from multiple qualifiers as well, i.e. you can mix x64, x86, net451, netcoreapp30, etc. all in a single graph. If you have any processes that don't require a platform (i.e. codegen, documentation, or anycpu) they will actually share nodes between the qualifiers in the same graph, so no work is duplicated.
So qualifiers are one way to filter the build graph down to what you want to build. But one can also ask the build engine to simply build a single output file, a single project, etc. In those cases the whole graph is still constructed, and then we inspect the graph. For instance, if I ask to produce /f:output='out\bin\debug\win-x64\BuildXL.Utilities.dll', only that file will be produced and only the tools needed to produce that file will be run. Since this project doesn't depend on any native code, we don't have to pull in the MSVC nuget package, nor the Windows SDK, nor XUnit for the unittest projects, etc.
So the typical clone-and-build-a-component dev loop of the BuildXL selfhost usually doesn't need all 639 nuget packages, which add up to about 1.3 GB of .nupkg files and roughly 43k files (~8 GB) once extracted.
Include file hash / Individual file download
These two are more difficult requests since they require extracting all files of the ZIP. The file download URL is also tricky since this would greatly increase our storage consumption.
Totally understood. Hence under 'potential future extensions' I added a note to make it clearer for other readers.
General feedback
In general, it sounds like discovering the list of files in a package could have an algorithm like this:
- If package is already local, use that .nupkg
- Check if the package has `packageContents.json` support
  - This could be indicated by something like `PackageBaseAddress/3.1.0` in the service index
  - If so, download the listing file
- Check if the source supports range requests on the .nupkg with a HEAD
  - If so, do a trick like MiniZip or a generic seekable HTTP stream implementation
- Download the .nupkg
This would need to be tested for performance given varying sources, geos, and package sizes. For example a 50kb package should maybe just be fully downloaded.
Another thing to note is that V3 feeds implement new protocols slowly over time so any client that wants to utilize such a resource should have fallback behavior anyway.
Totally understood. The implementation for now will only be in the BuildXL repo, where our selfhost is at the moment the only customer. We only rely on nuget.org (for the public build) and Azure DevOps (for the internal build) for now. So I'll have to handle the fallback, but I can be aggressive (i.e. all or nothing) in my client implementation.
@dannyvv , thanks for the proposal! I will add this to our backlog. Please note that the team is busy working on other features, so if this is time sensitive, let's talk about contributing to nuget.org.
@skofman1: This is not time sensitive at all. I had offline discussions with @joelverhagen and he suggested to record the feature request. I can accomplish my goals per the 'workaround' section in the proposal.
Note: I'd be happy to contribute if the proposal is approved, and to contribute it to Azure DevOps as well.
Any update on this?
I've considered using the C# implementation of PartialZip, but NuGet apparently doesn't accept byte ranges in HTTP requests.
Exposing the .nupkg's central directory record would allow for parsing the .nupkg's file records without ever actually unpacking it.
Also, if you include the full path in the listed entries, it should allow build/import utilities (like my Import-Package for PowerShell) to determine if there are any runtime-id-specific (OS+arch-specific) files in the remote .nupkg.
It would also help mitigate downloading any unnecessary packages with `_._` files.
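As an illustration, a file listing alone would already support checks like the following (a sketch: the `runtimes/<rid>/` layout and `_._` placeholder conventions are standard NuGet, but these helper functions are hypothetical):

```csharp
// Sketch: answer RID-specificity and placeholder questions from a file list
// alone, without downloading or extracting the .nupkg.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class PackageListingChecks
{
    // True if the package ships OS/arch-specific assets under runtimes/<rid>/.
    public static bool HasRidSpecificAssets(IEnumerable<string> entries) =>
        entries.Any(e => e.StartsWith("runtimes/", StringComparison.OrdinalIgnoreCase));

    // True if the given folder (e.g. "lib/net45") contains only the "_._" placeholder.
    public static bool IsPlaceholderOnly(IEnumerable<string> entries, string folder) =>
        entries.Where(e => e.StartsWith(folder + "/", StringComparison.OrdinalIgnoreCase))
               .All(e => Path.GetFileName(e) == "_._");
}
```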
@anonhostpi, you can try my side project MiniZip which uses range requests against NuGet's V3 API. Range requests are not a documented part of the V3 protocol (i.e. NuGet client does not use them) but they are currently supported via our Azure blob storage CDN origin.
https://github.com/joelverhagen/MiniZip
Although I am part of the NuGet team, this MiniZip project is not a NuGet project (it's my own side project). It's served me and others well for NuGet package analysis, however.
I'll take a look at that. It's quite possible that PartialZip was just dated.
In case it helps anyone, I created https://github.com/loic-sharma/NuGet.Assembly/ which extracts assemblies from NuGet packages into a content-addressable store. In 2019 this could process all of nuget.org in a few hours. The Try .NET team and I used this to experiment with READMEs that contained runnable C# scripts that depend on NuGet packages. One could use this to create your own API of packages' contents.
@loic-sharma
In case it helps anyone, I created https://github.com/loic-sharma/NuGet.Assembly/ which extracts assemblies from NuGet packages into a content-addressable store. In 2019 this could process all of nuget.org in a few hours. The Try .NET team and I used this to experiment with READMEs that contained runnable C# scripts that depend on NuGet packages. One could use this to create your own API of packages' contents.
Not bad. It appears that its main mechanism is asynchronous extraction. Does it do partial extraction, full extraction, or both?
@joelverhagen
@anonhostpi, you can try my side project MiniZip which uses range requests against NuGet's V3 API. Range requests are not a documented part of the V3 protocol (i.e. NuGet client does not use them) but they are currently supported via our Azure blob storage CDN origin.
https://github.com/joelverhagen/MiniZip
Although I am part of the NuGet team, this MiniZip project is not a NuGet project (it's my own side project). It's served me and others well for NuGet package analysis, however.
Wow. I am very impressed with how thorough this library is. Since it is independently developed, I have to ask: how well do you expect to be able to keep this library maintained?
@joelverhagen
I love it. I transpiled your README example to PowerShell (+Import-Package module):
```powershell
# Import-Module Import-Package
Import-Package Knapcode.MiniZip # installs and loads Knapcode.MiniZip from PackageManagement
$url = "https://api.nuget.org/v3-flatcontainer/newtonsoft.json/10.0.3/newtonsoft.json.10.0.3.nupkg"
$client = [System.Net.Http.HttpClient]::new()
$provider = [Knapcode.MiniZip.HttpZipProvider]::new($client)
$reader = $provider.GetReaderAsync( $url )
$reader = $reader.GetAwaiter().GetResult()
$dir_records = $reader.ReadAsync().GetAwaiter().GetResult()
$dir_records.Entries | % { [Knapcode.MiniZip.ZipEntryExtensions]::GetName($_) }
```
Very well done sir!
Since it is independently developed, I have to ask: how well do you expect to be able to keep this library maintained?
@anonhostpi, I don't have any specific plans to stop maintaining it but both HTTP and ZIP are very stable so maintenance in my mind is pretty basic (adding features as I need them, accepting/reviewing PRs, and fixing bugs that bother me, etc.)
Also, I'm assuming that it is designed to work with any remote zip (just primarily focused on NuGet's API)?
Yes. I've tried it with Firefox extensions also and it worked fine.
Let's try to reduce noise on this thread (which is very specific to NuGet V3 and a new endpoint) and talk about MiniZip on that repo. I don't want to dilute the original proposal which is independent from a client library like MiniZip. Feel free to open an issue on that repo with follow-up questions and we can chat all you want.
Add api endpoint to retrieve the file list of a NuGet package
This is a proposal to expose the list of files (and their file sizes) of a NuGet package via the NuGet server API.
Motivation
There are various use cases for having the list of files in a NuGet package; this proposal lists them in detail below. The immediate motivation is to optimize builds: instead of downloading all packages before the build, the build engine acts as the NuGet client and downloads only the packages that are actually used by the build, interleaving the downloads with the build itself.
This might benefit other clients as well: if the list of files is exposed, one can create an implementation of NuGet.Packaging.PackageReaderBase that doesn't need to download the whole .nupkg file; it just needs to use this proposed API and the already-exposed .nuspec file.
Spec
This will add a new endpoint to the Package Content family of endpoints. The spec already defines downloading of the `.nupkg` file and the `.nuspec` file. There have also been extensions proposed on this API for icons and licenses. So this adds an extra entry: a new JSON document served from a Package Content based URL.
This will return a JSON document which contains a property called `packageEntries`. The `packageEntries` element is a JSON array of JSON objects, each object representing a `packageEntry`.

The `packageEntry` leaf element is a JSON object with the following properties:

The contents of `fullName` must use `/` as a path separator to match the paths for IPackageCoreReader. The `fullName` must also match the final extracted, laid-out-on-disk format. NuGet uses custom path encoding in the zip files; the paths here should be unencoded, i.e. the API should return `lib/portable-net40+sl5+wp80+win8+wpa81/Newtonsoft.Json.dll`, not `lib/portable-net40%2Bsl5%2Bwp80%2Bwin8%2Bwpa81/Newtonsoft.Json.dll` as it is in the zip file.

Sample request
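As an illustration only (the exact URL shape and JSON schema are not finalized by this proposal, so the path, the `length` property, and the values below are hypothetical), a client call might look like this:

```csharp
// Illustration only: the listing URL and the JSON response shape are hypothetical.
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

class SampleRequest
{
    static async Task Main()
    {
        // Hypothetical flat-container style URL for the proposed listing document.
        var url = "https://api.nuget.org/v3-flatcontainer/newtonsoft.json/10.0.3/newtonsoft.json.10.0.3.packagecontents.json";

        using var http = new HttpClient();
        using var doc = JsonDocument.Parse(await http.GetStringAsync(url));

        // Assumed response shape:
        // { "packageEntries": [ { "fullName": "lib/net45/Newtonsoft.Json.dll", "length": 123456 }, ... ] }
        foreach (var entry in doc.RootElement.GetProperty("packageEntries").EnumerateArray())
        {
            Console.WriteLine($"{entry.GetProperty("fullName").GetString()} " +
                              $"({entry.GetProperty("length").GetInt64()} bytes)");
        }
    }
}
```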
Use cases of this API
Build engines with fine grained dependency management
Build engines with fine grained dependency management and static graphs, like BuildXL and Bazel, can benefit from having detailed file information about a NuGet package without downloading the package.
For these kinds of build engines, if during graph construction they can download only the metadata of a package to obtain its semantics (for NuGet this is encoded in the nuspec and the folder structure inside the package) without fully downloading the zip file, they can highly optimize the download and extraction of the consumed packages.
Frequently one doesn't build the entire tree but instead passes a 'filter expression' to the build: building only certain projects and their downstream dependents and/or upstream dependencies, filtering by a particular aspect (codegen, compile, build, test, packaging, etc.), or building only particular platforms or configurations. This allows the engine to avoid downloading any packages that are not needed by the current build.
Since these engines work with fine grained file dependencies they can ensure the packages are downloaded 'just in time', when the dependents actually need them. For example, the first unittest usually starts somewhere halfway through the build, since it has to wait until some of its code dependencies are compiled; the engine can delay downloading the packages needed to run the unittests until there are resources available or until they are really needed to make build progress.
These engines have highly optimized schedulers that try to maximize the machine utilization but not overload it. They are great at mixing CPU heavy jobs with IO heavy operations to reduce overall build times. Package restore is usually pretty IO heavy, so the engine can interleave CPU heavy tasks like C++ compilation with the downloading of the packages.
These engines also tend to work distributed, i.e. the build is spread over more than one computer (workers); this can be up to 25 machines for large builds. Currently, if the restore has to happen before the build, the restore typically happens on every computer that is part of a distributed build. This causes packages that are only used by one project to be downloaded on every worker machine, even though they are only consumed on the single machine where the one job that needs them runs. Having the engine control the download of the packages allows it to download packages only on the machines that need them, and it can even optimize the distribution of which jobs run on which machines.
Virtual file system package client
Virtual File System implementations are getting traction across various platforms.
Various dev experiences have been built on top of these virtual file systems. These vary from complete dev environments being virtualized (source, packages, intermediates, and outputs) to just certain components, like just the source files, for example VfsForGit.
One can envision a similar implementation for NuGet packages as well, where on restore the client would lay out virtual entry points for the expanded packages on disk without downloading the full archive. Only when any of the files of a given package is accessed by a client would the virtual filesystem implementation download that particular nupkg archive, extract it, and place it on disk. This would reduce the number of downloads.
Reverse file lookup helpers
One can envision a search tool that tries to find files in NuGet packages, for example to answer the question: which packages have `System.Net.Http.dll` embedded? (Hint: there are many :) ) Today that search operation would have to pull all .nupkg files from the server.

Workarounds
The workaround for not having this API is to partially download the zip file and extract the file list from there. This would be pretty easy to do if the zip file had the file manifest at the start of the file, but the zip file format puts it at the end of the file; therefore one has to use HTTP range queries. This is all doable, as MiniZip shows, but one has to either redo all the authentication and throttling logic that is implemented in NuGet.Packaging, or extend NuGet.Packaging to support this. One can also assume that there might be extra server load, as one needs to download multiple chunks of the zip file just to read the file table; the server implementation might not be as efficient as it is for downloading a single resource, and the range operations will likely bypass any caching layers in the HTTP stack.
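A minimal sketch of that range-request workaround, assuming no archive comment and no ZIP64 records (both of which a real implementation like MiniZip handles): the file list lives in the zip's central directory, so a HEAD plus two range requests is enough to fetch it.

```csharp
// Minimal sketch: read only the ZIP end-of-central-directory record and the
// central directory via HTTP range requests. Assumes no comment and no ZIP64.
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

static class PartialNupkg
{
    static async Task<byte[]> GetRangeAsync(HttpClient http, string url, long from, long length)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, url);
        request.Headers.Range = new RangeHeaderValue(from, from + length - 1);
        var response = await http.SendAsync(request);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsByteArrayAsync();
    }

    public static async Task<byte[]> ReadCentralDirectoryAsync(HttpClient http, string url)
    {
        // 1. HEAD request to learn the total .nupkg size.
        var head = await http.SendAsync(new HttpRequestMessage(HttpMethod.Head, url));
        long size = head.Content.Headers.ContentLength
                    ?? throw new InvalidOperationException("No Content-Length.");

        // 2. Fetch the 22-byte end-of-central-directory record from the end of the file.
        var eocd = await GetRangeAsync(http, url, size - 22, 22);
        if (BitConverter.ToUInt32(eocd, 0) != 0x06054b50)
            throw new InvalidOperationException("EOCD not found (archive comment or ZIP64?).");

        // 3. Fetch just the central directory, which lists every entry and its sizes.
        uint cdSize = BitConverter.ToUInt32(eocd, 12);
        uint cdOffset = BitConverter.ToUInt32(eocd, 16);
        return await GetRangeAsync(http, url, cdOffset, cdSize);
    }
}
```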
Potential Future extensions
Include file hash
In the future, each file in the returned file list could also carry an optional content hash (which algorithm to use is TBD).
This can help build engines with reliable cache implementations perform cache lookups without having to download the nupkg at all, allowing them to check whether they already have the results in the cache. For example, if a NuGet package contains 'system.xyz.dll' with hash 'hXYZ' and a compilation takes file 'a.cs' with hash 'hACS' as input, a build engine with a cache can check whether the local cache (or the remote shared cache) already contains the output file 'a.dll', further reducing NuGet downloads.
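A minimal sketch of that idea (the fingerprinting scheme below is illustrative only, not BuildXL's actual cache algorithm):

```csharp
// Sketch: build a cache fingerprint from input hashes, where the hash of
// system.xyz.dll comes from the proposed file listing rather than a download.
using System;
using System.Security.Cryptography;
using System.Text;

static class CacheFingerprint
{
    public static string Compute(params string[] inputHashes)
    {
        // e.g. Compute("hXYZ" /* system.xyz.dll, from the listing */,
        //              "hACS" /* a.cs, hashed locally */)
        var combined = string.Join("|", inputHashes);
        return Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(combined)));
    }
}
// If the cache already has an entry for this fingerprint, the cached a.dll can
// be reused and both the compile step and the package download can be skipped.
```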
Individual file download
Often only a few files are needed from a package. Build engines (or NuGet clients) could decide to optimize their workflow by downloading either individual files or the whole archive.