NuGet / NuGetGallery

NuGet Gallery is a package repository that powers https://www.nuget.org. Use this repo for reporting NuGet.org issues.

[Feature] Increase the package size limit on NuGet.org from 250 MB #9473

Status: Open · AsakusaRinne opened this issue 1 year ago

AsakusaRinne commented 1 year ago

NuGet Product(s) Affected

NuGet.exe, Other/NA

Current Behavior

NuGet packages have a size limit of 250 MB on NuGet.org.

Desired Behavior

Increase the limit to 500 MB or higher.

Additional Context

As mentioned in https://github.com/NuGet/Home/issues/6208#issuecomment-1516206607, several years have passed and it's now easy for machine learning packages to exceed 250 MB. For example, I'm one of the authors of Tensorflow.NET, and our Linux CUDA binary reaches 400 MB (after NuGet compression). It's a big inconvenience if users have to download the binary themselves and find the right place to put it.

Please 👍 or 👎 this comment to help us with the direction of this feature & leave as much feedback/questions/concerns as you'd like on this issue itself and we will get back to you shortly.

joelverhagen commented 1 year ago

I'm going to move this to NuGet/NuGetGallery since the 250 MB restriction is enforced on NuGet.org, not by client tooling.

joelverhagen commented 1 year ago

@AsakusaRinne, storing ML models on NuGet.org hasn't been fully thought through or designed from the service and tooling standpoint. Generally speaking, our software works best for our primary use case: .NET packages containing a relatively small payload. Currently, packages in the 150 MB+ range account for less than 0.1% of all packages on NuGet.org:

This is the current distribution (all sizes in MB):

| P25 | P50 | P75 | P90 | P95 | P99 | P99.9 |
|------|------|------|------|------|-------|--------|
| 0.02 | 0.04 | 0.17 | 1.07 | 4.37 | 34.44 | 144.57 |

I use this to demonstrate how NuGet.org is being used today. More than 95% of all packages are less than 5 MB (i.e. very far from the current limit of 250 MB).

I'll give you some examples of things that don't work as well for very large packages:

  1. Package upload is non-resumable. The entire upload must succeed in one attempt, which is challenging for folks with less reliable or lower-bandwidth connections. A chunked or block upload model could help here; Azure Blob Storage implements this with blocks, but it's not supported in the package upload protocol or tooling.
  2. Some of our backend jobs won't scale well for very large packages. For example, we have some cross-region uploads which (similar to item 1) require the entire upload to succeed in one go.
  3. The user packages folder populated by client tooling currently has no automatic clean-up mechanism, which already leads to a lot of disk consumption. The situation could become much worse for large packages that release new versions frequently. Ideally there would be an LRU or automatic purging process for very big packages, or packages the consumer updates very frequently, to reduce unnecessary disk space consumption.
  4. Similar to item 1, package downloads are non-resumable. The entire package download must succeed in one attempt. Ideally the client would download sections of the package with, say, HTTP Range requests, allowing smaller parts of the package to be retried and resumed; many download managers take this approach (a rough sketch of the idea follows this list).
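
To make item 4 concrete, here is a minimal, hypothetical sketch of resuming a partially downloaded .nupkg with an HTTP Range request. It is not part of any existing NuGet client; the URL and paths are placeholders, and it assumes the server honors Range headers (blob storage endpoints generally do):

```csharp
using System.IO;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

// Hypothetical sketch only: resume a partially downloaded package file
// using an HTTP Range request, assuming the server supports range GETs.
static class ResumableDownload
{
    public static async Task DownloadAsync(string url, string destination)
    {
        using var http = new HttpClient();

        // Resume from however many bytes are already on disk.
        long existing = File.Exists(destination) ? new FileInfo(destination).Length : 0;

        var request = new HttpRequestMessage(HttpMethod.Get, url);
        if (existing > 0)
            request.Headers.Range = new RangeHeaderValue(existing, null);

        using var response = await http.SendAsync(request, HttpCompletionOption.ResponseHeadersRead);
        response.EnsureSuccessStatusCode();

        // 206 Partial Content means the server honored the Range header;
        // anything else means we received the whole file and should start over.
        var mode = response.StatusCode == HttpStatusCode.PartialContent
            ? FileMode.Append
            : FileMode.Create;

        using var file = new FileStream(destination, mode, FileAccess.Write);
        await response.Content.CopyToAsync(file);
    }
}
```

The gap described above is not server-side support for range reads so much as wiring this kind of retry/resume logic into the protocol and official tooling.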

All in all, the NuGet tooling (both server and client) doesn't work well for large packages, especially in non-ideal network conditions. For us to ship large package support and have it be a great experience for the majority of our users, I think we'd need to do some work to improve our software.

If we simply increase the limit to a much higher value (for example 1 GB), my team would undoubtedly get even more reports and customer support requests about package upload or download issues. And the best mitigation we'll be able to provide in this case is "try again" or "use a faster internet connection". For users, especially our less experienced or less technical users, this isn't a very helpful answer and would lead to frustration.

That being said, the current limit was selected several years ago, so the majority of our users may be in a better position to handle a larger limit now. I don't know of any data on our user cohort to support this theory, but I think it's a safe assumption. There could be a middle ground where we make no changes to our software and simply increase the limit by a modest amount. Even this would require some testing to ensure our backend services can handle the change effectively. It's hard to know what that new limit should be; we'd need to select it confidently and avoid any take-backs (i.e. reverting the change due to unexpected problems).

In the short term, you can consider working around the problem by downloading the ML model (or, generally, any large file that can't fit inside the package) in an MSBuild target (docs). I've seen the Uno.Wasm.Bootstrap package do this to download an extra toolchain that lives outside the package. Here's an example: Uno.Wasm.Bootstrap.targets, UnoInstallSDKTask.cs; a rough sketch of such a task follows below. You will of course need to host the data file yourself, but this could ease the installation process by automating it at build time. Importantly, a download at build time may be seen as unexpected behavior (as mentioned in our recent blog post), so be sure to document it at the top of your package README and description so package consumers are not surprised by this flow. Again, this is only a workaround and will need to be implemented and maintained by the package author.
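
To sketch the shape of this workaround, here is a hypothetical custom MSBuild task in the spirit of UnoInstallSDKTask.cs, which would be wired up with a UsingTask and a Target in your package's .targets file. All names and properties below are placeholders, not APIs from any real package:

```csharp
using System.IO;
using System.Net.Http;
using Microsoft.Build.Framework;
using Microsoft.Build.Utilities;

// Hypothetical sketch: an MSBuild task that downloads a large data file
// (e.g. an ML model) at build time if it is not already cached locally.
public class DownloadLargeAssetTask : Task
{
    [Required]
    public string SourceUrl { get; set; }

    [Required]
    public string DestinationPath { get; set; }

    public override bool Execute()
    {
        if (File.Exists(DestinationPath))
        {
            Log.LogMessage(MessageImportance.Low,
                $"'{DestinationPath}' already exists; skipping download.");
            return true;
        }

        Directory.CreateDirectory(Path.GetDirectoryName(DestinationPath));
        Log.LogMessage(MessageImportance.High, $"Downloading {SourceUrl} ...");

        using var http = new HttpClient();
        using var stream = http.GetStreamAsync(SourceUrl).GetAwaiter().GetResult();
        using var file = File.Create(DestinationPath);
        stream.CopyTo(file);

        return !Log.HasLoggedErrors;
    }
}
```

For simple cases, MSBuild's built-in DownloadFile task may be enough to do the fetch without custom code; a custom task mainly buys you caching, hashing, or retry logic.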

Another workaround would be to host your own package feed that contains these packages and instruct your users to add your "large package" or "ML data" feed as an additional package source in their client tooling. For more information, see Hosting your own NuGet feeds.

We'll leave this issue open since it is indeed unresolved, and we'll gauge its priority w.r.t. our other work based on upvotes.

AsakusaRinne commented 1 year ago

Thanks a lot for your answer. :) I'll try MSBuild or another approach as a workaround.

JonDouglas commented 1 year ago

@AsakusaRinne There is an undocumented runtime.json feature that allows you to split things into a meta package + many per-RID packages.

I can't find much documentation on it, but this blog might be a good start.

https://natemcmaster.com/blog/2016/05/19/nuget3-rid-graph/

I've updated the parent comment to help collect upvotes. Thank you for filing this issue!

nietras commented 1 year ago

Note that https://github.com/dotnet/TorchSharp, as far as I know, has the same issue with the NuGet package size limit: the NVIDIA cuDNN DLLs are too large (a single DLL, e.g. cudnn_cnn_infer64_8.dll, is 414 MB uncompressed). They went to great lengths to do their own splitting of these files, creating multiple NuGet packages for the parts plus a custom target that recombines them. It would be great if such multi-part packages were a supported scenario if the size limit is not raised. (I note that Azure DevOps has a 512 MB limit instead of 256 MB.)

You can see this in https://www.nuget.org/packages/libtorch-cuda-11.7-win-x64/


AsakusaRinne commented 1 year ago

After learning from TorchSharp and runtime.json, we have now been able to publish our package! Thank you for all your help!

A small tip for others with the same problem: rather than configuring everything in the repo, I chose to build the NuGet package manually with NuGet Package Explorer, which is much easier. This way you only need to replace the native library files each time you publish a new version. Since the native library is only updated every few months, it's acceptable for me to update it manually.

nietras commented 1 year ago

@AsakusaRinne is that package public? Would be great to have the example for reference 😊

AsakusaRinne commented 1 year ago

> @AsakusaRinne is that package public? Would be great to have the example for reference 😊

Sure, here it is: SciSharp.Tensorflow.Redist-Linux-GPU.2.11.0

The package depends on four other packages: primary, fragment1, fragment2, and fragment3.

The fragment packages contain the fragments of the large file, while primary holds the sha256 hash of the original file.

When the project is built, the .targets file in the primary package combines the fragments and checks whether the reconstructed file has the same sha256 as the original file. If the hash matches, it puts the file in the output path of your project (a rough sketch of this logic is shown below).
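
For readers who want the gist of that logic without digging through the package, here is a rough, hypothetical sketch of the reassembly and hash check; names are illustrative, and the real implementation lives in the package's .targets and task files:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical sketch of what the primary package's .targets effectively does:
// concatenate the fragment files in order, then compare the SHA-256 of the
// reconstructed file against the hash shipped in the primary package.
static class FragmentAssembler
{
    public static void Combine(string[] fragmentPaths, string outputPath, string expectedSha256)
    {
        // fragmentPaths must be supplied in the correct order (fragment0, fragment1, ...).
        using (var output = File.Create(outputPath))
        {
            foreach (var fragment in fragmentPaths)
            {
                using var input = File.OpenRead(fragment);
                input.CopyTo(output);
            }
        }

        using var sha = SHA256.Create();
        using var reconstructed = File.OpenRead(outputPath);
        var actual = Convert.ToHexString(sha.ComputeHash(reconstructed));

        if (!actual.Equals(expectedSha256, StringComparison.OrdinalIgnoreCase))
            throw new InvalidOperationException(
                $"Reconstructed file hash {actual} does not match expected {expectedSha256}.");
    }
}
```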

If anyone else faces the same problem in the future, there's a simple way to reuse our nupkg (or TorchSharp's):

  1. Install NuGet Package Explorer.
  2. Split your file into several parts and compute the sha256 hash of the original file. Here's some code that does this (or find it in the tools/Tensorflow.Redist.NativeLibrarySplitter folder of tensorflow.net after the PR is merged); a sketch of such a splitter also appears after this list.
  3. Open the fragment package with NuGet Package Explorer and replace the fragment files with yours (taking care to follow the file naming rule).
  4. Open the primary package and replace the sha256 hash file with yours, and rename the empty file libtensorflow.so to your original file name.
  5. Open the main package (SciSharp.Tensorflow.Redist-Linux-GPU.2.11.0.nupkg) and modify its dependencies.

After these steps, the package will be ready to publish (but test it locally first). It's a plain approach, but it's simple to put together. I hope it helps.
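
As a reference for step 2, a splitter can be as simple as the following hypothetical sketch; the fragment naming and the 200 MB fragment size are illustrative only, so match whatever naming rule the fragment packages you are reusing expect:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical sketch of the splitting step: record the SHA-256 of the original
// file, then cut it into fixed-size fragments (libfoo.so.fragment0, .fragment1, ...).
static class NativeLibrarySplitter
{
    public static void Split(string inputPath, long fragmentBytes = 200L * 1024 * 1024)
    {
        // Hash of the original file, used later to verify the reassembled copy.
        using (var sha = SHA256.Create())
        using (var original = File.OpenRead(inputPath))
        {
            File.WriteAllText(inputPath + ".sha256", Convert.ToHexString(sha.ComputeHash(original)));
        }

        using var source = File.OpenRead(inputPath);
        var buffer = new byte[81920];
        int index = 0;
        while (source.Position < source.Length)
        {
            using var fragment = File.Create($"{inputPath}.fragment{index++}");
            long written = 0;
            int read;
            while (written < fragmentBytes &&
                   (read = source.Read(buffer, 0, (int)Math.Min(buffer.Length, fragmentBytes - written))) > 0)
            {
                fragment.Write(buffer, 0, read);
                written += read;
            }
        }
    }
}
```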

agr commented 1 year ago

Just throwing out an idea: the ZIP format (which .nupkg files use) supports "volumes", enabling larger archives to be split into smaller chunks. That might be a potential way to enable larger packages without having to download large files in one go. However, it would likely require a protocol update (as well as support from both client and server) and would raise backwards-compatibility issues.

nietras commented 1 year ago

FYI regarding runtime.json.

"Improve handling of native packages (Support RID specific dependencies)" https://github.com/NuGet/Home/issues/10571 discusses issues related to this and shows how https://www.nuget.org/packages/libclang is packaged via the runtime.json trick.

"Should runtime. packages be listed in NuGet.org?" https://github.com/dotnet/core/issues/7568 similarly discusses issues around this and points to https://www.nuget.org/packages/Microsoft.NETCore.App package which has multiple runtime specific "sub-packages" https://www.nuget.org/packages?q=Microsoft.NETCoreApp.Runtime

For libclang the runtime.json in the meta package looks like:

```json
{
  "runtimes": {
    "linux-arm64": {
      "libclang": {
        "libclang.runtime.linux-arm64": "16.0.6"
      }
    },
    "linux-x64": {
      "libclang": {
        "libclang.runtime.linux-x64": "16.0.6"
      }
    },
    "osx-arm64": {
      "libclang": {
        "libclang.runtime.osx-arm64": "16.0.6"
      }
    },
    "osx-x64": {
      "libclang": {
        "libclang.runtime.osx-x64": "16.0.6"
      }
    },
    "win-arm64": {
      "libclang": {
        "libclang.runtime.win-arm64": "16.0.6"
      }
    },
    "win-x64": {
      "libclang": {
        "libclang.runtime.win-x64": "16.0.6"
      }
    },
    "win-x86": {
      "libclang": {
        "libclang.runtime.win-x86": "16.0.6"
      }
    }
  }
}
```

JonDouglas commented 10 months ago

Hi folks,

I'm late to this one, although I've had various conversations about it over the last few years. I propose that we double the package size limit on NuGet.org, for a number of reasons.

  1. Native libraries are getting bigger. The authors of these packages want to maintain fewer "fragment" or "partial" packages, as the average size per fragment is ~200 MB+.
  2. AI has boomed. The AI efforts for .NET could benefit from being able to ship reasonably sized models (250-500 MB) in NuGet. The only alternative appeared recently as a preview: https://techcommunity.microsoft.com/t5/ai-machine-learning-blog/streamline-your-mlops-practice-with-model-packages/ba-p/3959684 (this is especially important for tensor native libraries too).
  3. Azure DevOps supports 500 MB; no idea about GitHub Packages.
  4. Package sizes continue to increase in general; that's just the nature of our world. Many packages were a fraction of their current size a few years ago, which you can read as a take on Wirth's law.

In an effort to accommodate larger libraries, make distribution easier, reduce the need for splitting, and remain competitive, these reasons should be enough to consider this change as an interim measure while other products come to market to solve specific problems like hosting large AI models and distributing them appropriately.

The cost is that we won't have great answers to the scaling challenges Joel mentioned earlier in this thread. I think the majority of people would be okay with that, given that the benefit outweighs the cost right now.

alexrp commented 7 months ago

A greater package size limit would also be helpful for my packaging of Zig toolsets: https://www.nuget.org/packages?q=vezel.zig.toolsets

Right now, I publish a package per build machine RID (not target RID!), so runtime.json does not apply here. The result is that users have to jump through some annoying hoops to get the right package for their build machine, which also makes it hard to integrate usage of these toolset packages into other packages that contain build files.

I'm hoping to combine these packages into a single package one day, as it would significantly simplify the user experience, and allow the toolset packages to be used in more scenarios. But as you can see, each package is already ~75 MB, so combining 10 of them immediately runs into the package size limit.

moljac commented 5 months ago

To add another one:

The same problem exists for the TensorFlow bindings for .NET Android in GooglePlayServices:

https://github.com/xamarin/GooglePlayServicesComponents

luisquintanilla commented 1 month ago

Leveraging NuGet for hosting packages that include ML models simplifies a few workflows: