chocolatey / choco

Chocolatey - the package manager for Windows
https://chocolatey.org

Cache download url objects #1390

Open dhoer opened 7 years ago

dhoer commented 7 years ago

@ferventcoder thanks for your response on https://stackoverflow.com/questions/45867716/nexus-to-serve-up-chocolately-packages about recompiling packages. That is good to know, but it got me thinking that there is a code smell with this approach. Choco is touching the original source. Why not cache the download urls? Choco already has a convention to require url, checksum and checksum type for x86 and/or x64 installs, so it could be possible to do this. This approach doesn't alter the original choco package.

Since the word cache is already used, let's call this stash for the purpose of this conversation. Feel free to change it.

Stash will store objects downloaded from the urls defined in a choco package. The stash will be in nuget format, with the url hashed and appended after the version (I don't know nuget, but I think that should work; if not, you get the general gist).

Stash Command

choco stash [list]|add|remove|disable|enable [<options/switches>]

Examples

choco stash add -n=global -s="https://somewhere/out/there/api/v2/" -k="'314132413251235'"
choco stash add --default -n=development -s="https://somewhere/else/out/there/api/v2/" -k="'231238747123423'" 
choco stash remove -n=stage
choco stash disable -n=global
choco stash enable -n=global

Options/Switches

-n, --name=VALUE
     Name - the name of the stash. Required with some actions. Defaults to 
       empty.

-s, --source=VALUE
     Source - The source. This can be a folder/file share or an http location. 
       If it is a url, it will be a location you can go to in a browser and 
       it returns OData with something that says Packages in the browser, 
       similar to what you see when you go to https://chocolatey.org/api/v2/. 

-k, --key=VALUE
     Key - The Source's key. Encrypted in chocolatey.config file.

     --default
     Default stash - Only one stash can be default. The first stash added
       to the list will automatically be set as default. Stashed urls will
       use the default when --stash-name is not provided.

     --ignore-stash
     Ignore Stash - A comma delimited list of URIs that will be ignored. By
       default source is automatically ignored.

Choco install/upgrade

Options/Switches

     --no-stash
     No Stash - Will not stash object downloaded even if stash is enabled

     --stash-name
     Stash Name - Name of stash to store downloaded object.  If not provided,
     the default stash is used.

Algorithm

This seems like a cleaner approach, but I don't know all the ins and outs like you do.

ferventcoder commented 7 years ago

Choco is touching the original source. Why not cache the download urls? Choco already has a convention to require url, checksum and checksum type for x86 and/or x64 installs, so it could be possible to do this.

No convention here - it's a requirement due to distribution rights. Keep in mind that the community package repository is but one Chocolatey repository in a sea of thousands. Even with 5K packages, it is the tip of the iceberg in packages. The rest are all internal, and that represents a much larger portion of packaging. The actual convention is to embed the binaries directly in the package for the utmost in reliability. We are adding this to choco new in 0.10.8 so that folks better understand the conventions - https://gist.github.com/ferventcoder/dac662b6ae05f93ff22e4a093dbb56d0#file-_todo-txt

To give you a better grasp of what I mean by tip of the iceberg, see https://www.slideshare.net/ferventcoder/webinar-chocolatey-package-management-with-proget/18

ferventcoder commented 7 years ago

Why not cache the download urls?

Chocolatey already does this. That's what cacheLocation is. Perhaps it's best to read over https://stackoverflow.com/a/18596173/18475 to get a good understanding of options available.

dhoer commented 7 years ago

I might be a little slow here. So cache can be configured to point to a nexus hosted repo? It looks like it is for a local server where install is occurring. I know I can cache the choco package on nexus by proxying chocolatey.org, but I don't understand how to configure choco cache to use nexus.

And I understand that some packages like jdk8 may not be cacheable due to url/url64 not being used (this is because of Oracle requiring cookies to be set in order to download). And maybe those have to be internalized? But following 80/20 rule, I'm sure most installers are using the url/url64 settings.

dhoer commented 7 years ago

I take it this is where cache logic is: Get-PackageCacheLocation. If I knew powershell and windows I would submit a PR to add this functionality, but I struggled just to write a config file.

But wow, this would be nice to have. If this existed, I would probably add "Chocolatey Cache" hosted repo on nexus and push to it when there is a cache miss.

dhoer commented 7 years ago

The cache would be for internal use. I wouldn't expect chocolatey.org to have a cache due to distribution rights. But this shouldn't be an issue for private internal caches.

AdmiringWorm commented 7 years ago

I'm going to cherry-pick a few things I think I may answer.

So cache can be configured to point to a nexus hosted repo? It looks like it is for a local server where install is occurring

Not with how the existing caching works (AFAIK). The current caching is for previously downloaded executables/archives, which are stored on the user's local computer. That local copy is checked, and if none of the checks report a failure, the previously downloaded executable/archive is used. Anyways, why not add that nexus hosted repo as an additional source? choco source add -n=nexus -s="https://where.i.am.located/api" (a file path can also be used)

And I understand that some packages like jdk8 may not be cacheable due to url/url64 not being used

If the package is using the built-in chocolatey download helper (which I believe it does), it's cacheable.

And maybe those have to be internalized?

That should be done with most packages, as long as you can do exactly that.

I take it this is where cache logic is: Get-PackageCacheLocation.

No, sorry. That is a helper function for packages to get the cache location to use when downloading files that need extra care; it's not meant to be used outside of packages.

I wouldn't expect chocolatey.org to have a cache due to distribution rights.

It kind of does, though (in a way): for all licensed products of chocolatey, a private CDN is used (can be used) to download the external files instead of downloading them directly from the original location.

dhoer commented 7 years ago

@AdmiringWorm Thanks for the feedback.

I'm not sure what you mean by this:

Anyways, why not add that nexus hosted repo as an additional source? choco source add -n=nexus -s="https://where.i.am.located/api" (a file path can also be used)

I do disable the chocolatey source and add in our nexus group repo as a source. But the point I was trying to make was that it would be nice to set up a global cache and point it to a nuget hosted repo on nexus, so that all cache misses would automagically be pushed to the hosted repo, which would then make them visible to the group repository configured in source.

I wrote up how I configured Nexus here: https://stackoverflow.com/a/45871332/4548096

ferventcoder commented 7 years ago

I might be a little slow here. So cache can be configured to point to a nexus hosted repo? It looks like it is for a local server where install is occurring.

For the local machine, each local machine. Some folks have attempted to set it to a share location that all machines could take advantage of but have found a race contention on creation of files there.

dhoer commented 7 years ago

I just spent time refactoring a rather large windows farm with a centralized log server that had the same issue. It's best to stay away from shared drives. AWS offers SSM, which could possibly manage the cache, but I think it would be best to stick with a nuget approach.

ferventcoder commented 7 years ago

That is good to know, but it got me thinking that there is a code smell with this approach. Choco is touching the original source.

@dhoer No code smell; the original source is set that way due to non-redistribution. Using an external cache still has a failure point. The best and most reliable method of using Chocolatey is to ensure the package (fancy zip file) has everything in the package. This is what Package Internalizer provides. I would suggest looking closer at that functionality. https://chocolatey.org/docs/features-automatically-recompile-packages

dhoer commented 7 years ago

IMHO it is. It requires someone or something to execute that intermediary step. It is not feasible for organizations that have many teams, many internal packages, and many more shared packages to try to ensure that intermediary step is done and done properly.

The workaround for this is to require all deployments that rely on choco installs from outside the organization have the installs be baked into an AMI at the beginning of the pipeline process. This ensures that there are no broken link issues during deployments.

dhoer commented 7 years ago

@ferventcoder Look, chocolatey is a fantastic product. Thank you, thank you, thank you for building this. It was badly needed on the windows platform.

I don't want to sound rough, but I know enterprise software, and automatically-recompile-packages is not an enterprise solution. The cache method is an enterprise approach. With this approach, you could host chocolatey and guarantee that anything vetted in shared repo will have its downloads cached and not have to worry about it. This is something worth paying for. It might even open chocolatey up for partnerships with aws, gc and azure. So think big picture. Think enterprise. That is where the value is.

ferventcoder commented 7 years ago

TBH none of the Enterprise-level customers we have use the community repo. Most don't internalize or need some sort of caching because at a true Enterprise level they are building their own packages already and have staff for this level of support.

ferventcoder commented 7 years ago

However you do have some good points on caching/stashing - one point for clarification though - how is the stash not something you would need to run more than once? You mentioned

It requires someone or something to execute that intermediary step.

I'm trying to resolve in my mind how stash would not be considered the same.

The problem with a cache is that it is not deterministic - internalization is deterministic. I'm not sure why a non-deterministic feature would be considered enterprise-grade.

That said, we could consider a caching/stash feature but I'd need to understand how you foresee it as deterministic (and reliable). Making a package 100% offline and reliable is the goal for what internalizer does, would love to understand if the goal is something you would consider enterprise-grade or not and we can work from there.

dhoer commented 7 years ago

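deterministic cache

  • package must use url and/or url64; otherwise, warn that not-cacheable or maybe fail if --force-cache flag set
  • hash the url with SHA-1
  • strip off first 8 chars and append to nuget package (name-version-8charHash)
  • ask repo for package name-version-8charHash
  • cache hit - download nuget package name-version-8charHash, verify checksum
  • cache miss - download url, verify checksum, push to cache repo as name-version-8charHash
  • continue with package installer...

service offering

  • since you own chocolatey.org, anything uploaded that contains url/url64 must be able to download object successfully
  • behind the scenes you cache the download object and make it available internally upon package approval
  • your private service offering would guarantee access to that internal cache, thus allowing customer to focus on their internal package and not worry about internalizing 3rd party packages

open source

The community would be allowed to have a private cache as well, in my case it is a nexus hosted repo, but there is no guarantee that the latest choco package from chocolatey.org's url will be available.

next step

Develop a business plan on how to enhance your service offering to cloud vendors and how using your product will benefit infrastructure management of windows platforms.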

dhoer commented 7 years ago

Other enhancements that would be nice:

  1. Have a simple way to run sanity checks, e.g., choco verify. Whether that is a pluggable test framework, you roll your own, or both. I like http://serverspec.org/ syntax, but not the ruby baggage that comes with it.

  2. Make it easier to jump to the developer's source code on their repo. The view on the package page is nice, but I wanted to do a PR on someone's code and it was a pain to figure out where their source code was, since there were no links that didn't point back to chocolatey.org.

  3. Lastly, it would be nice to allow for official packages like hub.docker.com does.

ferventcoder commented 7 years ago

service offering

  • since you own chocolatey.org, anything uploaded that contains url/url64 must be able to download object successfully
  • behind the scenes you cache the download object and make it available internally upon package approval

We do this now already. Let's talk about how this is alike or different from our CDN cache - https://chocolatey.org/docs/features-private-cdn

With the download CDN cache, we already do this for folks using the community repository. The download CDN cache is targeted more at our Pro customers (individuals looking for more reliability with the community repo) and MSP customers (low interaction with keeping things up to date, open to placing more trust in a community).

When it comes to Enterprise customers and security conscious customers, they just are not going to reach out to internet resources at all. We've had conversations with hundreds of organizations, big and small, and the preferred use of Chocolatey is completely internal so they can have a trusted, repeatable process. That even includes reaching out to chocolatey.org at runtime, they just are not going to do it. We understand this. We've understood this for years, it's what has shaped our current offerings.

your private service offering would guarantee access to that internal cache, thus allowing customer to focus on their internal package and not worry about internalizing 3rd party packages

Most folks use internalizer with Jenkins job(s), and it's pretty much hands off. So for most, it is set it and forget it and you get the benefit of fully internalized packages with little effort.

And that's saying that a customer is even going to reuse package logic from the community repository. Some are just right clicking on those executable installers and MSIs and selecting "Create Chocolatey Package" and they have a fully ready to go software deployment package in about 5 seconds. Pointing Package Builder to an archive of installers that an organization has would allow them to automate all of their software deployments very, very quickly.

ferventcoder commented 7 years ago

deterministic cache

  • package must use url and/or url64; otherwise, warn that not-cacheable or maybe fail if --force-cache flag set

That could be a good addition to our download cdn cache feature

  • hash the url with SHA-1

We use SHA512, I believe; SHA1 has been broken.

  • strip off first 8 chars and append to nuget package (name-version-8charHash)
  • ask repo for package name-version-8charHash
  • cache hit - download nuget package name-version-8charHash, verify checksum
  • cache miss - download url, verify checksum, push to cache repo as name-version-8charHash
  • continue with package installer...

Implementation details here, we are already doing this functionality for the community repo. There are some ideas here on what we can offer, and also legal understandings on reoffering our CDN for internal use we would need to ensure, but it could be a nice feature to offer.
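
The name-version-8charHash identifier proposed in this exchange can be sketched in a few lines. This is only an illustration of the idea, not how Chocolatey or its CDN cache builds keys. It assumes "strip off first 8 chars" means taking the first 8 hex characters of the digest, and the function name `stash_package_id` and the example URL are hypothetical. The hash algorithm is a parameter because the thread disagrees on SHA-1 versus SHA512.

```python
import hashlib


def stash_package_id(name: str, version: str, url: str, algorithm: str = "sha1") -> str:
    """Build a name-version-8charHash identifier for a cached download.

    Assumption: the 8-character suffix is the first 8 hex characters of the
    digest of the download url. The thread does not pin this down further.
    """
    digest = hashlib.new(algorithm, url.encode("utf-8")).hexdigest()
    return f"{name}-{version}-{digest[:8]}"


if __name__ == "__main__":
    # Hypothetical package name, version, and url, for illustration only.
    print(stash_package_id("example", "1.2.3", "https://example.com/setup.exe"))
    print(stash_package_id("example", "1.2.3", "https://example.com/setup.exe", "sha512"))
```

The same inputs always produce the same identifier, which is what lets any machine ask the repository for the cached object without coordinating with other machines.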

dhoer commented 7 years ago

I recommended sha1 because this is not for security: https://stackoverflow.com/a/28792805/4548096

Some advice: don't limit chocolatey to the "Package Manager for Windows" paradigm. Move to an "Infrastructure Services for Windows" paradigm. This opens up chocolatey to more opportunities, like security. Think if choco services were on aws and a sev 10 vulnerability was issued on a package. It would be valuable if a chocolatey service sent out an alert and said which instances were vulnerable. And maybe choco had a config management service that allowed for scheduling updates to those instances. These types of services will make you a millionaire multiple times over. Btw, be sure to cut me a fat check when that happens.

Adding a simple cache for downloads and security for params is easy for someone to come along and implement. Having services mentioned above is not. I would make the cache and param security pieces freely available and focus on infrastructure services.

ferventcoder commented 7 years ago

Some advice: don't limit chocolatey to the "Package Manager for Windows" paradigm. Move to an "Infrastructure Services for Windows" paradigm.

@dhoer it's not limited to that paradigm. You really should learn more about "Complete Software Management for Windows"

And maybe choco had a config management service that allowed for scheduling updates to those instances.

That's the Chocolatey Central Console. Have you been to https://chocolatey.org/pricing#compare?

dhoer commented 7 years ago

@dhoer it's not limited to that paradigm. You really should learn more about "Complete Software Management for Windows"

Yeah, I don't know what that is or how to google it.

That's the Chocolatey Central Console. Have you been to https://chocolatey.org/pricing#compare?

Nice!

dhoer commented 7 years ago

We do this now already. Let's talk about how this is alike or different from our CDN cache - https://chocolatey.org/docs/features-private-cdn

Not sure how the cdn piece works. Does that require internalizing package first? If so, then cache makes internalizing step unnecessary for packages with url and url64 defined.

The internalizing step would require a build job for each package in our shop, since build servers are the only ones with keys to push to a repository. If multiple teams use the same package, who is the owner of the build? Who is allowed to update it? All this headache goes away with cache because this becomes a non-issue.

ferventcoder commented 7 years ago

Not sure how the cdn piece works. Does that require internalizing package first? If so, then cache makes internalizing step unnecessary for packages with url and url64 defined.

Nope. It just works when you install packages from the community repository.

ferventcoder commented 7 years ago

If multiple teams use the same package, who is the owner of the build? Who is allowed to update it? All this headache goes away with cache because this becomes a non-issue.

I feel like we keep going back and forth on semantics - who owns the cache? Who is allowed to update it?

And when you say teams, are you talking about development teams or ops teams?

ferventcoder commented 7 years ago

This feels like a discussion we should have in person somewhere, and then capture the results in an issue.

dhoer commented 7 years ago

I will be in San Fran this week if you want to meet for coffee or something.

ferventcoder commented 7 years ago

I'm nowhere close to that area. :D

dhoer commented 7 years ago

Bottom line: if a cache were built similar to what is posted at the top, but with a force-cache feature added, then the steps of ensuring everything from choco is cached in house could be done in these 4 lines:

choco source disable -n=chocolatey
choco source add -n=choco-all -s "'http://repo.example.com/nexus/service/local/nuget/choco-all/'"
choco cache add -n=choco-cache -k "'redacted'" -s "'http://repo.example.com/nexus/service/local/nuget/choco-cache/'" 
choco feature enable -n=forceCache

This would be enforced on build servers. The first 2 lines would be recommended practice for developers, but if they didn't do it, not an issue.

ferventcoder commented 7 years ago

Force cache is the deterministic bit I was missing earlier.

And there is still the piece about getting new items automagically updated in the cache when they become available.

dhoer commented 7 years ago

And there is still the piece about getting new items automagically updated in the cache when they become available.

When cache feature is implemented, it should happen automatically. When choco install or upgrade is called on a package hosted on chocolatey.org that uses url/url64, it uses choco-all source defined above to determine a cache hit/miss on download url and caches on miss by pushing to choco-cache.

Private internal repos (like our nexus) would run the risk that the download url is no longer available when it tries to cache it, but the paid for private repo service hosted by chocolatey shouldn't since it would have cached it during the approval process.
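
The hit/miss behaviour described in this comment can be sketched as follows. It is a simplified model and not an existing Chocolatey feature: a plain dict stands in for the choco-cache nuget repository, the caller supplies the `download` function, and the names `fetch_with_cache`, `package_id`, and `verify_checksum` are hypothetical. It also assumes the package checksum is SHA-256.

```python
import hashlib
from typing import Callable, Dict


def package_id(name: str, version: str, url: str) -> str:
    # name-version-8charHash, using the first 8 hex chars of SHA-1(url).
    return f"{name}-{version}-{hashlib.sha1(url.encode('utf-8')).hexdigest()[:8]}"


def verify_checksum(data: bytes, expected_sha256: str) -> None:
    # Assumption: the package declares a SHA-256 checksum for its download.
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256.lower():
        raise ValueError(f"checksum mismatch: expected {expected_sha256}, got {actual}")


def fetch_with_cache(
    name: str,
    version: str,
    url: str,
    checksum: str,
    cache_repo: Dict[str, bytes],
    download: Callable[[str], bytes],
) -> bytes:
    key = package_id(name, version, url)
    if key in cache_repo:
        # Cache hit: use the stored object, but still verify it.
        data = cache_repo[key]
        verify_checksum(data, checksum)
        return data
    # Cache miss: download from the original url, verify, then push to the cache.
    data = download(url)
    verify_checksum(data, checksum)
    cache_repo[key] = data
    return data
```

Because the object is verified before it is pushed, a bad download never lands in the cache. If `download` raises because the original url is gone, the error propagates on a miss, which is the risk for private internal repos noted above.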

ferventcoder commented 7 years ago

Private internal repos (like our nexus) would run the risk that the download url is no longer available when it tries to cache it, but the paid for private repo service hosted by chocolatey shouldn't since it would have cached it during the approval process.

One clarification that is necessary here - organizational features make perfect sense for C4B, but not always for open source. One of the benchmarks for determining where a feature falls is whether an open source user (not an organization) would find value in a feature. A user already has a cache that gets built locally automatically when they are installing packages. The good news for you is that this feature does have value, but not necessarily in open source.

dhoer commented 7 years ago

I'm not going to get time to open source this feature. ☹️

But I did have a few questions about how this will be implemented:

  1. Will the current local cache be deprecated in favor of this cache and will it be the default?
  2. Will the hash be url+checksum+version? I'm thinking about how google-chrome url stays the same but the checksum and version don't.
  3. How will packages like jdk8 be handled? I don't think jdk8 uses choco's url downloader today. This is important because Java now moves previous releases to OTN. These older releases have a different url that requires an OTN account. It would be nice to have this package refactored to ensure it's cacheable.
  4. How would chocolatey identify critical packages like jdk8 mentioned above?
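
Question 2 can be made concrete with a small comparison. The sketch below does not say how Chocolatey would implement this. It only shows why hashing the url alone cannot tell two releases apart when the url never changes, while hashing url, checksum, and version together can. The function names and the example values are hypothetical.

```python
import hashlib


def url_only_key(url: str) -> str:
    # First 8 hex chars of SHA-1 over the url alone.
    return hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]


def composite_key(url: str, checksum: str, version: str) -> str:
    # First 8 hex chars of SHA-1 over url, checksum, and version together.
    # The checksum is lower-cased so that hex casing does not change the key.
    material = "\n".join((url, checksum.lower(), version))
    return hashlib.sha1(material.encode("utf-8")).hexdigest()[:8]


if __name__ == "__main__":
    # Hypothetical "latest" url that stays the same across releases.
    url = "https://example.com/latest/setup.exe"
    old = composite_key(url, "aa" * 32, "1.0.0")
    new = composite_key(url, "bb" * 32, "2.0.0")
    print(url_only_key(url), old, new)
```

With the url-only key, both releases map to the same suffix, so only the version in name-version-8charHash separates them. The composite key changes whenever the checksum or version changes.
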

dhoer commented 6 years ago

@ferventcoder Are you still planning to roll this out in 0.10.11? When we pay for the caching service, will we still be able to have an internal cache? CDNs are great and all, but I have been burnt in the past by CDN misconfigurations. So it would be nice to manage the cache in-house on our Nexus artifact repo since we have total control over it.