JuliaLang / Juleps

Julia Enhancement Proposals

Pkg3: immutability of compatibility #14

Open StefanKarpinski opened 7 years ago

StefanKarpinski commented 7 years ago

Continuing half of the discussion on https://github.com/JuliaLang/Juleps/issues/3.

tbreloff commented 7 years ago

This is no longer true in Pkg3. By introducing multiple registries, some of which are private, the registration system necessarily becomes distributed, federated and not globally visible

For me, this statement is the key reason that dependency info should be independent in concept from the source repo. Dependency resolution will be determined by one or more (possibly conflicting) registries. In the end, I don't think it's ever a robust solution to use the deps from the package repo... it should be completely determined by the registries. Now, keeping a copy of dependency info for initialization of a registry file is a mere convenience, and is orthogonal to whether repo code and dependency info are independent concepts.

tkelman commented 7 years ago
  1. When a package’s source is checked out somewhere, we should apply any registry updates to its config file.

Why? You haven't given any reason for doing this or problem that it solves, and 3-5 flow from this fairly weak motivation.

  1. It’s confusing for the checked out source version of a package to say one thing about compatibility while the registry for the package says something else.

I disagree that this is all that confusing. They are different information by way of past package versions being immutable, and registries not.

If config.toml is only the claim, at tagging time, about compatibility, then sure it can be an immutable part of a source release's content. But you shouldn't use the claim at tagging time indefinitely as the source of this information. That would necessitate making a new source release for any change to its content, and if that process allows replacement of anything other than the compatibility info in config.toml, then it makes the primary job of the package manager, ensuring previously published releases can be depended on indefinitely, unreliable all around. This is my core objection to this, unconstrained replacement of previously published versions should not be a designed-in feature that's up to manual review in a registry to enforce.

StefanKarpinski commented 7 years ago

version tagging will flow from the registry to the source repo, not the other way around

How do you envision that this would work?

Request registration, which can be rejected or accepted. Once a version is registered, then do the tagging – this is one place where the registry having commit access to the repo would be handy. Unfortunately, there are no pull requests for tags. This could be part of future PkgDev functionality.

StefanKarpinski commented 7 years ago

This is no longer true in Pkg3. By introducing multiple registries, some of which are private, the registration system necessarily becomes distributed, federated and not globally visible

For me, this statement is the key reason that dependency info should be independent in concept from the source repo. Dependency resolution will be determined by one or more (possibly conflicting) registries. In the end, I don't think it's ever a robust solution to use the deps from the package repo... it should be completely determined by the registries. Now, keeping a copy of dependency info for initialization of a registry file is a mere convenience, and is orthogonal to whether repo code and dependency info are independent concepts.

So different registries could have completely different notions of what the version numbers and associated commits of a package are? What do you do when registries disagree? How do you reconcile this? Completely external versioning from arbitrarily many federated authorities would be total chaos. There has to be an authoritative source for each package. The obvious place for that is in the package repo itself.

tbreloff commented 7 years ago

Completely external versioning from arbitrarily many federated authorities would be total chaos

What about the very realistic scenario that an organization wants to define specific versions/deps which are not public, some of which overlap with a JuliaLang registry. They could: 1) fork and fix all repos that disagree with their preferred dependency resolution, or 2) use an alternative package manager. Why not try to support 3) Override the dependencies.

I'm not saying "no dependency info allowed in the package repo"... I'm saying that the package repo should not be the definitive source... registries should take precedence. And since registries can take precedence, it's not required to keep deps info in the package repo.

StefanKarpinski commented 7 years ago
  1. It’s confusing for the checked out source version of a package to say one thing about compatibility while the registry for the package says something else.

I disagree that this is all that confusing. They are different information by way of past package versions being immutable, and registries not.

  1. Innumerable conversations with people who are confused about this.
  2. The fact that there are complex rules about whether the METADATA requires file applies or source REQUIRE file applies. I wrote them and I don't remember what they are. Quick – don't look at the Pkg source code and tell me what the rules are.
  3. It's fairly obvious that having two possible sources for a fact is more complicated and confusing than only having a single possible source for it.
  1. When a package’s source is checked out somewhere, we should apply any registry updates to its config file.

Why? You haven't given any reason for doing this or problem that it solves, and 3-5 flow from this fairly weak motivation.

  1. Because of the above confusion.
  2. So that we don't need complex logic in the package manager to decide which applies.
  3. Because one generally wants to include those changes in new versions of the package.
  4. If one doesn't want to include those changes that fact is of interest – why doesn't the change to compatibility apply to downstream versions?

If config.toml is only the claim, at tagging time, about compatibility, then sure it can be an immutable part of a source release's content. But you shouldn't use the claim at tagging time indefinitely as the source of this information.

It's not indefinite – you can make a new compatibility release to update it. We can even allow making those updates in a registry without having to make a source version first since the source version can then be automatically made in the package repo. The point is that the tagged commit should exist at some point.

That would necessitate making a new source release for any change to its content, and if that process allows replacement of anything other than the compatibility info in config.toml, then it makes the primary job of the package manager, ensuring previously published releases can be depended on indefinitely, unreliable all around. This is my core objection to this, unconstrained replacement of previously published versions should not be a designed-in feature that's up to manual review in a registry to enforce.

I've said multiple times that the replacement need not be unconstrained. It's trivial to verify automatically that compatibility updates only make changes to Config.toml. What's the problem with that? How does this make anything unreliable? Older releases don't go away. If you're already using them, they aren't automatically changed or deleted. They are simply shadowed when looking for new versions to use. If v1.2.3+1 exists with updated compatibility claims, the package manager will consider that instead of v1.2.3 which has older, no-longer valid claims. I'm not sure what you're imagining here, but it doesn't reflect what I've said.
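That "trivial to verify" check could be sketched as follows (a hypothetical helper, not Pkg3 code; the allowed file name is assumed from the discussion):

```python
# Hypothetical sketch of the registry-side check: a compatibility-only
# update may touch nothing but the package's config file.
ALLOWED = {"Config.toml"}

def is_compat_only(changed_paths):
    """True if the diff touches only allowed compatibility files."""
    return bool(changed_paths) and set(changed_paths) <= ALLOWED

# In registry automation, the changed paths would come from something
# like `git diff --name-only v1.2.3 v1.2.3+1`.
```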

StefanKarpinski commented 7 years ago

What about the very realistic scenario that an organization wants to define specific versions/deps which are not public, some of which overlap with a JuliaLang registry. They could: 1) fork and fix all repos that disagree with their preferred dependency resolution, or 2) use an alternative package manager. Why not try to support 3) Override the dependencies.

Yes, this is definitely a consideration. I'm considering a naming convention for private versions that will ensure that they don't conflict with public versions. That's what I was getting at with the v1.2.3+hotfix version above. If we disallow build strings aside from bare integers (for compatibility-only updates) in public repos, then this would be guaranteed not to clash with any public version names.
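The naming convention floated here could be sketched like this (assumed, not a settled rule: bare-integer build strings mark public compatibility-only updates, any other build string is reserved for private versions):

```python
import re

# Sketch of the proposed convention: v1.2.3 is a release, v1.2.3+1 a
# public compatibility-only update, v1.2.3+hotfix a private version.
VERSION = re.compile(r"^v(\d+)\.(\d+)\.(\d+)(?:\+(.+))?$")

def classify(tag):
    m = VERSION.match(tag)
    if m is None:
        raise ValueError(f"not a version tag: {tag}")
    build = m.group(4)
    if build is None:
        return "release"
    return "compat-update" if build.isdigit() else "private"
```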

I'm not saying "no dependency info allowed in the package repo"... I'm saying that the package repo should not be the definitive source... registries should take precedence. And since registries can take precedence, it's not required to keep deps info in the package repo.

I see what you're getting at, I think. That sometimes – e.g. in case like the above hotfix scenario – you want to make a registry update to dependencies and then let that flow back to the source repo. But the source repo is itself distributed, being a git repository. In that scenario, the company has their own copy of the package where the forked dependency info lives. Their private registry is just so that they can communicate that version internally. There needs to be support for alternate repository sources for situations like that so that you can find the relevant fork.

tkelman commented 7 years ago

I believe the current rule is that metadata is used when the local copy of a registered package is at a release tag, and the package's REQUIRE is used for unregistered packages or when a package is checked out to a branch or has local modifications. The latter scenario is ruled out by the immutability of installed packages design here. I've made a proposal to get rid of the unregistered special case.

The point is that the tagged commit should exist at some point

It would be useful some of the time, but I don't think it should be required - a compatibility update doesn't absolutely need to have its own independent identity, it is derived from an existing release.

Automated registry level enforcement could work, but that imposes a downstream cost on anyone who wants to maintain their own registry, they'd need to reproduce all that automation to ensure correctness.

StefanKarpinski commented 7 years ago

Automated registry level enforcement could work, but that imposes a downstream cost on anyone who wants to maintain their own registry, they'd need to reproduce all that automation to ensure correctness.

I think this needs to happen no matter what. Having you personally check every registration request does not scale and we already have a lot of things we need to check when a new version is registered. There will only be more things to check in the future.

tkelman commented 7 years ago

If you require a commit to exist in the package repo for a compatibility update and refer to that commit as the source of that version that gets downloaded, then you have to wait for a possibly non responsive package author to act to fix compatibility issues with their package. Or redirect to a fork any time this happens.

We want more automation for testing and auditing in the public registry, but we also want it to be easy and not require too much infrastructure to maintain a separate one.

StefanKarpinski commented 7 years ago

Having multiple sources for packages means that we can have official public forks and organizations can have internal private forks that are checked for commits all the time – no redirection needed. I don't think that comparing two source trees in git is particularly onerous for automation.

simonbyrne commented 7 years ago

Request registration, which can be rejected or accepted. Once a version is registered, then do the tagging – this is one place where the registry having commit access to the repo would be handy. Unfortunately, there are no pull requests for tags. This could be part of future PkgDev functionality.

The only realistic way I could see this working is that the registry itself maintain a fork of all the repositories, and point to those instead: releases could then be git tags which are signed by the registry.

This may also address Tony's concerns, in that the registry maintainers can then push updates to REQUIRE files in the fork, without any input required by package author. It would also address the problem of package authors deleting their repos in a fit of spite, à la NPM's left-pad problem.

StefanKarpinski commented 7 years ago

@simonbyrne: yes, this is probably a good idea. Registry-signed tags make sense too.

tbreloff commented 7 years ago

registry itself maintain a fork of all the repositories

The other benefit is that the community could decide to tag/release without requiring the package author. There have been many times that people would have stepped up and tagged something while the author is on vacation (or whatever).

simonbyrne commented 7 years ago

So as I understand it, a typical release process might look something like:

  1. Package author requests new release via some registry API
  2. Registry performs checks. If it fails we notify the author somehow
  3. Registry pulls data into its fork, tags and signs the tag.
  4. Registry contents are updated.
  5. All dependent packages are also checked for compatibility with the new package: their Config.toml files are updated to reflect the outcomes of this check.

Is that what you had in mind?

(these points are intentionally a bit vague, in particular point 5, but that is probably best discussed in a different issue)

tbreloff commented 7 years ago

That sounds pretty reasonable @simonbyrne. And my point above was that "Package author requests new release" could just as easily be "community requests new release" without any hiccups (with the social understanding that we should default to the author's wishes whenever feasible).

StefanKarpinski commented 7 years ago

Yes, roughly, although I might order it like this instead:

  1. Package author requests new release via some registry API
  2. Registry pulls git data into its fork
  3. Registry performs checks. If it fails we notify the author somehow
  4. Registry tags and signs the tag
  5. Registry contents are updated

One issue with tagging is that IIRC, tags are only transmitted via push/pull, not via pull request, so it's still unclear how to get the tag into the origin repo. For GitHub repos, we could use the tag create API but that doesn't address non-GitHub repos. For those, I suppose we could either have platform-specific APIs or ask the repository owners to pull tags from the registry fork.

I'm also not sure where the best point for checking compatibility is. It could be part of the checks step – if it's a patch release, it shouldn't break any packages that depend on it. We could verify that before accepting a version.

StefanKarpinski commented 7 years ago

Also, note that git tags are usually for commits not trees, so if we use tree tags (which is possible), it will be a bit unusual. We may want to tag a commit for convenience but associate the version with a tree rather than a commit.

tkelman commented 7 years ago

If the checks fail, you'd need to back out pulling into the registry fork and redo it after the author addresses the issues.

This is getting to be a lot of machinery to expect small organizations to maintain their own instances of.

StefanKarpinski commented 7 years ago

Why would you need to back anything out? Git commits are immutable.

tkelman commented 7 years ago

Not everyone has enabled branch protection - people do occasionally force push to master of packages. They shouldn't be doing that, but if they do we wouldn't want it to mess up the registry's fork.

StefanKarpinski commented 7 years ago

Force pushing a branch doesn't destroy commits, it just changes the commit that a branch points at.

tkelman commented 7 years ago

Depends exactly what "pulls git data into its fork" means then, and where the checks happen. If checks happen in a completely from-scratch clone wherever it's running and don't push anything back to the github copy of the fork unless the checks pass, then it's fine. Pulling into an existing clone's master after a force push is where things can go wrong.

StefanKarpinski commented 7 years ago

I will make the wrong choice so that we can argue about it.

simonbyrne commented 7 years ago

Pulling into an existing clone's master after a force push is where things can go wrong.

I think "pull" may be the wrong word here: for the metadata fork, I envision the process as something like the following:

git fetch upstream
git checkout HASH
# run tests
# if tests pass
git tag -s -m "..." vX.Y.Z
git push registry vX.Y.Z

(here upstream and registry are the respective remotes). In other words, no branches are involved. This doesn't solve the problem of getting the tags back to upstream, but I don't know if that is such a big deal as the user won't be pulling from it.

I'm not sure about the commit vs tree hash issue, but my experience has been that trees are often harder to work with as they're not really a "user facing" feature of git.

Also, I'm not really sure how we would handle non-git sources either.

simonbyrne commented 7 years ago

one other thing to think about: who "owns" the version numbers. In what I outlined above, it would be the registry, not the package (as emphasised by the fact that it is the registry signing the tag).

I'm not sure how this would work in the case of a package being in multiple registries (who decides whether or not it is a valid version?)

tkelman commented 7 years ago

I will make the wrong choice so that we can argue about it.

Was that really necessary? "This sort of response is not constructive" either.

It's fairly obvious that having two possible sources for a fact is more complicated and confusing than only having a single possible source for it.

We haven't actually solved this problem if everything is duplicated in both the registry and the package. One should take priority over the other. If we design this whole system to ensure they're equal in most normal usage, you still need to pick which to use in case of local divergence or development. Local development probably points to preferring the package's copy, but how local development is supposed to fit with the rest of Pkg3 has not yet been described here.

One of the copies of this information is a duplicate and somewhat redundant. It sounds like we're moving towards a very registry-driven design. In use cases other than local development, the package's copy (and upstreaming registry-driven compatibility changes back to it) is fairly vestigial. You want to be able to do dependency resolution without having to first download every version of every package. How would version resolution work on an unregistered package? Right now, unregistered packages have no versions - how would Pkg3 change that?

Archiving past versions is a good idea, but doing so by having every registry also maintain git forks of all its packages is making our "github as cdn" abuse worse.

StefanKarpinski commented 7 years ago

Yeah, tagging versions is complicated. We may need a "two phase commit" process.

StefanKarpinski commented 7 years ago

I will make the wrong choice so that we can argue about it.

Was that really necessary? "This sort of response is not constructive" either.

My point is that your attitude to this discussion has been fundamentally uncharitable and contentious. In this particular instance, there are two ways to do a thing, and instead of giving me the benefit of the doubt that I'm not a moron and will pick the one that works, you assume that I'll do the wrong thing and then argue with me based on that assumption. This attitude is frustrating, comes across as disrespectful, and mires us in unnecessary arguments instead of collaborative exploration of the solution space to find something that addresses everyone's concerns.

We haven't actually solved this problem if everything is duplicated in both the registry and the package. One should take priority over the other.

Replicating immutable data isn't a problem. That's the principle behind git and most other successful distributed data stores. Having multiple copies is only a problem if they are mutable.

It sounds like we're moving towards a very registry-driven design.

Quite the opposite. If anything, the package repository is primary and registries are just copies of immutable, append-only metadata about package versions, copied from the packages.

How would version resolution work on an unregistered package? Right now, unregistered packages have no versions - how would Pkg3 change that?

This is a good question. I was considering just using tags for versions in unregistered packages. But of course, you generally don't want to bother tagging versions if your package isn't registered, so I'm not sure what the point is. Instead, I think one would just use an environment file in the git repo to synchronize unregistered packages in lock-step (a la MetaPkg), but their dependencies on registered packages can be looser via compatibility constraints in the unregistered package repos.
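Such an environment file might look roughly like this (the schema and all names below are invented for illustration; Pkg3's actual format was not settled in this thread):

```toml
# Hypothetical environment file (invented schema, illustration only)
[deps]
# unregistered packages synchronized in lock-step by source + revision
InternalPlots = { repo = "https://example.com/acme/InternalPlots.jl.git", rev = "abc1234" }

[compat]
# looser constraints on registered dependencies
JSON = "0.8"
```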

Archiving past versions is a good idea, but doing so by having every registry also maintain git forks of all packages is making our "github as cdn" abuse worse.

How else would you do this? If you want to keep an archive of a package's git history you have to make a fork of it in case it goes away at some point. Using git for source delivery has problems, but that's an orthogonal issue.

StefanKarpinski commented 7 years ago

Maybe we should separate the two jobs of a registry:

  1. Validation: checking that a proposed version makes sense – that it satisfies various checks.
  2. Collection: keeping package and version metadata in a centralized location.

The former is the part that requires intelligence and automation while the latter is dead simple.

tbreloff commented 7 years ago

And don't forget #3: user api. Make it dirt-simple for everyone involved to follow best practices... So then they might.

I agree these can be designed separately.

tkelman commented 7 years ago

There are many more than 2 ways to do something that is "intentionally a bit vague" and unclearly specified. I've been contentiously arguing against aspects of the design that I don't think will work. Several of which it looks like we've moved away from, but it took discussion. Take it at technical face value, please.

Dependency resolution can require global information, which is why registries contain compatibility information for all past versions. Getting the equivalent set of information if the package copy is the primary source would require either downloading all versions, or getting information out of git for many versions simultaneously in a way that we don't currently do anywhere to my knowledge. The latter would make the goal of allowing packages to not have to be git repositories less feasible.

If we're only archiving releases that get published to a registry, then why would the git history be needed? If packages are immutable after installation then they can just be source tarballs, and an archive can work like most conventional package managers, just a collection of source release snapshots.

StefanKarpinski commented 7 years ago

I was actually thinking of separating them entirely. I.e. first you submit a proposed version to various validation services: services that check things like that the proposed version metadata is well-formed, that its tests pass, that it works with various versions of its dependencies, that it doesn't break various versions of its dependents. Once you've got ok/error from a validation service or services, you can go to a registry and submit that and then the check at the registry is just that the sufficient set of validations have passed. I can even imagine private packages being submitted to cloud-hosted validations services and then registered privately. The set of validations that a version has passed can be attributes of the version; people can filter packages/versions based on validations that it has.

StefanKarpinski commented 7 years ago

If we're only archiving releases that get published to a registry, then why would the git history be needed? If packages are immutable after installation then they can just be source tarballs, and an archive can work like most conventional package managers, just a collection of source release snapshots.

If someone deletes their git repo, we want to be able to make another full git repo the new source of the package. We need a fork to do that. I'm not sure why you're arguing this point.

StefanKarpinski commented 7 years ago

I'm not sure what your point about global version information is.

tkelman commented 7 years ago

Don't we also want to make Pkg3 robust against the "package developer force pushed over master" scenario? So tags need not all be linear or have common descendants? We'd want it to be possible to restart development from a non-git copy of a deleted repo with a fresh git init from scratch, wouldn't we? (Or the "rebased to remove large old history" situation that has come up a few times.)

The scheme of propagating tags through forks sounds overly complex and unnecessary, and a lot to set up to run a registry. And now we have multiple mutable remotes for any given package - this could get confusing in terms of issue and PR management, if all the downloads are coming from a fork that users should actually ignore.

The point about global version information is that the head copy of a package's compatibility contains less information than the registry's copy. Except for the author at tag time, everyone else could delete the package's copy and not notice. "Package is primary" is the remaining item of dispute here, afaict.

StefanKarpinski commented 7 years ago

I agree that propagating tags through forks is complicated and maybe impractical. We'll have to see. The main thing we need is copies of the git history for the commits behind various tagged versions, but that could be a separate process from registration.

tkelman commented 7 years ago

If we have a reliable registry-controlled mechanism of obtaining a copy of the release snapshot source with a matching checksum, does it actually need a copy of the git history? Thanks to github it's oddly easier to get straightforward hosting of a full git repo (up to its size limits, anyway) than it is to host arbitrary non-git source snapshots, but I wonder whether we're letting that ease of use drive the design decisions.
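That registry-controlled checksum mechanism could be as simple as this sketch (function names are illustrative, not an actual Pkg3 API):

```python
import hashlib

# Sketch: with a registry-recorded checksum per release snapshot, the
# client can verify a downloaded source tarball without needing any
# git history at all.
def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def verify_snapshot(data: bytes, expected: str) -> bool:
    """True if the downloaded bytes match the registry's checksum."""
    return sha256_of(data) == expected
```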

martinholters commented 7 years ago

Wouldn't future support of non-git-based packages be problematic if releasing a version would include cloning its git history? Ok, of course one could replace that with "cloning its version history in whatever VCS is being used", but that would make registries much more complicated, as they would have to accommodate every VCS used by packages they want to register.

tkelman commented 7 years ago

We should move this aspect of discussion to its own issue, but I think it's totally reasonable today to require that Julia packages must have a git repo (or git mirror of something else) as the development source of record. What we should try to keep feasible is allowing the flexibility of downloading release tags at install time to users' systems in a form other than a full git clone though.

simonbyrne commented 7 years ago

Archiving past versions is a good idea, but doing so by having every registry also maintain git forks of all packages is making our "github as cdn" abuse worse.

As I understand it, GitHub is fairly intelligent about not unnecessarily replicating data across forks (thanks to git's immutable objects), so I don't think this is really an issue.