Deterministic Builds - Githubissues

thehydroimpulse commented 10 years ago

What do you guys think of adding some sort of deterministic bundling a la a lock file? It's quite clear why npm, bower, and the rest are broken. They don't have deterministic builds. After working with Cargo (Rust's newish package manager), it's quite clear how important and useful a lock file is.

Lock files would only be committed for end apps, not for components/libraries. Now I know the npm community and such don't particularly follow semver, which is a shame, and considering you guys are just using Github as a registry, I don't know if it would be effective, since there's no stopping someone from messing with the history without bumping versions. But, I think the frontend sphere is greatly broken without these idioms.

Thoughts?

lancejpollard commented 10 years ago

Yo! Would you mind describing more what you're imagining just so we're all on the same page?

dominicbarnes commented 10 years ago

I'm not sure I follow how a lock-file means that a build is deterministic? (I thought deterministic had a much broader definition)

yields commented 10 years ago

nice idea, maybe it could be even a separate command duo lock or something.

thehydroimpulse commented 10 years ago

@lancejpollard @dominicbarnes Basically, a lock file (currently used by Bundler and Cargo) adds a new file called x.lock after resolving the dependencies. The lock file is ignored in git for libraries, but committed for apps and binaries.

Right now, you may not have the exact same version as someone on your team. Thus, it's non-deterministic.

A lock file is just a metadata file that acts like a contract of each dependency and their version (or their git commit hash). That way, every single person working on an app or binary will have all the same dependencies

For example, say you have:

var color = require('org/color');

And you grab the dependency (most-likely from master). If the dependency changes, and you rebuild, things might break. There's no contract.

A lock file automatically manages the exact version of each dependency instead of the user having to do it. Thus, in this case, you might have a lock file with:

# duo.lock
{
   "dependencies": [
      "color 0.0.1 (git+https://github.com/org/color-rs.git#bf739419e2d31050615c1ba1a395b474269a4b98)"
   ],
   "packages": [
      { 
        "name": "color", 
        "version": "0.0.1", 
        "source": "git+https://github.com/org/color-rs.git#bf739419e2d31050615c1ba1a395b474269a4b98"
      }
    ]
}

Now, this file would be committed in git with your app. When someone pulls it down, they will get the exact same dependencies and will always successfully build the app.

Npm doesn't do this and it's a huge pain! One of bower's dependencies changed and broke bower, resulting in Bower being completely broken for a day! This is quite problematic and I think a front-end system that adopts a lock file mechanism will be so much better!

thehydroimpulse commented 10 years ago

@yields That would work, but it should really be generated (if one doesn't exist already) whenever you initiate a build where you need to fetch dependencies. Now, for libraries, this would be added to your .gitignore, but apps or binaries would have it in git.

stephenmathieson commented 10 years ago

wouldn't checking in ./components/duo.json effectively do this?

lancejpollard commented 10 years ago

+1 yeah this would definitely be good to have. Another small reason, if you have a big project with a bunch of require('org/repo'), but no versions, then you may go in and add some static versions in a few of them like require('org/repo@0.1.2'), but if there are hundreds or thousands of require statements, I doubt that would happen.

If instead, you could just lock the versions and it created some lock file like this, it would be automatic.

Yeah seems like checking in that file might do it, except it has the source of every file in it, which you definitely wouldn't want as part of your git history, so it should be simplified imo.

thehydroimpulse commented 10 years ago

@stephenmathieson Versions aren't enough, especially with Github as a central registry (or git in general). You would also need the sha hash. The lock file also only contains contractual vendoring agreement because a lock file would potentially be committed in git. As a result, you want it to be super minimal and explanatory.

Moreover, it only changes when the dependencies change. The duo.json contains a bunch of other stuff in there that would result in a bloated file being committed, and it changing every build, which is not what you want.

If you want to update your dependencies, you could issue a duo update: which would drop the current lock file and create a new one, with updated dependencies.

matthewmueller commented 10 years ago

@thehydroimpulse i like the idea of more consistent builds that you can easily rollback. I haven't put enough time into thinking about how that would look but consistency is definitely a goal of duo.

I think the way docker builds images is pretty cool and I think we could incorporate some of those ideas here.

thehydroimpulse commented 10 years ago

@MatthewMueller Cool. Are you guys cloning the dependencies? The main thing would be adding git sha hashes to the metadata throughout the build process. That'll allow the possibility to lock-in based on that.

matthewmueller commented 10 years ago

Nah, we're streaming the tarballs from github. Pretty sure you can get the sha's during the resolve process: https://github.com/yields/gh-resolve/blob/master/index.js#L26

But that maybe only works when you actually have to resolve (ex. 0.0.x, not 1.0.0). Worth looking into!

naholyr commented 10 years ago

Not directly related to this issue but you say npm is broken by not having deterministic build, and then talk about a "lock file". What about npm shrinkwrap goes not cover your needs?

This is not just a rant, but it could be a good start to define the "right" way to implement this deterministic build.

callahad commented 10 years ago

What about npm shrinkwrap does not cover your needs?

Shrinkwrap only enforces version numbers, not actual package contents.

Imagine you depend on foo/bar@v1.2.3. The author of that library can push a new version, but re-use the v1.2.3 tag. All of a sudden you're installing different code than before, even though it claims to be the same version number.

naholyr commented 10 years ago

Well, that's an extremely rare case but Ok. So your lock file pointing to a git commitish would be the way to go?

theefer commented 10 years ago

Shrinkwrap only enforces version numbers, not actual package contents.

NPM doesn't allow updating the content of a release once it has been published (you have to make another release), so version numbers are equivalent to package content. Therefore, shrinkwrap does lock the code you depend on to be exactly the same. Note that it also does so for all transitive dependencies.

thehydroimpulse commented 10 years ago

There are two issues with npm shrinkrap.

It's not done by default. It's not deterministic if it's rarely done. A lock file should always be generated.
Versions don't work with git. Because duo is being built on git as a registry, you need commit hash too. Versions mean nothing because you can easily overwrite and change the history. Thus, you need to lock to a particular hash.

theefer commented 10 years ago

True it's not by default, but other systems that provide a lock mechanism (e.g. Bundler) also leave it optional. Having it available, even as an option, is indispensable though.

And yes Git lets you rewrite history, we were just talking about the case of NPM, which doesn't. For Git, you'd need hashes, which isn't really the nicest way of tracking versions, is it...?

callahad commented 10 years ago

NPM doesn't allow updating the content of a release once it has been published (you have to make another release), so version numbers are equivalent to package content.

That puts a lot of faith in NPM. Storing content or commit hashes means you can determine whether or not what you get really is what you expect. "Trust, but verify."

The Mozilla Persona team actually caught a modified-and-republished package on NPM last year, so it absolutely happens. I can dig up the commit for reference if you're curious. We detected the change thanks to our use of npm-lockdown.

...but that's besides the point. Duo uses GitHub, not NPM, and Git tags are mutable. :)

theefer commented 10 years ago

It may be because immutable NPM releases only got introduced early this year, see https://github.com/npm/npm-registry-couchapp/issues/148

But yes, that's beside the point for Duo.

matthewmueller commented 10 years ago

So after chatting a bit with @guille, I think the right approach will be to save the sha in our cache, which is in located in components/duo.json.

From there we can choose at any time to build from the sha or resolve and fetch the latest version that satisfies semver.

Any thoughts on that?

thehydroimpulse commented 10 years ago

@MatthewMueller I think you'd need a separate file mainly because you'll likely be committing it in git. The existing cache file has some other metadata that would add too much noise.

ianstormtaylor commented 10 years ago

Something like:

component.json

{
  "name": "my-component",
  "dependencies": {
    "component/emitter": "*",
    "matthewmueller/uid": "*"
  }
}

component.lock.json

{
  "name": "my-component",
  "dependencies": {
    "component/emitter": "27f3d7cd66e1170cba54e6715bfe0e0c148843f2",
    "matthewmueller/uid": "923bfb02fbecc65690120eed7dbff52bc61d9f1f"
  } 
}

One issue I see is that it means we can only depend on external sources that provide commit hashes, which means that having require('npmjs.org/to-camel-case') in the future would be impossible?

thehydroimpulse commented 10 years ago

You would essentially have both a version and hash for each git dependency. Other sources can simply have a version. On Aug 22, 2014 12:56 PM, "Ian Storm Taylor" notifications@github.com wrote:

Something like: component.json

{ "name": "my-component", "dependencies": { "component/emitter": "", "matthewmueller/uid": "" }}

component.lock.json

{ "name": "my-component", "dependencies": { "component/emitter": "27f3d7cd66e1170cba54e6715bfe0e0c148843f2", "matthewmueller/uid": "923bfb02fbecc65690120eed7dbff52bc61d9f1f" } }

One issue I see is that it means we can only depend on external sources that provide commit hashes, which means that having require(' npmjs.org/to-camel-case') in the future would be impossible?

— Reply to this email directly or view it on GitHub https://github.com/duojs/duo/issues/220#issuecomment-53105903.

matthewmueller commented 10 years ago

Do you think component authors should have a way to create these deterministic builds? So if I download say component/tip that's pinned/locked, should it follow the lock's guidelines? Or do you think consumption is not as important?

lancejpollard commented 10 years ago

Just wanted to throw out another practical example of why something like this will be vital to have.

At segment.io we have over 500 npm modules and components, maybe more than 1000, not to mention the many external open source components/modules we use. Each package.json and component.json has let's say an average of 10 dependencies (many have just 2-5, many have 20+). The "aggregation" projects (larger modules, or apps) have dependencies that are constantly being updated.

The way that it is now, if you have this sort of aggregation component, such as this:

https://github.com/segmentio/analytics.js-integrations/blob/master/component.json https://github.com/segmentio/analytics.js-integrations/blob/master/package.json

And each of those dependencies have their own component.json/package.json, which have their own... We regularly have to update 3 or 4 projects every time there is a change to one. We are able to only have to update 3 or 4 because we don't have hardcoded versions like 1.3.2 on every single dependency, which would be insane. If we hardcoded every dependency, then we would move at a snail's pace. Just to update a library like component/each to the latest, we would have to go through every dependency in the tree on analytics.js-integrations and update it, git commit, git release, npm publish, etc., which would make it so the simplest change takes hours or days to make.

Instead, we want to be able to make these changes quickly. So we end up compromising and doing 1.3.x. This means we can make a change to some nested dependency, and just bump the patch, and we don't have to update everything that depends on that component (and git release, npm release, etc.).

But it's still not ideal, because we aren't perfect and sometimes something slips through the cracks, and a change to some nested dependency breaks everything, and it takes hours or sometimes days to figure it out and fix it so we can continue making progress. While we're figuring this out we're basically blocked, but since there's always more to do we are able to switch tasks and start on some other unrelated thing, meanwhile racking our brain on what could possibly have changed, until our brain can finally connect the dots and figure it out.

This is a big time sink, having to figure out these bugs that pop up because the versions aren't locked.

Instead of having this:

fixed versions in every single dependency in your dependency tree, which is prohibitively expensive, making the most trivial of changes take hours or days (because you have to re-release/publish everything that depends on it)
relaxed versions like 1.3.x, which decrease the amount of time a small change takes, but creates room for random bugs to block you for hours or days as well

Instead of having that, having a lock would solve this, and the workflow would be more like:

For the apps, or "aggregation" components (or whatever you want to call them), lock all of the dependencies to fixed versions like 1.3.2 (or using the git hashes, whatever it turns out you guys think is best).
Then, have space allocated in your workflow to consciously/manually upgrade a specific dependency. This will take a good amount of effort to do, but it will mean for the bulk of the work you do on a daily basis, you won't ever run into random (and difficult to resolve) bugs that sneak into the code because the versions weren't locked.

Then, when you do want to upgrade a dependency, you won't have to go through dozens or hundreds of components and change the version number in component.json from 1.3.2 to 1.3.3, and have to then git commit, git push, npm publish, etc. for every component that depends on those, saving you a lot of time when you actually want to make the upgrade.

So, the lock file will be a huge win-win.

matthewmueller commented 10 years ago

So it seems like we're talking about an shrinkwrap.json with sha's instead of versions. Since its going to get committed anyway. I never really liked having this extra manifest, can we come up with anything that's better but still unobtrusive?

matthewmueller commented 10 years ago

@lancejpollard yah, I totally share this pain. I spent 2 weeks trying to upgrade Cloudup to component but eventually gave up, because it was such a mess.

I'd like the ability to pin and unpin or lock and unlock at will. So when today is an "upgrade" day, you unlock, upgrade your deps, deal with any pain, and relock. Then hopefully dependency issues wouldn't crop up while you're doing other things.

Gozala commented 10 years ago

I don't think doing it only at the app level is good enough let me explain why. Typically in npm I use specific versions in my libraries but dependencies of my library not always do. Unfortunately that means that my library can get broken if dependency of my dependency is published and falls into a range specified by my dependency I have no control of.

So I do think .lock file (or maybe better .duo) file should always be generated and maintained by duo regardless weather it's library or an app.

I also would argue that allowing only versioned requirements is a way to go as it that makes it clear by just looking at the code what has being imported otherwise I would have to scan through all the .lock files to figure out what actually has being imported. Not to mention that if my two dependencies require different version of library but do so via non version-ed require it would not be possible to implement node support #7 without getting into recursive structure vs current flat structure.

Gozala commented 10 years ago

So while #235 is related to this it is not necessary the same as you can have either or both of this in place.

ianstormtaylor commented 10 years ago

How does .locks for public libraries work when they conflict with the lock of your application? For example, the public library has locked on component/events@1.1.2 resolved from component/events@1.1.x, but in my application a dependency has component/events@^1.1.3. I'd want to be able to not have to duplicate that dependency just to be able to use it.

Seems like only the top level should have a lock, but that lock should be nested like npm's shrinkwrap is, locking all the way down the dependency tree?

On Fri, Aug 22, 2014 at 1:17 PM, Irakli Gozalishvili < notifications@github.com> wrote:

So while #235 https://github.com/duojs/duo/issues/235 is related to this it is not necessary the same as you can have either or both of this in place.

— Reply to this email directly or view it on GitHub https://github.com/duojs/duo/issues/220#issuecomment-53114953.

thehydroimpulse commented 10 years ago

In terms of the metadata, there are two solutions:

Git: Store both the version and hash
Other (such as npm): Store the version (considering it's immutable in npm's case)

You could have a duo.lock or whatever:

{
    "package": "foobar",
    "version": "0.1.0",
    "dependencies": [
      "depA 2.4.6 (git+https://github.com/org/depB.git#d4ad24a1eb07c68e324185a1441a43283dad5f2d)",
      "depB 1.2.9 (git+https://github.com/org/depB.git#3ca4197944f9e59436fd2a7e4ef76e76b559ff64)"
    ],
    "packages": [
      {
        "name": "depA",
        "version": "2.4.6",
        "source": "git+https://github.com/org/depB.git#d4ad24a1eb07c68e324185a1441a43283dad5f2d"
      },
      {
        "name": "depB",
        "version": "1.2.9",
        "source": "git+https://github.com/org/depB.git#3ca4197944f9e59436fd2a7e4ef76e76b559ff64"
      }
    ]
}

thehydroimpulse commented 10 years ago

To get a better idea on what is being locked, check out a sample Cargo.lock https://github.com/PistonDevelopers/hematite/blob/master/Cargo.lock

theefer commented 10 years ago

FYI using version locks for apps but not libraries is also the recommended way in Ruby (Gemfile, Gemfile.lock), for the reasons evoked above.

Gozala commented 10 years ago

Locks in libraries is not a problem, problem is that require("foo/bar") may mean different things based of the lockfile. So why not solve it by just making version mandatory then the problem is solved.

Gozala commented 10 years ago

How does .locks for public libraries work when they conflict with the lock of your application? For example, the public library has locked on component/events@1.1.2 resolved from component/events@1.1.x, but in my application a dependency has component/events@^1.1.3. I'd want to be able to not have to duplicate that dependency just to be able to use it.

That is not problem of locking that is a problem of overriding what your dependencies require based of some constraints (like I don't want two version of same library). Which is reasonable requirement but I think that's a totally different feature and should probably be added separately as well.

lancejpollard commented 10 years ago

@Gozala the problem with having the version mandatory for every component/module is like I was expressing above. It means that a small bump in a patch to some deeply nested component means you have to manually update every single thing in the chain, which is a big time sink. Not only that, it is easy to miss some place where that dependency was used, so now you have 2 versions being used, which may be harmless or might be confusing to debug as well, which means more complication and time wasted. What is your experience with lock files so far?

The lock file will just save the current versions of everything, which means that

problem is that require("foo/bar") may mean different things based of the lockfile

doesn't happen. After you've installed everything, then the lock file is created, then the requires will work like they did the first time every time. It keeps track of what that require('foo/bar') means relative to the file it's in. It may be in a component using version 0.1.2 of foo/bar, while another place with require('foo/bar') may be using 0.3.7, so it would still work. Maybe I'm misunderstanding what you're imagining? If so, could you add some example code of what you are imagining a workflow would look like?

Gozala commented 10 years ago

@Gozala the problem with having the version mandatory for every component/module is like I was expressing above. It means that a small bump in a patch to some deeply nested component means you have to manually update every single thing in the chain, which is a big time sink.

Yes I do agree that is a tradeoff of this approach although there is no reason why we could not improve tools (like duo) to assist us with this so it's not such a big time sink.

The lock file will just save the current versions of everything, which means that

problem is that require("foo/bar") may mean different things based of the lockfile

doesn't happen. After you've installed everything, then the lock file is created, then the requires will work like they did the first time every time. Maybe I'm misunderstanding what you're imagining?

I think you imply that locking also does overriding while I do not. For example I think it's reasonable if two of my dependencies will end up depending on different version of foo/bar. While it is useful to have a way of overriding their choices I do not think that should be implied by locking.

thehydroimpulse commented 10 years ago

Note that once you generate a new lock, it's the source of truth. Meaning, as long as there's a duo.lock (whatever it's named), it'll be used to determine which version of each dependency to build. The generation of the lock file would likely happen after the first resolve. That first resolve would take semver into account and do dependency resolving to try and minimize duplicate versions of the same library. NPM does the complete opposite. It always grabs duplicates, since it's not a big deal. Bundler, on the other hand, only grabs a single version of a library (if I'm not mistaken). Cargo tries to be in the middle of the two where it'll try and minimize duplication, but if that's not possible, it'll include duplicates. I think duo would be correct to assume the same position as Cargo's.

To do another resolve phase, you'd either delete the lock file or you can run a duo unlock command. That'll start the whole process over again and re-lock everything.

Edit: I should also add that at the app/executable level, everything is locked. That's crucial to ensure a deterministic build process.

lancejpollard commented 10 years ago

@Gozala

Yes I do agree that is a tradeoff of this approach although there is no reason why we could not improve tools (like duo) to assist us with this so it's not such a big time sink.

Yes this would be very cool. Do you have any specific thoughts on how this might work? We have thought some about doing stuff like this, but it is potentially very complicated and would end up not being used.

For example, you could have a tool that automatically bumps all of the versions of every related repo to the new patch. Seems straightforward. But what if you want to run tests for each repo before building, and some test fails? You would then have to know where exactly this "automated" system left off, and do it manually from there, which makes it kind of confusing because you don't know exactly where you should start, or if things are in a weird state. Also, maybe you want to keep other repos locked on the old version, so now you don't want everything to auto-update and auto-push.

Or maybe instead, there is a tool that lists every place that dependency is used, and which version it is. Actually I think duo does this with duo ls. That makes it easy to at least know where you should be updating stuff. But still, if versions were locked, you would have to go to potentially 10 or 20 places and manually update them. So maybe you write a script to automate this, that cds into each repo, make test ; git release ; npm pubish..., which could work but seems error prone.

Either way, would love to find better approaches, but we have tried specifying specific versions for everything, and with a large amount of repos/components, it is unfeasible. Every small change takes a few hours, and when you have to constantly make small changes like this, you get to the point where you aren't actually working on the product anymore, and are instead doing this manual labor constantly. And it's one of those things that if you tried to automate, you'd spend lots of time building a system and then it would never quite work right. So if it were me I would not go down that path.

Gozala commented 10 years ago

I guess there could be something like duo.dependencies file that would be a tree of dependencies an versions, that you could go tweak as pleased and then call something like duo update to reflect your tweaks maybe ?

thehydroimpulse commented 10 years ago

@Gozala That would be equivalent to the lock file except with the lock file, you never touch it yourself. If you update a dependency in your component.json, the next time duo runs it could refresh the lock file, which you would later commit.

Gozala commented 10 years ago

What I think would be the best approach and maybe I am missing something crucial here:

Make versions in requirements mandatory so by looking at module it's clear what is benig required. We could argue that libraries may point same tag to a diff commit, but I that is unlikely and if you happen to deal with such library just use hash or clone it and depend on your clone.
At the application level there should be some kind of configuration file where you could override override versions. For example you could say foo/bar@1.0 = foo/bar@2.0 that way anything depending on 1.0 version will still get 2.0.

lancejpollard commented 10 years ago

Here is another example of a lock file:

https://github.com/kitematic/kitematic/blob/master/meteor/smart.lock

lancejpollard commented 10 years ago

Has anyone started on this? If not I may if that works.

thehydroimpulse commented 10 years ago

@lancejpollard I played around a little bit with getting a lock file implemented but didn't have any time to do much with it, so I'm not currently working on an implementation right now.

matthewmueller commented 10 years ago

Sweet! It'd be cool to implement this as a plugin during the resolution phase. I've been playing around with adding hooks in another library. Maybe we could add something like this:

duo.before('resolve', function(file, entry) {
  // rewrite require(...) or @import
});

duo.before('parse', function(file, entry) {
  // do what plugins do now.
});

duo.after('parse', function(file, entry) {
  // some post processing or cleanup
});

With the recent updates to ware we should be able to do this pretty easily.

matthewmueller commented 10 years ago

More fluent:

duo
  .before()
    .resolving(fn)
    .resolve(fn)
  .after()
    .parse(fn)

matthewmueller commented 10 years ago

cc/ @ianstormtaylor, we had similar ideas on models.

duojs / duo

Deterministic Builds #220

component.json

component.lock.json