
RFD 145: Lullaby 3: Improving the Triton/Manta builds #105

Open timfoster opened 6 years ago

timfoster commented 6 years ago

This is for discussion of RFD 145

https://github.com/joyent/rfd/blob/master/rfd/0145/README.md

jclulow commented 6 years ago

If we do switch to using eng.git as a submodule, we should stick it in the same place that we put other submodules; i.e., deps/eng in the repo. I did try doing this (a long time ago now) in medusa, where eng.git is a submodule. Instead of using a simple variable to ensure immediate execution, I believe we used the ability of gmake to make makefiles. It was basically fine, but I would note that git submodules are really only barely less tedious and busy work-inducing than copying the files around.

I'm not really sure I agree that the build should warn if the latest eng.git changes haven't been pulled in, or that we should necessarily try to keep every repository up-to-date for its own sake. In the past we've pulled in changes as resources were available to do so, or when we needed a new feature (e.g., the node_modules stamp-based build support). Obviously if some external factor means that the eng.git version in use in a particular (or every) repo is now suddenly not working, we'd need to update them all at once to fix that -- but if things are still working, I'm not convinced there's a huge impetus to do the review and testing across the entire software estate for every eng.git change.

What's the plan for actually managing package-lock.json files? It seems like this means a critical bug fix in something commonly used like bunyan will now require us to touch every part of the software estate to roll that out. Presumably we'll also need to have a new automated process that periodically checks that we haven't prevented ourselves from taking on security fixes in every transitive dependency we have in every repository as well.

timfoster commented 6 years ago

I was putting it outside deps/ since it wasn't a component dependency, rather a build-time dependency, but I really don't mind either way (one less directory traversal, but meh). The mechanism medusa uses looks good too, though I'm not sure which is the better approach.

I somewhat agree that eng-as-a-submodule is little better than copying Makefiles today, when there are few changes to its content, but am not wild about the potential for components to cherry-pick Makefile updates that may not be necessarily tied to a well-described bug/commit that was filed for the corresponding eng change. My hope for the future though is that we start bundling more build tools in eng which are likely to change more frequently, with "buildimage" being the first candidate.

We have no way to distribute build tools across all components without doing this (similar to usr/src/tools in ON, or /opt/onbld as packaged on build machines that get pkg updated biweekly) so eng as a submodule seems like the best way of achieving that. I'd also hope that when we have a list of triton/manta components that use eng, we'll be able to write some automation to propagate the changes across components (taking lots of care not to break anything)
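A dry-run sketch of what that propagation automation might look like (the repository names, commit message, and the plan-then-review structure are all illustrative assumptions, not an existing tool):

```shell
#!/bin/sh
# Sketch only: roll an eng.git submodule update out across component repos.
# The repo list is a placeholder; a real tool would discover eng.git consumers.
REPOS="sdc-imgapi sdc-vmapi manta-muskie"

# Build the plan first so it can be reviewed before anything is committed.
PLAN=$(for repo in $REPOS; do
    echo "git -C $repo submodule update --remote deps/eng"
    echo "git -C $repo add deps/eng"
    echo "git -C $repo commit -m 'update deps/eng submodule'"
done)
echo "$PLAN"
# A real tool would then execute each step, run the component's build and
# tests, and stop at the first failure rather than blindly committing.
```

The point of emitting a plan rather than executing directly is the "taking lots of care not to break anything" part: a human or CI gate can inspect exactly which repositories will be touched.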

For dealing with package-lock.json, absolutely yes - I'd like to write automation that removes package-lock, does a build, then compares the newly generated file to the committed version, giving us visibility into exactly what's changing. This could be done periodically as a jenkins job, or on-demand: my main worry today is that we don't seem to be tracking our changing dependencies at all
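A minimal sketch of the comparison step described above (the file names and sample lock contents are fabricated for illustration; a real job would regenerate the lock file with a clean npm install in a fresh checkout):

```shell
#!/bin/sh
# Sketch: detect drift between a committed package-lock.json and a freshly
# regenerated one. The two sample files stand in for "committed" and
# "regenerated"; the bunyan versions here are made up.
cat > committed-lock.json <<'EOF'
{ "dependencies": { "bunyan": { "version": "1.8.12" } } }
EOF
cat > regenerated-lock.json <<'EOF'
{ "dependencies": { "bunyan": { "version": "1.8.14" } } }
EOF

# diff exits non-zero when the dependency graph has changed.
if diff -u committed-lock.json regenerated-lock.json > lock.diff; then
    DRIFT=no
else
    DRIFT=yes
fi
echo "dependency drift: $DRIFT"   # prints "dependency drift: yes"
```

Run periodically as a Jenkins job, the saved lock.diff gives exactly the visibility into changing dependencies that's missing today.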

cburroughs commented 6 years ago

Thanks for putting this together. It will be great to get to a point where people can build a fully working image on their first day instead of maybe by the end of their first year.

I think there is also a somewhat different tradition within Manta and Triton on npm dependencies. Manta components have been somewhat more likely to use floating dependencies, while Triton components tend to specify exact versions. npm, of course, as you noted, doesn't give you any useful tools for corralling transitive dependencies short of locking the whole graph.

This may be somewhat Triton-centric, but another point of context is that most of our tests are of the brittle and slow end-to-end variety. We have very few unit tests. In other words, we don't have a viable workflow for validating a high volume of small changes, which is why so many changes today are large and semi-manually tested. I think that's a downward spiral we need to pull out of, but it will be a long arc to do so.

I don't have a specific question here; rather, I think it would help if the RFD laid out some options for how the lock files could be managed, and how multiple strategies could co-exist (or not).

timfoster commented 6 years ago

No problem - thanks for taking a look! In terms of build performance, I was just going to ptime an image build before and after the change.

I don't know the answer to your question about delegated datasets - it sounds like "yes, you can't use just cloudapi" is the answer though. What's the workaround? Is there a way to modify an instance as an operator to allow the use of delegated datasets?

Submodules

I've spent a while playing with these, and find myself wishing we had them to manage usr/src, usr/closed, usr/fish and usr/man back in the Solaris org! To answer your questions though:

Wrt. other work like buildimage, I had hoped to use eng as a place to store more common build tools. In particular

While we could deliver buildimage as a node package, I don't really see the value of doing that - our build is the only thing that'll ever need it - would we ever try to install and run it outside of a build workspace?

Reproducible builds

There are two things we need to be able to get there, from what I can tell:

The former seems doable, but I can't do it alone - it requires modern npm, which in turn requires modern nodejs, and doing that means more familiarity with node, and with the way our components use their dependencies.

Without the former, we could still do the latter, writing tooling to compare two adjacent builds to see where date-stamps are creeping in etc. https://diffoscope.org/ might be able to help with some of that. My feeling is that by starting to look at the common reasons builds are not reproducible today, and addressing those as we uncover them across components, we should be able to add to engdoc advice (and eng prototype or shared Makefiles) and ideally add more tooling to the build to catch instances where builds are not reproducible. (makefilelint?)

For two builds within the same workspace, being able to at least generate a report showing how the two built images differ would be useful for me as a developer, providing assurance that the only changes in the image to be tested are the ones I introduced.
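A rough sketch of such a report, comparing two build output trees by checksum manifest (the directory layout and file contents here are fabricated stand-ins for two adjacent builds; diffoscope would give much richer per-file output):

```shell
#!/bin/sh
# Sketch: report which files differ between two adjacent builds.
# build-a/ and build-b/ are fabricated examples; buildstamp models a
# date-stamp creeping into the image while server.js is reproducible.
mkdir -p build-a build-b
echo "same bits"      > build-a/server.js
echo "same bits"      > build-b/server.js
echo "built at 10:01" > build-a/buildstamp
echo "built at 10:02" > build-b/buildstamp

manifest() {
    # One "checksum size path" line per file, in stable order.
    (cd "$1" && find . -type f | sort | xargs cksum)
}

manifest build-a > a.manifest
manifest build-b > b.manifest
# Files listed here are the non-reproducible parts of the image.
diff a.manifest b.manifest | grep '^[<>]' | awk '{print $4}' | sort -u > unstable.txt
cat unstable.txt
```

For the developer workflow described above, an empty unstable.txt (modulo the change under test) is the assurance that nothing else leaked into the image.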

Build environments

Yes, here I had meant "the machine on which builds are produced" - in my case that's a jenkins-agent image running in a SmartOS VMware fusion instance on my laptop, with me logged in as 'timf' accessing git repos that are NFS-shared from my local NAS.

I don't want to dictate how a developer chooses to develop their changes, but would like to at least be able to say "if you build on exactly this environment, then that will accurately reflect how we do builds that eventually make it onto production machines." and would like to make it easy for an engineer to replicate that environment.

Then, at build time, we ought to be able to easily determine whether we're building in that environment, or something else.

I totally understand that we have multiple build environments today - so far, I've got four different jenkins-agent VMs that I use, but no clear way to determine at build time which one I should log in to in order to build any given component. The only way I find out that I've chosen the wrong one is when the build blows up.

So even if we don't consolidate to a single build environment, I ought to be able to tell whether the one I'm currently using is correct for this component.

timfoster commented 5 years ago

Now that the 4 phases of the Lullaby 3 project have integrated, I propose to move this RFD to "publish" state. I'd like to add/modify a few sections. These are:

Drafts of those are:

===

Improve the sharing of build infrastructure by using eng.git as a submodule

By making eng.git a submodule of all Manta/Triton repositories, e.g.

<component>/deps/eng/

we establish a place where global build tools and improvements can be made that will be available to all repositories.

Before this RFD, shared Makefiles were simply copied from eng.git into each repository's tools/mk directory, which allowed components the choice of deciding when to update to new versions, but had the drawback that each upgrade was a copy/paste job and could result in build improvements not propagating across repositories.

By using a submodule instead, repositories still have the choice of upgrading, as git submodules are locked to a specific commit until upgraded. For those unfamiliar with git submodules, performing the upgrade of a component to use the latest changes from eng.git is as simple as:

$ git submodule update --remote deps/eng
$ git add deps/eng

Developers should then commit the change as usual. Some deps/eng updates may need to be applied across all repositories that use eng.git as a submodule; other changes may be performed on an as-needed basis for the specific repositories that require new eng.git features or bug fixes.

Before this RFD, we were being conservative about which new eng.git code was used in each component. This RFD proposes we become aggressive about taking all eng.git changes, at least to the granularity of a single git commit, but ideally having all components stay current.

We replace uses of the shared tools/mk/Makefile.* files with deps/eng/tools/mk/Makefile.* and ensure that the eng submodule is always checked out and present in a repository using a macro definition in each component's top-level Makefile:

REQUIRE_ENG := $(shell git submodule update --init deps/eng)

which allows for subsequent include deps/eng/tools/mk/... statements to Just Work.

If developers are modifying files in deps/eng, for example when testing eng.git changes, then uncommitted files will not be overwritten as a result of the $(REQUIRE_ENG) macro. However, any commits made to a local deps/eng repository would be clobbered, so either comment out the REQUIRE_ENG macro, or temporarily point the deps/eng submodule at a local eng.git clone, taking care to revert that change before committing:

$ git config -f .gitmodules --replace-all submodule.deps/eng.url /home/timf/projects/my-eng-clone.git
$ git submodule sync deps/eng
Synchronizing submodule url for 'deps/eng'
$ git submodule update --remote deps/eng
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 3 (delta 2), reused 0 (delta 0)
Unpacking objects: 100% (3/3), done.
From /home/timf/projects/my-eng-clone
   d25b8fc..7a03a85  master     -> origin/master
Submodule path 'deps/eng': checked out '7a03a85352a6035b1f098ff41fb7ff112bda052b'

===

===

Moving components towards using npm v5.1.x involves upgrading the version of node that many components use. From experimentation, the earliest node version in use across Manta/Triton that is capable of running npm 5.1.0 is node v4.6.1.

A side effect of using a package-lock.json file is that we may be able to more effectively cache downloaded npm content and thereby reduce our reliance on the network, though that is not the primary goal.

As an aid to investigating build reproducibility, during publication the build saves the result of npm ls --json, which would allow us to develop automation to determine package differences across releases for components which aren't yet using package-lock.json files.

===

Current status

At the time of writing, the Lullaby 3 project is considered "complete".

We staggered the delivery of the project over four phases, with TOOLS-2043 being the Jira ticket that tracked the overall project progress.

The flag day mails for these phases were:

phase 1: build framework and a few components converted to use it

phase 2: Triton components converted to build framework

phase 3: Manta components converted to build framework

phase 4: smartos-live, sdc-headnode and firmware-tools converted to build framework

While we have not yet reached build reproducibility across all projects, it should continue to be a goal, and new projects should have reproducible builds by default.

timfoster commented 5 years ago

(sorry about the weird formatting there) I plan to push the above changes to the RFD repository on Thurs 23rd May, so please shout if you've any comments/objections before then. Thanks!

timfoster commented 5 years ago

The RFD was marked as 'published' a while back. Early in the RFD, we mention a gist that outlined some of our ideas on build systems. I'm going to inline that below in case the gist gets lost.

=== What I want from a build tool

The following are some of the attributes that I'd want to see from any build system.

Some or all of these may be present in the build system used in Joyent today (which I'm still trying to learn)

These thoughts are not tied to any one development model or build flavour: they apply as much to the builds that a developer does while iterating on a change as they do to the official builds that would happen after an integration or on a nightly or per-release cadence.

We like transparent build processes and dislike "black boxes" where a button is pushed and "magic" happens that developers don't understand.

Easy to learn and use

A build tool should be easy to learn, but also should not prevent users doing things "the long way" ('make all' will always work). However, having a build wrapper helps with:

Anything we do in a build tool should satisfy both "power users" as well as people who've never done a build before. If the build tool gets in the way of anyone doing productive work, then we've failed.

Adherence to a CBE ("common build environment")

A build of any component source code will have requirements about the system it is being built on. Having a well-defined build environment, ideally one common to as many components as possible is important. Making sure we only build on blessed CBEs is vital, as differences in build environments can result in build breakage, or worse, runtime bugs in that component which may not be detected at build-time.

Changes to the CBE need to be carefully managed so as not to introduce breakage in any component that depends on the old behaviour or any component of that older CBE that's not present in the new CBE.

Of course, shrink-to-fit applies: a developer can build on their Mac if the component allows, but nightly/production builds ought to always use the official build environment. Problems introduced due to a developer not building on the CBE prior to putback may change a component's development policies.
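A minimal sketch of what "making sure we only build on blessed CBEs" could look like at the top of a build (the allowlist file and the platform strings are fabricated assumptions; a real check might probe uname -v, the pkgsrc branch, compiler versions, and so on):

```shell
#!/bin/sh
# Sketch: refuse to build unless this machine matches a blessed CBE.
# blessed-cbes.txt and the platform strings below are made up for
# illustration; on SmartOS the probe might be `uname -v`.
cat > blessed-cbes.txt <<'EOF'
joyent_20180807T230146Z
joyent_20181206T011455Z
EOF

# Stand-in for the real platform probe on the build machine.
THIS_CBE=joyent_20181206T011455Z

if grep -qx "$THIS_CBE" blessed-cbes.txt; then
    CBE_OK=yes
else
    CBE_OK=no
    echo "error: $THIS_CBE is not a blessed build environment" >&2
fi
echo "CBE check: $CBE_OK"
```

Failing this check loudly at the very start of the build is also an instance of the fail-fast principle below: the developer learns they are on the wrong jenkins-agent before any work is done, not when the build blows up.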

Fail-fast - when we blow up, do so as close as possible to the crime scene

Digging through phantom build failures in log files to uncover the actual reason for breakage is not acceptable. Builds should blow up early, and make a loud noise when they do so.

Reproducible builds

Building the same source code on the same build machine should result in the same built bits.

Deterministic builds

Related to the above, building the same software twice on a similarly loaded build machine should produce build artifacts in ~= the same amount of time.

Network-local

This is here partly to satisfy the above two requirements, but from experience, build tools that rely on the network being up (whether that's to locate build dependencies, build machines, or deposit build artifacts) are prone to failure.

The network goes down, is slow for remote users, or can host dependencies or build machines that change over time, possibly even during the course of a build.

We should nail down the build environment so that a completely disconnected user can build our software. If that involves them maintaining a local cache of $world, so be it. This tends to also help building in completely isolated lab environments, or when developers are on the road, etc. By adding this requirement, we start to have more control over the CBE and thus more likelihood of always producing the same software.

One could imagine a build machine hosting its own imgapi or manta instance (or even just a simple HTTP server) from which images required for the build are pulled and to which build artifacts are posted, running its own VM with a pre-populated pkgsrc server, etc.

As a remote developer, being network-disconnected is particularly important to me: being able to stand up a full build environment locally, without throwing bits back and forth over the Atlantic link is vital.

But, allow for developer conveniences

The above absolutely does NOT preclude integration with your CI of choice! All of the work to come up with a sane local build helps when you then start running those builds on Jenkins as well - it's just a "local build" running on a remote machine. One could easily imagine a build tool subcommand that submits jobs to Jenkins on your behalf.

If there are other things we can do in a build tool that makes developers lives easier, we should absolutely do that.

For example, in the past for Solaris OS/Net builds, we had a "build pkgserve" command that started a HTTP IPS server so developers could install bits on systems that didn't have NFS-access to the build machine.

Likewise, in the past, we had a phase of the build that constructed ZFS Storage Appliance ISO and upgrade images - tasks that few developers had ever learned to do by hand were now just another (optional) build phase.

One could imagine similar conveniences to invoke APIs to import constructed triton images to a test machine.

Crucially though, the build wrapper is not a CI system in itself - we do not want to replace what Jenkins does perfectly well. However, a well-written build system can make the creation of Jenkins jobs significantly easier, as there's much less logic to implement as part of the Jenkins job.

Versioning - at least include SCM data in build artifacts

Having git changeset information in the build artifacts and logs makes it straightforward to determine what changes are included in a given build. Similarly, including that information in the build logs, along with a dump of the build environment, is very useful when tracking down build failures.
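One possible shape for that, sketched below: a buildstamp file written alongside the artifacts. The example creates a throwaway git repository so it is self-contained; the file name and field layout are assumptions, not an existing convention.

```shell
#!/bin/sh
# Sketch: record SCM state in a buildstamp artifact.
# Uses a temporary throwaway repository so the example stands alone.
WORK=$(mktemp -d)
cd "$WORK"
git init -q .
git config user.email builder@example.com
git config user.name builder
echo hello > hello.txt
git add hello.txt
git commit -qm 'initial'

# What a build would embed alongside its artifacts and logs.
{
    echo "commit:  $(git rev-parse HEAD)"
    echo "branch:  $(git rev-parse --abbrev-ref HEAD)"
    echo "dirty:   $(git status --porcelain | wc -l | tr -d ' ') uncommitted files"
} > buildstamp
cat buildstamp
```

A fuller version would also capture the eng.git submodule commit and the build environment dump mentioned above.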

Easy to read logs

This goes without saying.

Useful notifications

If the build sends notifications, it should do so concisely, showing relevant data from the build to help quickly diagnose errors, or locate build artifacts.

Avoid monolithic builds, allow composition ('make all' is fine if 'all: foo bar baz')

When a build phase fails, having to restart the entire build again is counter-productive. If there are logical build phases, we should allow the user to invoke only that phase (I'm looking at you, nightly.sh!)

At the same time, do not attempt to manage build-phase dependency resolution, that's what make is for, and if it's not obvious that one phase depends on another, that's usually an indicator that the build phases are too granular and ought to be combined.

Do as much work during the build as possible

If the build is capable of catching software problems, it should do so. Whether that's static-analysis of code (lint, coverity, fortify, etc.) or even simple code-style checking. I like to treat the build as the first line of defence for code quality, and problems caught earlier are massively cheaper to fix.

Learn from Lullaby?

Some personal history - I've tackled a problem similar to this before, rewriting the build system used by a few hundred Solaris developers.

Changing the tools that engineers are forced to use on a daily basis can be disruptive, and there was some initial resistance to the idea of change, but I believe we were successful in our goals to simplify the build and make a meaningful difference to the speed at which we were able to develop Solaris. [ talk to robj, mgerdts or jlevon, all of whom got to deal with the changes ]

I hope I can help to improve the lives of developers at Joyent too.

https://timsfoster.wordpress.com/2017/08/10/project-lullaby/ https://timsfoster.wordpress.com/2018/02/23/project-lullaby-build1-log-files/