@RX14 This is awesome! Every point you say is a +1 in comparison to running things in travis, at least for a project this big and with many platforms to support. I'll try to discuss this at Manas next week and see how we can proceed.
In addition, if we have macos install media, it seems we can use https://github.com/boxcutter/macos to create a macos VM, so it looks like we really can test on every triple using jenkins.
Any progress on this issue?
@RX14 We didn't have time to review this yet, some of the team is on vacation at this time of the year (it's summer here ^_^) but this is something that we'll definitely take a look at, and probably switch to, once we discuss it.
The analogy in Ruby is rubyci.org running builds, tests and specs on multiple platforms in the "background", whilst Travis and AppVeyor work in the "foreground" on marker versions, giving "immediate" feedback to pull requests and branches. This works well since edge cases turn up constantly across the different platforms, whilst Travis usually catches the low-hanging fruit. Ruby uses chkbuild - a CI server written in Ruby. Apologies if you know all of this - I'm a contributor to Ruby/Spec and thought I would share. TL;DR: :+1:
I think I would strongly prefer running builds for every platform for every commit. Whether the PR is "okayed" back to github after only the faster builds have completed is a question which I think will have to be answered after it's all set up.
Hey @RX14, so sorry for the delay in the reply. We discussed this internally at Manas, and we agree that having a Jenkins (or equivalent) environment for managing the builds on multiple platforms would be ~~nice~~ awesome to have. If this is something you'd like to work on, and you feel comfortable with Jenkins, then Jenkins it is; we do have some experience with Jenkins and none with buildbot, so it seems like the best way to go.
The first thing to work out would be hosting. We'd prefer to handle as much as possible of the infrastructure in-cloud, having a master Jenkins node running in Amazon (where we host most of our assets), and slaves covering all the required platforms.
So step one is to build a list of all targets, and figure out the best place to run them. I guess that a node in EC2 running qemu (I understand there are no limitations for running qemu on an EC2 machine, right?) would be a good choice for most architectures, though I'm not sure if we can cover all of them this way.
What do you think?
Architectures should be tested on real hardware if possible. I discovered bugs when running Crystal on an AArch64 server (provided by Packet) that didn't happen in QEMU, for example. QEMU is also very, very slow.
Scaleway has cheap ARMv7 servers; Packet has expensive but incredibly powerful ARMv8 servers (Cavium ThunderX × 2).
Same for Alpine: running in a container will be different from running in a VM, because the Alpine kernel is patched (grsecurity, ASLR, ...).
@spalladino absolutely no problem with the delay in reply.
Obviously I'd like to automate as much of the deployment of the master as possible. Docker would be my first choice as I have experience with it (and already made a container). The master node doesn't need to be that powerful: probably only 2GB of RAM and a somewhat decent processor. We can assess whether it needs more resources as the project grows.
The list of targets is here: https://github.com/crystal-lang/crystal/tree/master/src/lib_c - in an ideal world we'll want something for every one of those. I don't want to run any targets in QEMU if possible, but acquiring hardware for every single target in the future is impossible. In hindsight, every non-x86 slave is going to end up being a special snowflake, so there's not much point in attempting QEMU for every slave.
Build slaves don't really need a particularly good connection, so they are fine to run at home if it comes to it. For example, I'd like to run the ARM targets on a raspi as that's realistically the most common device they'll run on. Raspi 1 is ARMv6, raspi 2 is ARMv7 and raspi 3 is ARMv8, and I know that aarch64 distros exist for the raspi 3. Raspis are quite a cheap non-recurring cost and racking solutions for raspis exist, so the outlook for ARM slave hardware looks bright.
x86_64 slaves should be easy to run in the cloud. Even if they need to be virtualized, KVM should ensure that they're running on a real CPU most of the time using VT-x. Architectures which you can get an AMI for can be run directly on EC2. AFAIK EC2 doesn't use containerized virtualisation, so we should have full control over kernel versions etc. for Alpine.
Another interesting question is that of LLVM versions: which LLVM version do we test with? Testing with every LLVM version on every architecture seems like a waste of time. We could pick a random LLVM version for each run (maybe deterministically from the commit hash).
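Something like this rough sketch is what I mean (the version list here is just illustrative):

```crystal
# Deterministically map a commit to one LLVM version, so each build's matrix
# stays small but all versions get coverage across many commits.
LLVM_VERSIONS = %w(3.8 3.9 4.0) # illustrative list

def llvm_version_for(commit_sha : String) : String
  # Use the leading hex digits of the SHA as a stable index into the list.
  index = commit_sha[0, 8].to_u32(16) % LLVM_VERSIONS.size
  LLVM_VERSIONS[index]
end

puts llvm_version_for("a1b2c3d4e5f6a7b8") # always the same version for a given commit
```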
@spalladino This is a topic which would probably benefit from real-time discussion, so don't hesitate to ping me on IRC/gitter if you want to chat. I'll be around all day (after midday gmt) tomorrow.
For LLVM, we should test a few versions. It's usually overkill, but we may break compatibility when supporting a new LLVM version or when the compiler uses more features. So we may:
Note that ARM / AArch64 have best results with 3.9 (maybe 3.8 is enough); older versions lead to crashes in release mode for example.
BTW I couldn't find any AArch64 distributions for Raspberry Pi 3 (only ARMv6 or v7). It can boot in 64-bit mode, but when I searched a few months ago there was no kernel (only preliminary attempts). I'd love to see a distribution, though. Note that Packet was willing to sponsor us with an ARMv8 server, I can ping them back (or send a DM to @packethost on Twitter).
How about only running tests on all LLVM versions nightly? I'm a little wary of build times for PRs slowing down development.
What about nightly tests for all configs, and "live" tests [per PR or per merge-to-xyz branch] only on one or a few select configs?
@RX14 What about primary builds that will report a green state as quickly as possible, then additional builds that will report more feedback? Or maybe have a trigger in commit messages to enable LLVM builds (e.g. matching `/LLVM/i`)? Usually we shouldn't care about LLVM except for particular branches / pull requests (supporting a new LLVM version, using new LLVM features, ...).
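Something along these lines, as a rough sketch (the exact mechanics would depend on the CI):

```crystal
# Rough sketch: only schedule the extra LLVM matrix when the latest commit
# message opts in by mentioning LLVM (case-insensitive).
commit_message = `git log -1 --pretty=%B`
if commit_message =~ /llvm/i
  puts "scheduling additional LLVM matrix builds"
else
  puts "skipping LLVM matrix builds"
end
```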
@ysbaddaden That could work. Trying things out should be quite easy once we have the infrastructure set up.
I agree on having primary builds that can report a green state as quickly as possible. I'm not sure though about how to handle the additional builds: I think I'd start with nightly builds of master (as @RX14 suggested). Later, we could set something up to auto-merge a PR if all builds have passed and a committer has given a thumbs-up, having a quick primary build report an initial ok state.
Anyway, first we need to have the builds running somewhere, so let's go back to the hosting.
I wasn't aware of potential issues when running in QEMU as @ysbaddaden mentioned, so I guess that rules QEMU out (speed would not be much of a problem if we are using them for nightlies or "extra" builds).
Regarding LLVM versions, I'd pick a primary LLVM version for each platform for the primary builds, and then add other combinations (as mentioned by @ysbaddaden) as additional builds.
Am I missing something? What do you think? I'll be in the IRC in a few minutes if you want to follow up there, though I'd rather keep the conversation here (just for the sake of keeping the history more organised).
Maybe branches like [but possibly renamed if desired]:
@drhuffman12 I don't think that changing the git repo setup would be a good idea.
Note that the only reason we keep compatibility with older LLVM versions in the source code is that for our release process we are stuck with LLVM 3.5 for the moment; that should be upgraded to the latest LLVM version (and bumped each time LLVM lands a new version), but that's kind of tricky to do, as far as I know.
So in my opinion, I wouldn't have a matrix of LLVM versions to test against. In fact, once we upgrade the omnibus build to the latest LLVM version I would directly remove Crystal code that deals with older LLVM versions.
Also, old LLVM versions have bugs, so shipping Crystal with support for an older LLVM version means shipping a buggy version of Crystal... so that's another good reason to drop support for older LLVM versions.
Supporting the LLVM stable (3.8) and qualification (3.9) branches simplifies building on many distributions. Alpine and OpenBSD ship LLVM 3.8 for example.
It's not that hard to have compatibility for 2 or 3 LLVM versions; despite the breaking changes, the C API goes through a deprecation release before removal in the following release. The current complexity is that we now support 5 versions (3.5, 3.6, 3.8, 3.9, 4.0pre) which totals many breaking changes...
@spalladino we can still use QEMU, it's very good - I wouldn't have ported Crystal to ARM without it - but it's still emulation, not real hardware, and there may be some quirks that it doesn't exhibit. Maybe not that many, though.
Got it. If setting up real hardware takes longer than expected, we can rely on QEMU in the meantime, then.
@ysbaddaden I think for now we should be able to get real hardware for each build slave (raspi+aws(+packet)), but if we port to more unconventional architectures in the future, I think qemu will be required for those.
For now the first step would be to run a few tests with a master node and a slave node on a rather standard architecture, see whether we want to keep slaves running or use jcloud to start and terminate them on the fly, and check whether using multiple AMIs works for isolating configs (or whether we need to rely on Docker or similar). Chris will kindly be running a few experiments during the weekend and we can pick up from there.
> I couldn't find any AArch64 distributions for Raspberry 3
I stand corrected, it appears openSuse has one: https://en.opensuse.org/HCL:Raspberry_Pi3
@ysbaddaden Thanks for the link! I think that archlinux-arm has an AArch64 distro for raspi 3 too (scroll to bottom): https://archlinuxarm.org/platforms/armv8/broadcom/raspberry-pi-3
Just a note here - CircleCI offers macOS builds for Open Source projects. We should probably ping them when the time comes.
I've done a bit more work on the CI, and got the AWS slaves launching. For example, this build was performed on a temporary build slave provisioned on AWS. The build slaves stick around for a configurable time period before being terminated.
The AWS builds are run on `t2.medium` instances, utilizing a custom AMI built using packer. Currently the AMI only includes LLVM 3.5, so the build fails. Over the weekend I'll install multiple LLVM packages (3.5, 3.6, 3.8, 3.9, 4.0 seem to be available) from the more cutting-edge Debian repositories, and then set up a large matrix build. I also need to figure out multiarch to run 32-bit builds.
Therefore, next steps seem to be:
After that comes researching and setting up build slaves for additional architectures.
I'm now building in parallel with LLVM 3.8, 3.9 and 4.0! See a working build here.
Unfortunately, I've had to upgrade the instances again to `t2.large`, so that they have 8GB RAM. Compiling `all_spec` with only 4GB of RAM seemed to get OOM killed. Travis's `sudo: required` builds seem to have 7.5GB RAM available, and maybe OSX builds are just more efficient.
As for moving the Jenkins master and slaves, I'm ready for that to happen. The Jenkins master needs a box with, say, 2GB of RAM and not much CPU at all. This doesn't have to be on EC2, but if it were I'd recommend a `t2.small`. For the slaves, I just need an AWS user with this IAM role (however a full user account may be useful for debugging).
Again, just ping me on IRC/gitter if you want a chat about any details I may have missed out :)
Awesome, thanks Chris! I'll contact you during the week to get everything set up. Cheers!
@RX14 a few questions:
Again, thanks a lot for all the work with this! It's looking really awesome!!
There are other cloud providers that are much more affordable than AWS EC2. Take a look at https://www.ovh.ie/vps/vps-ssd.xml for instance. The listed pricing is the whole-month price, but they also charge per hour: `monthlyprice * 2 / 720` per hour. It could be worth it.
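As a quick worked example of that formula (the monthly price here is made up):

```crystal
# Hypothetical numbers, just to illustrate the hourly billing formula.
monthly_price = 2.99                    # EUR per month (made up)
hourly_price  = monthly_price * 2 / 720 # the per-hour rate
puts hourly_price.round(4)              # => 0.0083 EUR per hour
```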
They have an API, but this docker-machine plugin could be used: https://github.com/yadutaf/docker-machine-driver-ovh
The limiting factor for running 2 workers per node is RAM while compiling, but compiling is only a small portion of the time spent running the stage. So I want to create a small script which will "lock out" the compiler so that compiles run sequentially, but the spec suites still run in parallel.
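A rough sketch of what that script could look like (lock path and make target are just placeholders):

```crystal
# Serialize the memory-hungry compile step across workers on one node with an
# exclusive file lock; running the resulting spec binaries needs no lock.
def with_compile_lock(lock_path = "/tmp/crystal-ci-compile.lock")
  File.open(lock_path, "w") do |file|
    file.flock_exclusive do # blocks until no other worker holds the lock
      yield
    end
  end
end

with_compile_lock do
  status = Process.run("make", ["all_spec"], output: STDOUT, error: STDERR)
  exit status.exit_code unless status.success?
end
```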
Hope that answers your questions. I'll be on irc after about 5pm gmt.
That's great, thanks Chris. Let me sync with the rest of the team tomorrow (yesterday and today are holidays here in Argentina) and I'll get back to you. In the meantime, could you mail your public SSH key to spalladino at manas dot tech, if possible signed with your Keybase key?
@lbguilherme we could look into other cloud providers in the future, however ec2 is a good start because it has an existing mature Jenkins plugin so it's quite easy to set up. The actual machine provisioning script only assumes a basic debian 8 base, so could be used to create images on numerous clouds. A cheap and simple option for decreasing some load on the Jenkins would be to run builds on containers on developers machines (thinking more desktop than laptop though). Although that route is likely to cause more pain on the communication and operations side than it saves on the money side.
@spalladino If you have the keybase app (kbfs) installed, my current SSH public key is at `/keybase/public/rx14/id_rsa.pub`. If not, I'll send you an email.
@RX14 I refuse to start this message without thanking you and clapping out loud for your work, so here it is: 👏👏👏👏👏👏👏👏👏👏👏👏👏
Now, back to business :)
Do you know if it's possible/difficult to hook different providers side by side with EC2?
Last week I tested CircleCI for OSX builds. It's a 14-day trial, but it seemed to work really nicely. More specifically, it didn't have any queue time when building - and that's what's killing us in Travis.
We've been discussing with @spalladino the possibility of running the per-commit builds using Travis and Circle (Linux x86_64 with LLVM 3.5 & 3.9 on Travis, OSX with LLVM 3.9 on Circle), and maybe running nightlies on EC2. By splitting the workload between the providers we get quick feedback - Travis doesn't usually queue the Linux builds, and neither does Circle for OS X - and we get lighter bills from EC2.
But, of course, this depends on a couple of things - mainly on the feasibility of hooking your Jenkins setup up with Circle/Travis, and a little bit on Circle's kindness in giving us some nice free plan.
We're working on the monetary side of this - we'll get back to you soon, I hope. If you can give us some thoughts about using Travis/Circle mixed with EC2, that'd be awesome :)
Thanks, once again!
Regarding running 2 workers per slave, I've found that even just running a compile and a spec process at the same time can run out of memory. In addition, I was getting some weird errors (1) (2), even when setting `CRYSTAL_CACHE_DIR`.
I'm working on a reply with my thoughts on what you said, @matiasgarciaisaia, but CircleCI/Jenkins integration will be minimal, if possible at all. Can you expand on what sort of "hooking up" you would want between Circle and Jenkins?
One major thought is that the vast majority of builds will either fail on all platforms or on none, so having a quick build script which simply runs `make std_spec crystal`, tests the examples, and checks formatting would be great for the 99% case. The question is then whether we should allow the 1% type of commit onto master and catch it in the nightly, or test every commit on an expanded set of architectures. Maybe we should "opt in" some PRs which touch a lot of architecture stuff to run every commit on a large matrix. However, with such a system we gain a lot of complexity.
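A rough sketch of that quick script (the exact commands and targets are illustrative):

```crystal
# Quick smoketest for the 99% case: build std specs and the compiler,
# run the samples, and check formatting.
SMOKE_STEPS = [
  {"make", ["std_spec", "crystal"]},
  {"make", ["samples"]},
  {"bin/crystal", ["tool", "format", "--check"]},
]

SMOKE_STEPS.each do |cmd, args|
  status = Process.run(cmd, args, output: STDOUT, error: STDERR)
  abort "smoketest failed at: #{cmd} #{args.join(" ")}", status.exit_code unless status.success?
end
```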
Just some thoughts.
We were thinking about Jenkins not performing the Linux x86_64, Linux i386, and OSX builds itself, but consuming their OK/FAIL status from TravisCI/CircleCI jobs like the ones currently running. So Jenkins runs the jobs for the other platforms, and provides something like a dashboard to see how the builds are going, but delegates the actual build to third parties for a couple of platforms.
We also want Jenkins because we should eventually plug in some (physical) RasPis and the like to build on those platforms too, so Jenkins would give us the flexibility of having whichever slaves we want, while we keep outsourcing the hardware requirements for platforms that can easily be outsourced.
But my 30-second Google-search suggests there's no such thing as a Jenkins/Travis connector or whatever I was dreaming of :/
> The question is then whether we should allow the 1% type of commit onto master and catch it in the nightly, or test every commit on an expanded set of architectures. Maybe we should "opt in" some PRs which touch a lot of architecture stuff to run every commit on a large matrix. However with such a system we gain a lot of complexity.
How about looking for a keyword in the commit message(s) to trigger a full matrix build, as is done with `[ci-skip]`? So, if a commit is particularly sensitive, we can add a `[fullbuild]` (or a better name) to the commit message. If we miss it, it would still be caught as part of the nightly.
@spalladino I think it's better done using github comments by team members. That used to be how Jenkins approved building PRs before 2.0 and pipelines. It doesn't seem to be possible now in the pipeline plugin. Also, encoding metadata in commit data (especially the title) is quite ugly.
Absolutely, I agree that github comments are much better; I was aiming at commit messages as I expect them to be easier to implement.
I think the best thing to do is to use CircleCI or Travis for an "initial smoketest" which runs on every single PR. This initial smoketest would be as simple as `make std_spec crystal`, followed by testing the samples and `crystal tool format --check`.
A small bot can then be written (in Crystal!) which looks at GitHub issue comments and schedules Jenkins builds using the Jenkins API. Jenkins builds would run a full matrix validation suite. I'd argue that we should run full validation on every commit to master as well. Using a custom bot has the advantage of somewhat decoupling the CI interface from the implementation, so we can possibly make the CI more fine-grained (run just windows/osx/mac) in the future. This kind of setup seems very similar to what Swift has.
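A very rough sketch of such a bot (the trigger phrase, env vars, and job name are made up, and the team-membership check is omitted for brevity):

```crystal
require "http/client"
require "json"
require "uri"

JENKINS_URL = "https://jenkins.crystal-lang.org" # assumed Jenkins master
TRIGGER     = "@crystal-ci run full matrix"      # hypothetical trigger phrase

# Check whether any comment on the PR/issue asks for a full matrix build.
def full_build_requested?(repo : String, issue : Int32) : Bool
  response = HTTP::Client.get("https://api.github.com/repos/#{repo}/issues/#{issue}/comments")
  JSON.parse(response.body).as_a.any? do |comment|
    comment["body"].as_s.includes?(TRIGGER)
  end
end

# Trigger a Jenkins job through its remote API (POST /job/<name>/build).
def schedule_jenkins_build(job : String)
  client = HTTP::Client.new(URI.parse(JENKINS_URL))
  client.basic_auth(ENV["JENKINS_USER"], ENV["JENKINS_TOKEN"])
  client.post("/job/#{job}/build")
end

if full_build_requested?("crystal-lang/crystal", 3721)
  schedule_jenkins_build("crystal-full-matrix") # hypothetical job name
end
```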
But first, we should get something simple and useful working, I suggest this: build every push to master, on several LLVM versions, using EC2. From there we can evaluate the cost of running on EC2 and what optimisations are worth the effort. I do somewhat feel we're getting lost in the details and optimizations instead of getting something working...
Ok, I've set up the Jenkins master on the new server, and written some reproducible instructions on how to set it up here: https://github.com/RX14/crystal-jenkins/tree/master/master. The Jenkins is now live at https://jenkins.crystal-lang.org/. Next steps: get EC2 credentials and configure Jenkins with them (and document), and set up the actual job (and document).
I'll try to focus on good documentation for this infrastructure, because I want it to be well understood even if I'm busy. If you have any questions or suggestions, don't hesitate to ask me to improve the documentation. Also, do you think moving RX14/crystal-jenkins to the crystal-lang organization would be a good idea?
@ysbaddaden - there is also a new ARMv8 Debian build for Raspberry Pi 3 at https://blog.hypriot.com/post/building-a-64bit-docker-os-for-rpi3/ which you might find of interest.
I'm working with Packet on getting their ARM infrastructure together for CI builds, and will bring this issue to the attention of the team here that's doing this work.
Build is set up on the new infrastructure and (somewhat) working: https://jenkins.crystal-lang.org/blue/organizations/jenkins/crystal/detail/feature%2Fjenkinsfile/1/pipeline
Next steps include fixing https://github.com/crystal-lang/crystal/issues/4089 and merging the completed Jenkinsfile (I'll PR it soon). Once this is done we can start running builds and setting status checks on master and PRs.
:)
I've very nearly got 32-bit support working on the CI, but I've hit this problem: https://jenkins.crystal-lang.org/job/crystal/job/feature%252Fjenkinsfile/11/execution/node/13/log/. It appears to me that my `-m32` link flags aren't being passed to `cc` on macro runs. That's typically exactly what you want when cross-compiling, but as my `libcrystal.a` has been compiled using `-m32`, this simply doesn't work. It looks like we need another kind of link flags which get passed to every `cc` invocation. @asterite what are your thoughts?
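(For context, a "macro run" compiles and executes a helper program at compile time, which is why it gets its own `cc` link step - something along the lines of this hypothetical example:)

```crystal
# The compiler builds ./scripts/generate_constants.cr at compile time, runs
# it, and pastes its output into the program - so that helper program also
# has to link correctly on the 32-bit slave. (File path is hypothetical.)
{{ run("./scripts/generate_constants.cr") }}
```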
Hey @RX14!
I think I agree with you, at least in this case - the link flags should get forwarded to the macro run. I'm not sure, however, if there's any scenario in which you want those flags to not be passed to the macro run.
I'm still trying to reproduce this issue - I have to set the environment up and whatnot - because I can't see it from the code. The only mentions of `CC` or `cc` are in `compiler.cr`, and they all include the `@link_flags`.
But please do file the issue, so we can track it down :)
@RX14 Something I don't get: in the current CI there is a 32-bit environment and the specs are passing there. Why would a change in the infrastructure trigger the need to pass the link flags? I'm not discussing whether it should or not (I'm not fully convinced), but I wouldn't expect it to be an issue, since there is already a 32-bit environment running in Travis. What has changed?
Currently, crystal uses Travis CI for continuous integration. This works well, but has some limitations. Travis currently allows us to test on our major architectures: linux 64 and 32 bit, and macos. However, in the past year we have gained ARM support in 32 and 64 bit, as well as support for freebsd/openbsd. These architectures would be difficult to test using travis. Without continuous integration on a target triple, that triple is essentially unsupported and could break at any time. In addition, travis lacks the ability to do automated releases. This makes the release process more error-prone and precludes the ability to do nightly releases.
I have been working on setting up Jenkins as a replacement for travis. Jenkins is a much more flexible system, as it allows connecting your own nodes with their own customised build environment. For example we could test crystal on an actual raspberry pi for every commit. We could also schedule jobs to create nightly builds, and authorised users on the web interface could kick off an automated release process.
Currently I have a test Jenkins instance running at https://crystal-ci.rx14.co.uk/, here is a (nearly) passing build. Jenkins builds can be configured by a Jenkinsfile in the repository, like Travis. Here's the one I made for Crystal. I've documented the setup for the master and slave instances here. Currently I'm thinking of running every slave in qemu/kvm on an x86_64 host for consistency between slaves. Automating slave installs using packer seems trivial.
There are quite a few different options for jenkins slaves however. It's possible to create jenkins slaves on the fly by integrating with different cloud providers. This has the added benefit of the environment being completely from scratch on every build. It also may be cheaper, depending on build length, commit frequency, and hardware constraints. It might also be wise to mix this and the previous approaches, for example using some raspberry pis for arm, a long-running VM for openbsd, and google compute engine for the x86 linux targets (musl in docker?).
Rust seems to use buildbot instead of Jenkins, but Jenkins has really surpassed buildbot in the last year in terms of being a modern tool suitable for non-Java builds (released 2.0, added the Jenkinsfile, seamless GitHub integration like Travis). I also have 3-4 years of experience working with Jenkins, but have never worked with buildbot before.
The problem I have in proceeding is that I don't know the options and preferences @asterite and Manas have in terms of infrastructure and how they would like this set up, before I sink too much time into creating qemu VM images to run on a fat VM host.
TL;DR: CI on every target triple? Nightly builds? Yay! Now how do I proceed?