adoptium / infrastructure

This repo contains all information about machine maintenance.
Apache License 2.0

Discussion: Should we, and can we, change our machine provisioning model to be an on-demand one? #199

Open geraintwjones opened 6 years ago

geraintwjones commented 6 years ago

I'd like to have a conversation about changing the way that tests and builds get hold of machines.

Right now we have a static list of machines, each assigned as build, jck or test, and jobs kicked off from Jenkins are routed to the right "kind" of machine (platform, o/s, function, etc).

These machines, as far as I'm aware, are created manually from the various clouds to which our sponsors have granted us access.

These machines often sit idle when no jobs are running.

Would it not be a better use of resources to change our machine provisioning model to one where the Jenkins jobs request the machines they need on demand? Is this even possible? Is it something the community wants?

I'd like to bring this up on the infrastructure hangout on Tuesday 13th February. If you would like to help move this conversation forward, please add comments here and/or join the hangout on Tuesday. I think @gdams posts the hangout URL on the #infrastructure Slack channel.

karianna commented 6 years ago

I'm OK with this in principle. However, each type of build does need to run on an identical host for consistency and auditability.

geraintwjones commented 6 years ago

Thanks @karianna. Naturally, before changing anything, we'll need to (a) make sure it's something the community wants/needs and (b) figure out how to ensure that tests, builds, etc get exactly the platform they need.

karianna commented 6 years ago

FWIW, I think it's the right approach; it would utilise the farm far more effectively. Our needs for the farm will also be expanding rapidly with the 6 month release cycle + mobile + OpenJFX wanting to come on board.

geraintwjones commented 6 years ago

We know it is possible to get Jenkins to pull a machine out of an OpenStack cloud. The challenge will be how to get Jenkins to (a) decide which of the donated cloud(s) satisfy a job's requirements, (b) pick one of those clouds, (c) pull a machine out of the selected cloud. We'd need to understand what APIs the donated clouds support.
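As a rough illustration of steps (a) to (c), here is a minimal Python sketch. Everything in it is hypothetical: the `Cloud` and `JobRequirement` types, the cloud names, the "most free slots" policy and the `provision` return value are assumptions made for the example, not an existing Jenkins or OpenStack API. A real implementation would replace `provision` with calls to whatever API each donated cloud supports.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Cloud:
    """A donated cloud (hypothetical model for this sketch)."""
    name: str
    platforms: frozenset  # platforms the cloud can supply, e.g. {"x86-64_linux"}
    free_slots: int       # machines we could still request from it


@dataclass(frozen=True)
class JobRequirement:
    """What a Jenkins job needs (hypothetical model for this sketch)."""
    platform: str
    role: str  # "build", "test" or "jck"


def eligible_clouds(job: JobRequirement, clouds: List[Cloud]) -> List[Cloud]:
    """Step (a): clouds that offer the job's platform and have spare capacity."""
    return [c for c in clouds if job.platform in c.platforms and c.free_slots > 0]


def pick_cloud(candidates: List[Cloud]) -> Optional[Cloud]:
    """Step (b): one possible policy, prefer the cloud with the most free slots."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.free_slots)


def provision(job: JobRequirement, clouds: List[Cloud]) -> Optional[str]:
    """Step (c): stand-in for the real provisioning call on the chosen cloud.

    Returns a descriptive string instead of a machine, or None if no cloud fits.
    """
    chosen = pick_cloud(eligible_clouds(job, clouds))
    if chosen is None:
        return None
    return f"{chosen.name}:{job.platform}:{job.role}"


# Example with made-up clouds.
clouds = [
    Cloud("cloud-a", frozenset({"x86-64_linux"}), 2),
    Cloud("cloud-b", frozenset({"x86-64_linux", "aarch64_linux"}), 5),
    Cloud("cloud-c", frozenset({"aarch64_linux"}), 0),
]
print(provision(JobRequirement("x86-64_linux", "test"), clouds))
```

The selection policy in step (b) is the part most likely to need discussion; sponsor quotas or cost could just as easily drive it.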

smlambert commented 6 years ago

Part 1: make sure we are using the statically allocated machines most effectively.

Part 2: determine what machine needs we will have in the next 1-2 years, and decide whether a dynamic story is warranted and what requirements we would have of that system.

This discussion will benefit from some metrics gathering.

1) Machine utilization of the current set of statically allocated machines. Does nagios or the Jenkins monitoring plugin (https://ci.adoptopenjdk.net/monitoring) tell us anything useful? Based on that info, we can answer some questions:

2) Average build/test execution times per platform / version / implementation:
X number of builds on X number of platforms, and the average execution time of each
X number of test builds on X number of platforms, and the execution time of each

For example, looking at a single platform (x86-64_linux):

Tests currently enabled on the x86-64_linux platform (spanning hotspot/openj9/sap implementations, and Java8/Java9/Java10 versions), excluding JCK tests for now as they are on a different network:

7 openjdk regression test builds x ~4hrs each = 28hrs daily execution time
5 system/load test builds x ~5hrs each = 35hrs daily execution time <--- add more variations if more capacity (expect execution time to increase 10- or 20-fold if we do heavy stress/load testing)
2 external test builds x 1hr each = 2hrs daily execution time <--- will increase as we add more 3rd party app testing (PRs imminent, target +1hr per new app added)
2 perf test builds x 1hr each = 2hrs daily execution time <--- will increase as more benchmarks added

So we currently have 65+ hours of daily testing on x86-64_linux machines, with much more ready to enable (mainly stress/load tests) if we have more capacity.

For a particular build of 1 version&&implementation&&platform (example Java8 hotspot on x86-64_linux), compile/build time is ~1hr.

In general, for every 1 hr of build, we have 11+hrs of testing. We have 27 build machines and only 12 test machines (counting the 4 new ones added this past week).
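To make the imbalance concrete, the figures above can be plugged into a few lines of Python. This is only a back-of-the-envelope sketch: it takes the stated daily totals per test category at face value, and it assumes the "11 hrs of test per 1 hr of build" ratio holds across the whole farm, which is a simplification.

```python
# Daily test execution hours on x86-64_linux, using the per-category
# totals as stated in this comment.
daily_test_hours = {
    "openjdk regression": 28,
    "system/load": 35,
    "external": 2,
    "perf": 2,
}
total_test_hours = sum(daily_test_hours.values())

# Farm-wide machine counts and the build-to-test workload ratio quoted above.
build_machines = 27
test_machines = 12
test_hours_per_build_hour = 11

# Work per test machine relative to work per build machine, ASSUMING the
# 11:1 workload ratio applies farm-wide (an assumption of this sketch).
relative_load = test_hours_per_build_hour * build_machines / test_machines

print(f"daily test hours on x86-64_linux: {total_test_hours}")
print(f"load per test machine vs per build machine: {relative_load:.2f}x")
```

Under those assumptions each test machine carries roughly 25 times the work of a build machine, which is the gap that any re-labelling or on-demand scheme would be trying to close.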

3) What are the upcoming 2018 plans for build and test (what new builds and testing will be enabled)? @karianna mentioned a few new builds (mobile/OpenJFX) for 2018. I'd like to add more stress/load testing, perf, and more 3rd party app testing as a start. I've been holding off on enabling these, as test machines are typically busy (though I see idle times on the weekend, so will look to put some items on as weekly scheduled tests).

gdams commented 6 years ago

So my only objection to this is that I don't like having both build and test dependencies on the same machine. I am already unhappy with the way that machines are provisioned, because we are installing both sets of dependencies on all machines, and I feel that we need to split this. I am aware that the test team has a lot of machine requests open right now and I am trying to tackle them as quickly as possible so that we aren't blocking anyone from testing our binaries.

karianna commented 6 years ago

FYI - the immediate requests coming in are OpenJFX, the amber forest of OpenJDK (which holds certain Java 11+ features) and then Java 11 itself. I'd like to complete our existing coverage of 8, 9, and 10 though.

mleipe commented 6 years ago

On Friday @smlambert and I discussed making all but one machine on every platform available for both build and test, with the "but one" machine dedicated to build. Our thinking was that this would provide more machines for test without starving build.

Thoughts?
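The proposal above could be sketched roughly as follows. The machine names and the rule for choosing the dedicated build machine (first in sorted order) are invented for illustration; in practice the choice would presumably be made by hand per platform.

```python
from typing import Dict, List, Set


def assign_roles(machines_by_platform: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Reserve one machine per platform for build only; share the rest.

    The reserved machine is simply the first name in sorted order, which is
    an arbitrary choice made for this sketch.
    """
    roles: Dict[str, Set[str]] = {}
    for machines in machines_by_platform.values():
        for index, machine in enumerate(sorted(machines)):
            roles[machine] = {"build"} if index == 0 else {"build", "test"}
    return roles


# Hypothetical inventory.
farm = {
    "x86-64_linux": ["linux-x64-1", "linux-x64-2", "linux-x64-3"],
    "ppc64le_linux": ["linux-ppc64le-1"],
}
for machine, machine_roles in sorted(assign_roles(farm).items()):
    print(machine, sorted(machine_roles))
```

Note that a platform with a single machine ends up with no test capacity at all under this rule, so such platforms would need special handling.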

karianna commented 6 years ago

My concern would be that the requirements for test would conflict with or pollute the requirements for build (native lib versions, etc). If that turned out not to be the case then I'd cautiously say that's OK.

smlambert commented 6 years ago

I should note that downstream consumers of the AdoptOpenJDK infra/releng/test scripts/playbooks are configuring machines for build and test (to best utilize their machine farms) and have not yet encountered any 'pollution' issues.

As in, internally and at the Eclipse OpenJ9 project, we are running tests on the same machines as we build on, with no adverse effects.

I am going to hazard a guess that other potential consumers of the AdoptOpenJDK infra/releng/test collateral (other companies we bring on board) will also want to use their machine farms most effectively by using machines for both build and test. But know that it's 'tuneable': configure machines to do both, or set their labels to be only ci.role.build or ci.role.test if you want them used for separate roles.
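To show how the role labels make this tuneable, here is a small Python sketch of label-based matching. Only the label names ci.role.build and ci.role.test come from the comment above; the machine names and the `machines_for_role` helper are hypothetical, and this is a model of the idea rather than how Jenkins itself evaluates label expressions.

```python
from typing import Dict, List, Set


def machines_for_role(role: str, machine_labels: Dict[str, Set[str]]) -> List[str]:
    """Return the machines whose labels include ci.role.<role>, sorted by name."""
    wanted = f"ci.role.{role}"
    return sorted(name for name, labels in machine_labels.items() if wanted in labels)


# Hypothetical farm: one shared machine, one build-only, one test-only.
machine_labels = {
    "shared-1": {"ci.role.build", "ci.role.test"},
    "build-only-1": {"ci.role.build"},
    "test-only-1": {"ci.role.test"},
}

print(machines_for_role("build", machine_labels))
print(machines_for_role("test", machine_labels))
```

Moving a machine between shared and dedicated use is then just a label change, with no change to how it is provisioned.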