FluidityProject / fluidity

Fluidity
http://fluidity-project.org

CI needs upgrading #200

Closed - tmbgreaves closed this issue 6 years ago

tmbgreaves commented 6 years ago

Buildbot 0.8.x isn't playing happily with CentOS 7 any more, and if we want to carry on using buildbot the config needs a rewrite to the 1.x syntax. Since most of the other projects we work with have moved over to using Jenkins, and Jenkins integrates far better with GitHub, I think attempting a migration to Jenkins is probably the best way forward.

I would welcome comments and feedback on this!

jrper commented 6 years ago

Definitely no objections if that's the easiest path for you. How much of the UI sugar that Jenkins offers are you considering exposing? Looking at what firedrake currently offers, there are things I'd request, e.g. around the test logging, although I realise everything has security implications.

tmbgreaves commented 6 years ago

I think it probably is - I've done three Jenkins deployments recently and not looked much at buildbot for the last six months, so I'm much more up to speed with Jenkins. I was planning to make this as bells-and-whistles as possible, so I'm open to anything people would like included.

stephankramer commented 6 years ago

Sounds like the best way forward indeed. I'd also much prefer to stick with something we (and in particular you :-)) are already familiar with, if it is up to the job.

tmbgreaves commented 6 years ago

Making progress here: https://jenkins.ese.ic.ac.uk:8000/blue/organizations/jenkins/Fluidity/activity

I think this is probably not a bad point to move the information in the docker repo to within Fluidity and simplify the builds a bit.

jrper commented 6 years ago

When you reach the stage where it's building, if you're willing to add python-junit.xml (not the almost identically named python-junitxml) to the packages being installed, testharness should spit out JUnit-compatible .xml files summarising the results of the tests.

tmbgreaves commented 6 years ago

Updating on this - I think the Jenkins setup is mostly done now; I'm running through some tidying up and extending it to cover the set of builds we're interested in.

jrper commented 6 years ago

Is now a good time for feedback/opinions/feature requests on the setup, or do you still have tweaks you know you need to do?

tmbgreaves commented 6 years ago

Now is a great time :-) I'm working on speeding it up a bit - things like starting the mediumtests at the same time as the unit and short tests. I'm also adding a parallel test stream for 'make makefiles', packaging, etc., and need to add the OpenMP and debug builders. Total time at the moment looks about the same as buildbot was giving.

jrper commented 6 years ago

I think my points are all UI/UX related, and several of them would need consensus.

  1. Given our previous pattern on buildbot, we probably want runs where tests fail to be marked more clearly as not successful. Possibly even as failed.
  2. If it fits in with the structure of a parallel pipeline build, it would be nice to separate setup, configuration, building and testing into distinct Jenkins stages, so that the points of failure are more obvious on the cross tables (see the sketch after this list).
  3. If enough disk space is available, it would be nice to automatically artifact out failing test directories; if not, could we have a readily downloadable version of the individual stage logs?
  4. If we're putting Jenkinsfiles into the git repository, it might be good to have a single user, local system setup as well, to trivialise the process of people rolling their own projects.
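
For illustration, point 2 could look something like the declarative pipeline skeleton below; the stage names and make targets are placeholders rather than the actual setup:

    pipeline {
        agent any
        stages {
            stage('Configure') {
                steps { sh './configure' }
            }
            stage('Build') {
                steps { sh 'make' }
            }
            stage('Unit tests') {
                steps { sh 'make unittest' }
            }
            stage('Short tests') {
                steps { sh 'make test' }
            }
        }
    }
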
stephankramer commented 6 years ago

@jrper: Could you clarify no. 1? What do you mean by "not successful vs. failed" and how do you want them marked? Agree with the other points, in particular no. 2.

@tmbgreaves : for speeding up, I think it might really be worth it to improve the dumb scheduling testharness does with parallel tests. See my comments on #182. This might also improve reliability as I suspect some intermittent failures may be due to being unlucky in the sense of testharness attempting to run all the most beefy parallel jobs at the same time.

jrper commented 6 years ago

Jenkins has a fairly fuzzy, but extensible model of "code health", which is connected to, but not exactly the same as the build result. If we want the buildbot behaviour back, then Jenkins can be made to understand the test results, and mark a build in which one test failed as failed itself. Currently, it's marking any build in which the tests complete as successful, regardless of test failures.

tmbgreaves commented 6 years ago

Am I interpreting correctly that the way to deal with test failures being passed back up to Jenkins is to call testharness with --exit-failure-count?

tmbgreaves commented 6 years ago

Re. splitting up into stages - I agree with @jrper and am figuring out the neatest way to split up the config/build/test.

tmbgreaves commented 6 years ago

There is also the question of how useful or not it is to rebuild the base container from scratch on a per-build basis. It's a fairly high cost thing to do which is why I'd extracted it from the routine build on buildbot; I'm tempted to just have one sitting on dockerhub updated by cron on the jenkins master rather than wrapping it into Jenkins.

jrper commented 6 years ago

As far as I remember, on the last iteration we made it so that --exit-failure-count means the testharness return code is the number of test failures. That gives coupling that's coarse-grained, but enough to fail fast on the tests. If we want finer-grained coupling, you'd need to install python-junit.xml inside the containers and the JUnit plugin on the Jenkins host, copy test_results.xml and test_results_medium.xml back to the host at the end of the test script, and include a

post {
    always {
        junit 'test_results.xml'
    }
}

in the Jenkinsfile.

stephankramer commented 6 years ago

Ah OK, thanks, that clarifies things - yes, I agree any single test failure should be marked as an overall failure, and yes, if Jenkins decides that based on a nonzero exit code, then --exit-failure-count should work.
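
As a minimal sketch of that behaviour, assuming testharness is invoked directly from a Jenkinsfile stage (the script path and working directory here are guesses, only --exit-failure-count is taken from the discussion above), any nonzero exit code from the sh step fails the stage:

    stage('Short tests') {
        steps {
            // --exit-failure-count makes the testharness exit code equal the number
            // of failed tests, so a single failure fails this step and the stage
            sh 'python tools/testharness.py --exit-failure-count'
        }
    }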

Is rebuilding the container the bit where it installs required packages starting from a bare image? If so then, yes, I don't see a need to redo that every time, as long as fluidity itself gets rebuilt. Then, if the container is rebuilt (triggered by cron), maybe it's a good idea to have that trigger a master rebuild with the new container - and also to have a way of identifying, in the output of any build, which version of the container was used for it.

jrper commented 6 years ago

Regarding rebuilding the containers, it's definitely fair to only update periodically. One idea that springs to mind is making the rebuild a separate (minimally wrapped, non-pipeline) Jenkins project, rather than a pure cron task. That way it might be easier to trigger by hand on the odd occasion that it's needed and @tmbgreaves is indisposed.
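
Purely as an illustration of that idea, the rebuild could be a small pipeline job of its own, triggered nightly by cron but also runnable by hand from the Jenkins UI; the image name and Dockerfile location below are assumptions, and the final step is one way to address @stephankramer's point about identifying the container version:

    pipeline {
        agent any
        triggers { cron('H 3 * * *') }
        stages {
            stage('Rebuild base image') {
                steps {
                    // --pull refreshes the underlying distribution image each time
                    sh 'docker build --pull -t fluidity/baseimages:xenial docker/'
                    sh 'docker push fluidity/baseimages:xenial'
                    // log the image id so builds can be matched to the container they used
                    sh 'docker inspect --format "{{.Id}}" fluidity/baseimages:xenial'
                }
            }
        }
    }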

tmbgreaves commented 6 years ago

Missing a trick here - pulling the base image and then updating it will, I think, both keep us up to date and make the build a lot more efficient.

tmbgreaves commented 6 years ago

Brain dumping into the issue here to make sure I don't forget - I need to ensure that if the pipeline repeatedly forks and rejoins, workspaces stay separate and consistently attached to the right OS for multiple-OS builds.

tmbgreaves commented 6 years ago

Additional note on this - trying to get internally parallel builds on top of parallel OS builds is causing all kinds of problems; it might be easier to have multiple Jenkins projects, one for each OS?

... which looks to be possible, with some hackery, to do from a single Jenkinsfile ...
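
One hedged sketch of driving per-OS projects from a single Jenkinsfile: each Jenkins project supplies a different value of a hypothetical OS_IMAGE parameter, and the pipeline picks its container from it (the image naming is an assumption, and parameters only take effect from the second run of a project onwards):

    pipeline {
        agent {
            docker { image "fluidity/baseimages:${params.OS_IMAGE}" }
        }
        parameters {
            string(name: 'OS_IMAGE', defaultValue: 'xenial', description: 'base container tag')
        }
        stages {
            stage('Build') {
                steps { sh 'make' }
            }
        }
    }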

tmbgreaves commented 6 years ago

There is a completed run on xenial now.

jrper commented 6 years ago

Do you want eyes on the test failures it's thrown up, or are they known issues with the environment?

Regarding the log files, it looks like adding OMPI_MCA_btl=^openib to the shell environment will stop OpenMPI trying to use InfiniBand when it can't. If anyone knows a good way to silence the warnings due to the length of /proc/mounts in docker, I'd like to hear it.
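
In Jenkinsfile terms that could simply be an environment directive, so every sh step inherits it:

    environment {
        // stop OpenMPI probing for InfiniBand hardware the containers don't have
        OMPI_MCA_btl = '^openib'
    }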

jrper commented 6 years ago

Also, for the convenience of those watching part time, the new link is https://jenkins.ese.ic.ac.uk:8000/blue/organizations/jenkins/Fluidity%20on%20Xenial/activity

stephankramer commented 6 years ago

Just a little note: make -j8 test and make -j8 mediumtest are probably not doing what you expect. The -j8 only enables parallelism for targets as make sees them, and from its perspective there is only a single testharness call to be executed. If you want multiple tests to run simultaneously you need make THREADS=8 test, but that does come with the testharness-scheduling-is-dumb problem: it will run 8 tests simultaneously regardless of how many cores each one of them uses.
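
If concurrent test runs are still wanted despite that caveat, one place THREADS could plug in is a parallel stage; the stage names and thread counts below are illustrative:

    stage('Tests') {
        parallel {
            stage('Short') {
                // THREADS is passed through to testharness, not to make itself
                steps { sh 'make THREADS=8 test' }
            }
            stage('Medium') {
                steps { sh 'make THREADS=8 mediumtest' }
            }
        }
    }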

jrper commented 6 years ago

Regarding point 3) in my wishlist above, and as a point of negotiation, in 7ab620a65336a632f6a2989dc84d016e0fc7cfc5 I've added a commit which updates testharness with a new option, -a or --archive-failures, which takes a filename and writes the directories of failing tests to a gzipped tarball with that name. It also modifies the root makefile, adding a new variable TESTHARNESS_OPTIONS to expose this.

Usage would be

make TESTHARNESS_OPTIONS="-a short_test_failures.tar.gz" test
make TESTHARNESS_OPTIONS="-a medium_test_failures.tar.gz" mediumtest

post {
    always {
        archiveArtifacts '*test_failures.tar.gz'
    }
}

I believe there is finer scale control available over how much and how long things get kept for, but I'm not up to full speed with pipeline builds.
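
One plausible way to bound retention, if disk space becomes an issue, is the options directive with a build discarder; the numbers below are arbitrary placeholders:

    options {
        // keep logs for the last 25 builds but artifacts for only the last 5
        buildDiscarder(logRotator(numToKeepStr: '25', artifactNumToKeepStr: '5'))
    }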

drhodrid commented 6 years ago

Just jumping on the wish-list bandwagon here… The first is more of a testharness request, but as these seem to have been thrown into this issue, I thought why not…

Request 1: Currently, test-harness assumes mpiexec as the mpi launcher. This is fine for most systems (it clearly works on the systems at IC and local machines). However, a number of systems (including those here in Australia - for example magnus - https://www.pawsey.org.au/our-systems/magnus-technical-specifications/) use aprun or srun (there are probably others too). Would it be possible to update testharness such that a user-defined launcher could be used (with mpiexec being the default)?

Request 2: Would it be possible to add builds on the Australian supercomputers (both Raijin, which is in Canberra, and Magnus, which is in Perth) to the Jenkins manifesto? I think that would be quite useful in the long term, given the growing user-base. Happy to liaise with the relevant people at this end to get that sorted - I just need to be told what is required at the Jenkins end!

Best wishes,

Rhod


jrper commented 6 years ago

Having looked at the threading in testharness a bit, I'm not sure that as written it actually does much except speed up I/O, due to the Python global interpreter lock. I'm having a go at testing a migration from the threading module to multiprocessing, and it's promising, but it might need another pair of eyes or two.

@drhodrid for pure testharness runs, I'm confident your request 1) is doable. The code already replaces the mpiexec in tests with mpiexec -n <nprocs>, so providing the other launcher has an equivalent syntax, this is just a matter of adding another option. I'm willing to attempt to do the job if you'll undertake to test it. On the other hand, if you're using pbs scripts, it's your own problem.

jrper commented 6 years ago

The multithreading updates to testharness are in 13051b2b3532a3485d8873bd9ae9ff758eda54c0.

Patol75 commented 6 years ago

@jrper Indeed, good point on switching from threading to multiprocessing. I used this library a few months ago, and I had read online (here) that Python is actually unable to achieve the expected multi-threading speed-ups because of the GIL (see also this for why there is a GIL). Regarding your implementation, you may want to use a Manager Queue instead of just a Queue (see here). You may not need a Lock either (in the script I have, I can recover all my processed data through multiprocessing.Manager().Queue().get() calls in a for loop, but that may be different as I am using a Pool). Another interesting point is that you can pass arguments to the Process call (like the Queue object), see here.

jrper commented 6 years ago

As far as I can tell from the documentation, switching to a managed queue only makes sense if there's a desire to test using a cluster in multi-node configuration. This isn't really the use case anyone has put testharness to so far, which is basically single machine, single process (due to exactly the issues you note). In testing, I found that the multiprocessing.Pool object just wasn't versatile enough for the problem we have here, where the subprocesses themselves can each spin off multiple MPI processes. The remaining difficulty is mostly scheduling. It's possible the right answer is to start with the parallel jobs and then spin up a pool of serial workers as they complete, but that needs a load of testing I'm not sure I have time for.

@drhodrid See afa8f9660aad54f0651f4073d61cccceb8eab212 for the changes to make the choice of MPI launcher a variable. This adds a new option, --mpi-launcher or -m, used like `testharness -m "srun -n"`. In general, the launcher being called has to accept a space and then an integer specifying the number of processes.

tmbgreaves commented 6 years ago

Coming back to this now after getting derailed onto opesci for a while. I think this is down to two areas of test failures, both of which pass when run in the same container but interactively, so probably some artifact of the Jenkins environment. Poking at it ...

tmbgreaves commented 6 years ago

Looks like these are OpenMPI issues with /proc/mounts containing very long entries as a result of the Jenkins overlays, and the OpenMPI error is polluting tests which check error files.

tmbgreaves commented 6 years ago

Attempting a workaround for this with some creative use of sed in the test run command.

tmbgreaves commented 6 years ago

That 'fixes' the netcdf test.

tmbgreaves commented 6 years ago

Working my way back up the comments - @drhodrid - very happy to work on getting some testing on the ANU systems. I will think through what's involved with that; it sounds like it could be easier for you to set up something local which interfaces with GitHub to post build outcomes, as opposed to trying to tie it in to the existing IC-based master, which would need to get through the ANU firewall to talk to your builders.

tmbgreaves commented 6 years ago

Completely baffled for the moment by the three remaining failures on Ubuntu, which I've failed to replicate outside Jenkins - so I've disabled them pending more detailed debugging. If anyone has ideas, please take a look at the run which focuses on these errors - https://jenkins.ese.ic.ac.uk:8000/blue/organizations/jenkins/Fluidity%20on%20Xenial/detail/jenkins-fix-failing-tests/4/pipeline/30 - any suggestions gratefully received.

Patol75 commented 6 years ago

@tmbgreaves Is the following of any help? https://stat.ethz.ch/pipermail/r-sig-hpc/2011-August/001053.html http://www.olmjo.com/blog/2012/09/30/openmpi-warning/

drhodrid commented 6 years ago

Hi Tim,

The systems I’m referring to are not ANU systems. They’re national clusters and, accordingly, can be accessed outside of the ANU (Steph already accesses Raijin from Imperial, for example). My motivation for having them as part of the main testing repository, rather than something stand-alone within Australia, is that:

  1. They provide system environments that are very different to those of cx1/cx2 and others in the UK.
  2. We’ve set things up to run with different compiler versions, different versions of external libraries etc…

Taken together, this would allow us to flag issues prior to merges that would not be picked up otherwise. As an example, see the issues caused by James' recent change to VTK, which we had to fix post-merge.

A further bonus is that it would allow us to tie in the vast expertise at the NCI and Pawsey facilities within Australia, when it comes to compiler issues, debugging etc… I’m already aware that these people are more than happy to help us in this process.

Best wishes,

Rhod

jrper commented 6 years ago

@tmbgreaves Looking at the results of the failing tests, it looks like the C++ code to write the backtrace has flushed to disk, but for some reason the Fortran code before it hasn't. Investigating deeper, the assumption appears to have been made that the Fortran error unit is 0, which I guess might not be true inside docker, inside Jenkins. With recent versions of gfortran we can get these units from iso_fortran_env; see d177b25e4dda75719dcd987c4510ce1c8614cb74.

The other conceivable possibility is that a buffer isn't getting flushed somewhere, but that gets more involved.

tmbgreaves commented 6 years ago

@drhodrid - That sounds great. Once we have a more solid setup on Jenkins I am very happy to work on making it happen. As a side note - I'm not sure if/how we'll get ICHPC tied into the new Jenkins setup, as our buildbot methods really don't suit the Jenkins configuration. I'm considering leaving a skeleton buildbot up to keep track of ICHPC builds and the longtests.

@jrper - thanks for looking at it - how recent does gfortran need to be to get that update? I guess since we're seeing issues with post-gcc5 in other areas this is something that will stay on the wish-list for the moment?

tmbgreaves commented 6 years ago

I've filed a PR for the initial Jenkins setup - if there aren't any major objections to how I've set it up I'd be keen to get it into master so we have some CI in place, after which I'll re-branch and set up a more comprehensive testing suite.

tmbgreaves commented 6 years ago

@Patol75 - many thanks for that tip, has cleaned up the 'missing infiniband' output :-)

jrper commented 6 years ago

Digging further, that seems to be one of the bits of Fortran 2003 which actually made it into gfortran 4, so I think we're OK for any compiler we'd like to claim we support. Regardless, if you're willing to test whether that change alone is enough to sort out the problem, I'm willing to harden the changes against incompatible compilers. More importantly, if it doesn't fix the issue, then we have to treat the error logs on the Jenkins builds as always potentially incomplete.

jrper commented 6 years ago

Brief update: I can get the same test failures locally in a Jenkins + Docker setup, so that seems to be the necessary combination. Unfortunately the iso_fortran_env change isn't enough to clear the error.

stephankramer commented 6 years ago

FWIW: I can reproduce the failure locally if I redirect the output of testharness (i.e. ../tools/testharness.py -f unresolvable_mesh_dependency.xml >& out). The problem for me seems to go away if I put in

  flush(debug_error_unit)
  flush(debug_log_unit)

right after the ewrites in flexit_pinpoint and flabort_pinpoint.

jrper commented 6 years ago

That seems to be enough inside Jenkins as well, and is probably A Good Thing™ to do. The downside is that it would only get called on an abort/exit call, unless we modify the ewrite macro itself.

jrper commented 6 years ago

Two minutes of googling suggests that we also want to set the environment variable GFORTRAN_UNBUFFERED_ALL=1 to force unbuffered I/O in all situations.
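
Like the OMPI_MCA_btl setting suggested earlier, this could sit in the Jenkinsfile environment directive so every step picks it up:

    environment {
        // force gfortran to flush output immediately so error logs are complete
        GFORTRAN_UNBUFFERED_ALL = '1'
    }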

tmbgreaves commented 6 years ago

Thanks for the suggestions - I'll act on them for the follow-up Jenkins tweaking PR.

jrper commented 6 years ago

@tmbgreaves Since you've announced CI as open via email, is there anything we should do to get PRs migrated?

tmbgreaves commented 6 years ago

It might need the Jenkinsfile adding to the branch, but I was hoping it wouldn't, as the build tests the branch-into-master merged source tree, which will contain a Jenkinsfile. It won't trigger a build immediately since there's no change to the PR, but I suspect a repo rescan will sort that out - I'll do one now.