FluidityProject / fluidity

Fluidity
http://fluidity-project.org

Bugfixing gcc10 #308

Closed tmbgreaves closed 3 years ago

tmbgreaves commented 3 years ago

gcc10 is the default on Archer2 and is also becoming the default on Ubuntu as of Groovy. An initial test build on Archer2 suggests there are some outstanding bugs with gcc10 which need fixing.

To-do list for this issue:

tmbgreaves commented 3 years ago

Self-assigning: building a docker container on groovy.

tmbgreaves commented 3 years ago

Almost complete with the Groovy package set, bar Zoltan, which fails to build if --enable-f90-interfaces is passed to configure. Without that flag it builds fine, which is how the distribution Trilinos package is built.

The error I get is: https://pastebin.com/RpZAv33d

Reproduce with:

docker run -it fluidity/debugimages:groovy20210325-000 bash -il
cd zoltan-3.83/Build/
make

@stephankramer I wonder if you have seen this one via Debian and whether you know if there's an easy fix?

tmbgreaves commented 3 years ago

Documenting progress with the Fluidity build: enabling sam to work around the missing zoltan, I get to:

        FC Petsc_Tools.F90
Petsc_Tools.F90:1129:31:

 1129 |       PETSC_NULL_INTEGER, nnz, 0, PETSC_NULL_INTEGER, M, ierr)
      |                               1
......
 1305 |       PETSC_NULL_INTEGER, d_nnz, PETSC_NULL_INTEGER, o_nnz, M, ierr)
      |                                 2
Error: Rank mismatch between actual argument at (1) and actual argument at (2) (rank-1 and scalar)
gmake[1]: *** [Makefile:124: Petsc_Tools.o] Error 1
make: *** [Makefile:244: lib/libfluidity.a] Error 2
root@4d7b864cecf4:/groovy/fluidity# 

Reproduce with:

docker run -it fluidity/debugimages:groovy20210325-000 bash -il
cd fluidity
make

tmbgreaves commented 3 years ago

Reference Dockerfile for the debug container (fluidity/debugimages:groovy20210325-000) is https://pastebin.com/hiBD76xi

stephankramer commented 3 years ago

The compile errors in Fluidity are actually the same as in Zoltan. They are to do with shape mismatches between the actual arguments (in the call) and the dummy arguments (in the called subroutine), which are allowed in Fortran (old-fashioned, but unfortunately ubiquitous, for instance in MPI calls), but apparently you now do have to be consistent between different calls in the same source file - see https://gcc.gnu.org/gcc-10/porting_to.html. The easiest fix for Zoltan, I think, is to just use the -fallow-argument-mismatch flag that is mentioned there. For Fluidity I'd like to avoid that if we can, so I've pushed some fixes to the PETSc calls in a gcc10-fixes branch. I'm now stuck at it not compiling h5hut, which I think we've discussed previously with @angus-g? @angus-g, could you remind me - do you know how to get h5hut to compile with gcc10?
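
To make it concrete, here is a minimal sketch (hypothetical code, not taken from Fluidity or PETSc) of the pattern gcc10 rejects; set_preallocation stands in for an implicitly-interfaced external library routine, so all the compiler can do is compare the two calls against each other:

! Compiled with gfortran 10 this stops with "Rank mismatch between actual
! argument at (1) and actual argument at (2) (rank-1 and scalar)", because
! one call passes a rank-1 array and the other a scalar to the same external
! routine. Adding -fallow-argument-mismatch downgrades the error to a
! warning, which is the easy way out for Zoltan. The snippet only
! demonstrates the compile-time check; linking would need a real definition
! of set_preallocation.
program rank_mismatch_demo
  implicit none
  external :: set_preallocation   ! hypothetical routine, no explicit interface
  integer :: nnz(4)
  nnz = 1
  call set_preallocation(4, nnz)  ! rank-1 actual argument
  call set_preallocation(4, 0)    ! scalar actual argument: rejected by gcc10
end program rank_mismatch_demo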

tmbgreaves commented 3 years ago

Thank you very much @stephankramer ! The h5hut fixes are as per lines 71-73 in:

https://github.com/FluidityProject/buildscripts/blob/main/uk/ac/archer2/build.sh

I don't suppose you fancy wrapping those into your branch too? :-)

stephankramer commented 3 years ago

Ah, so these were for gcc9 already. Strangely they don't seem to be needed for gcc10, but I added them in anyway as presumably we also want to compile with gcc9. My failure was actually later on, in femtools/H5hut.f90, where it includes Fortran files from h5hut/. Fixed in h5hut/ now. Then there were some further small fixes and it now builds (including fltools). I see some segfaults in tests though, so I'm not entirely sure we're there yet.

tmbgreaves commented 3 years ago

If we're through the build that sounds like great progress - thank you very much @stephankramer ! Shall I aim to get a CI build of unit and short tests going this week so we can take stock of where the problems are? I've been tinkering around getting a Github Actions workflow going which I could set up with Bionic, Groovy, and Focal to test (it's currently clean on bionic up to the end of short tests).

stephankramer commented 3 years ago

Yes, that sounds good. It seems at least some of my segfaults are to do with a bug in PETSc >= 3.14.3, reported here: https://gitlab.com/petsc/petsc/-/issues/868

tmbgreaves commented 3 years ago

Thank you very much @stephankramer ! I think that's probably not a major issue for us at the moment: other than the Groovy build (which is very unofficial, being non-LTS), everything bar the very recent build on the Young cluster is using v3.13.3 or earlier. But getting it fixed before the PETSc version that makes it into Ubuntu 22.04 is packaged would be great :-D

For the purpose of revealing other bugs, would it be useful for me to rebuild the Focal PETSc package (3.13.3) into Groovy?

tmbgreaves commented 3 years ago

Actions support on Bionic added with commit 8cd03f93f. Once I have a zoltan package built and pushed to the PPA I'll rebuild a base groovy package (with PETSc 3.13.3 if that's useful), push it to dockerhub as an additional baseimage, then add an Actions builder for Groovy.

tmbgreaves commented 3 years ago

Interestingly, that seems to have thrown up that the H5hut changes aren't backward-compatible with bionic.

tmbgreaves commented 3 years ago

I've now pushed a new groovy zoltan package and a fluidity-dev package that depends on it.

@stephankramer - in the process I noticed that the PETSc built for Groovy is 3.13.4, so earlier than the version you reported the bug in. I'm not sure if this implies a different source for the segfaults you were seeing, or that there are later patches backported into the Ubuntu build?

tmbgreaves commented 3 years ago

I have a groovy base container which completes the configure and build locally, so I've pushed it to dockerhub as a new baseimage and added a groovy Actions workflow in 0242daa.

tmbgreaves commented 3 years ago

I appear to have messed up the Zoltan package build somehow, or the build container setup - /usr/include/zoltan.mod is present but for some reason not being picked up by configure. This looks to be common to both the focal and the groovy builds. To be investigated...

tmbgreaves commented 3 years ago

It appears to be fixed if FCFLAGS="-I/usr/include" is explicitly set, but not otherwise. I'd thought that would be a default include path, but apparently not. Answers on a postcard?

tmbgreaves commented 3 years ago

There look to be all kinds of fun and games awaiting us in Focal, with both the vtkpython segfaults seen elsewhere popping up and the GMSH4 requirements addressed in PR #235. I'll push an interim pair of focal and groovy containers with GMSH v2 to mask out those failures. It's not entirely clear to me where best to go with this issue at the moment - it seems to be growing well beyond the initial remit of fixing gcc10 bugs, which might be a good thing but might also make the changeset a bit of a behemoth.

tmbgreaves commented 3 years ago

With GMSH reverted to 2.x, Actions passes all unittests for Groovy and fails the following short tests, which I'll dig into next week. See https://github.com/FluidityProject/fluidity/actions/runs/712292922 , though for some reason the JUnit publishing isn't working properly.

Summary of test problems with failures or warnings:

check_use_statements.xml: W
lagrangian_detectors_3d_1e2.xml: F
parallel_coarsening_2d.xml: F
channel_wind_drag_parallel.xml: F
parallel_refinement_2d.xml: F
parallel_p0_consistent_interpolation_3d.xml: F
spherical_patch.xml: F
flredecomp_repart.xml: F
particle_attributes_checkpoint_parallel.xml: F
flredecomp.xml: F
particle_attributes_checkpoint_parallel_flredecomp.xml: F
lagrangian_detectors.xml: F
viscosity_2d_p0_parallel_local_assembly.xml: F
heat_2d_dg_local_assembly.xml: F
pvtu2vtu_dg.xml: F
flredecomp_2d.xml: F
flredecomp_1d.xml: F
lagrangian_detectors_increasing_rotation.xml: F
lagrangian_detectors_oscillating_rotation.xml: F

Passes: 1540 Failures: 18 Warnings: 1

tmbgreaves commented 3 years ago

The set of failures up to the end of short testing in Focal is the same.

tmbgreaves commented 3 years ago

Of these, all but three pass for me locally in the Actions-built container; the three that fail are the ones which require more cores than I have locally. I guess the failures in the list above may be similar problems with Actions resources. I'll aim to get a more generously resourced test up and running with a self-hosted builder.

Bottom line though is that this looks to be in decent shape other than resourcing.

tmbgreaves commented 3 years ago

In 7419cb0 I've added in medium testing for focal, and have connected the existing buildbot/Jenkins workers to github actions as self-hosted runners. I've also attempted to separate out the JUnit reporting with the goal of being able to get more useful pass/fail status on jobs - this is still a work in progress.

In 867c69a the tests which were in short but were failing on Actions due to insufficient resourcing are moved over to medium testing. Three of these were over the limit of 'up to four cores' that I think we had before for short tests; the rest needed three or four cores, but GHA only appears to provide two. Given the small number of tests this affects, I think it's probably reasonable to change the short testing limit to two cores so that short testing can run on a github-hosted agent, but I'd welcome thoughts on this.

tmbgreaves commented 3 years ago

On a second round of sweeping >2 core tests into medium, all the groovy and focal failures in Actions appear to be a result of requesting more cores than available on github agents.

Now moving on to looking at how to get mediumtesting working on Actions.

tmbgreaves commented 3 years ago

Some progress - I now have mediumtesting switched on for focal and groovy, but I missed that timeouts need to be explicitly increased on self-hosted builders, so it timed out again overnight. Running again now, so I'm hoping to have a final list of failing tests by the end of the day (wildly optimistic timescales 'r' us).

The main remaining bug I can see at the moment is the failed build on Bionic with the H5hut change.

One additional formatting job on my list - it looks like JUnit test reports only report on one of the workflows, which may make a case for condensing the three Ubuntu builds into one workflow.

tmbgreaves commented 3 years ago

@jrper - when you are looking at this again, might you please be able to take a look at the one warning test belonging to you that we have in focal/groovy? It is returning:

check_use_statements: Running warning tests: 
check_use_statements: Running Check_use_statements:
../../preprocessor/synthetic_bc.F90:
  use fldebug
  use spud
  use global_parameters, only: dt, option_path_len
  use elements
  use mpi_interfaces
  use parallel_tools
  use transform_elements
  use fetools
  use fields
  use state_module

check_use_statements: warning.
check_use_statements: W

To reproduce:

docker run -it fluidity/actions:focal-f697238c600f02ee2ae923fefa055ab5297fa624 bash -il
./bin/testharness -f check_use_statements.xml

tmbgreaves commented 3 years ago

Actions now combined into a single workflow which (hopefully) sorts out problems with JUnit reporting. First example that's actually using it is : https://github.com/FluidityProject/fluidity/actions/runs/722830326

Assuming that completes successfully I'll look at squashing down the long list of development commits in this branch and force-pushing a neat few commits. Please shout if that's going to cause havoc for you with a local working copy of this branch (I'll only squash commits I've made, which are after anyone else's commits, and which don't touch files other people are touching in this branch).

Proposed squashing and rebasing:

pick 8cd03f93f Adding initial CI support for GitHub Actions
s 0242daaa7 CI/Actions: Adding a groovy workflow
s 78f233c73 CI/Actions: adding focal support, dockerfile updates
s 7419cb0e4 CI/Actions: add mediumtesting, separate out junit
s 84cea5feb CI/Actions: Fix output location for JUnit xml
s b9f0ce3b9 CI/Actions: Moving JUnit publishing into separate stage
s 49f0b1ba7 CI/Actions: Fix artifact handling
s b59ccafe9 CI/Actions: Updating artifact handling in all workflows
s 7ce6c093a CI/Actions: Fixing paths and cleaning workspace
s d2d86e970 CI/Actions: Fix broken copy between tests
s f697238c6 CI/Actions: Increase time limits on medium testing
s 272dd3b8b CI/Actions: Condensing builds and tests into one workflow
s ecec4efaf CI/Actions: Fixing branch name
s eb582c10e CI/Actions: Always run tests
s 9460625ad CI/Actions: Combining testing steps
pick 867c69a9f TESTS: change tests using more than 2 cores to medium
s 9d0a9da62 TESTS: moving tests from short to medium
pick ad0ad4e59 DROP WHEN FIXED: explicitly set FCFLAGS

I had some hesitations about wrapping the Actions config into this branch, but I think it's directly relevant: it adds testing that exercises the code changes being made here, and that testing will fail until this branch is merged, so merging both at the same time seems necessary.

tmbgreaves commented 3 years ago

First round of mediumtesting complete, the failure list from both focal and groovy is:

Summary of test problems with failures or warnings:

Stokes_may_solcx_stepx.xml: PFPPP
flredecomp_2d_fieldweighted.xml: F
particle_diagnostic_field_from_array.xml: F
particle_entrainment_of_dense_layer.xml: FP
particle_prescribed_vortex_flow.xml: F
particle_rayleigh_taylor_checkpoint.xml: PF
particle_rayleigh_taylor_instability.xml: PF
particle_rayleigh_taylor_mu10fold.xml: PF
particle_stratified_stable_layer.xml: FP
particle_stratified_unstable_layer.xml: PF
viscosity_2d_p0_parallel_local_assembly.xml: F

Passes: 1643 Failures: 11

I'm running these again locally to check and collate the errors, which I'll refer back to test owners if they look to be output errors or to the issue here if they look to be crashes.

tmbgreaves commented 3 years ago

@Cmath2 - I have eight of what look to my untrained eye to be tolerance failures from the Focal build. Please could you take a look at them? All these can be reproduced locally with:

docker run -it fluidity/actions:focal-f697238c600f02ee2ae923fefa055ab5297fa624 bash -il
./bin/testharness -f <test>.xml

Failures are:

particle_diagnostic_field_from_array.xml: test: abs(Max_conservation_error).max() < 5.0e-3 actual value: Max_conservation_error = -0.034805873967044446

particle_entrainment_of_dense_layer.xml: test: assert abs(Max_rms_error).max() < 1.0e1 actual value: Max_rms_error = 90.25800673280196

particle_prescribed_vortex_flow.xml: test: assert abs(Max_conservation_error).max() < 5.0e-3 actual value: Max_conservation_error = -0.034805873967044446

particle_rayleigh_taylor_checkpoint.xml: test: assert sqr_ent_error < 1.0e-3 actual value: Square Ent error 0.0015758734468341866

particle_rayleigh_taylor_instability.xml: test: assert sqr_ent_error < 1.0e-3 actual value: Square Ent error 0.006204633086326918

particle_rayleigh_taylor_mu10fold.xml: test: assert sqr_ent_error < 1.0e-3 actual value: Square Ent error 0.00564389399428004

particle_stratified_stable_layer.xml: test: assert abs(Max_rms_error).max() < 1.0e-2 actual value: Max rms error 0.2427288865846151

particle_stratified_unstable_layer.xml: test: assert sqr_ent_error < 1.0e-3 actual value: Square Ent error 0.013363048581355782

tmbgreaves commented 3 years ago

@drhodrid - I have a failure on focal in one of your tests which looks to my untrained eye to be a tolerance failure. Might you be able to take a look please? It can be reproduced locally with:

docker run -it fluidity/actions:focal-f697238c600f02ee2ae923fefa055ab5297fa624 bash -il
./bin/testharness -f Stokes_may_solcx_stepx.xml

Test is: assert(DeltaU_iterations_GAMG < 16) Actual value is: DeltaU_iterations_GAMG = 16

tmbgreaves commented 3 years ago

The last two failures are flredecomp_2d_fieldweighted and viscosity_2d_p0_parallel_local_assembly, which are the two 16-core medium tests and are failing to run due to insufficient resources. I guess a change in how core oversubscription is handled in focal vs. bionic is causing this, but it's not something I know much about - thoughts very much welcomed on this, and also on whether we want 16-core tests in medium or whether they should be moved to long (I'd be in favour of that, to aid local debugging...).

stephankramer commented 3 years ago

Since OpenMPI 3 you do indeed have to tell it when you are oversubscribing, which you can do with export OMPI_MCA_rmaps_base_oversubscribe=1. However, looking at https://github.com/FluidityProject/fluidity/issues/182, we actually fixed that in testharness, so I wonder why it's not picking that up...

tmbgreaves commented 3 years ago

Having tried that in our containers, it's definitely not being passed through correctly by testharness: if I set it by hand in the environment, the oversubscription is allowed. I've set it explicitly in the base containers and will force another build to see if that fixes things.

Confirming that both the 16-way tests pass for me locally.

tmbgreaves commented 3 years ago

The gcc10-fixes squash has been force-pushed to tidy up my long mess of CI commits, and also to drop the change of tests from short to medium, which is hopefully unnecessary now that oversubscription is allowed. The old commit structure is retained for the moment in gcc10-fixes-presquash for reference.

tmbgreaves commented 3 years ago

Pulling in another thread of discussion from PR #297, I think the bionic failure we're seeing here is as discussed there: we need to make the H5hut patch conditional on a configure check for the HDF5 version.

tmbgreaves commented 3 years ago

Fixed up the H5hut changes into conditional patching via a check in Fluidity's configure (see 8b2577a) and dropped the hard-coded changes which were committed to gcc10-fixes. Initial checks suggest this has sorted out the bionic build, but I'm waiting on a full test run to confirm that, and to check there aren't any other bugs introduced into the bionic build and testing.

tmbgreaves commented 3 years ago

Also, I nearly missed that there's one sporadically occurring failure in unit testing: test_invert (femtools/tests/test_invert.F90), which, at least on focal, gives different results on each run. Reproduce with:

docker run -it fluidity/actions:focal-f697238c600f02ee2ae923fefa055ab5297fa624 bash -il
cd femtools/
make libfemtools.a
cp libfemtools.a ../lib
cd tests/
make test_invert
export PYTHONPATH=$PYTHONPATH:/home/fluidity/python/
./test_invert

Out of 100 runs of this test I got failures on 22 of them, normally one invert failing (different ones on different failed runs), occasionally two failing.

It appears to be one of the files from prehistory and hasn't been touched in over a decade now. @jrper @stephankramer or indeed anyone else interested and with the skills to debug it, might you have a moment to look into this one please?

tmbgreaves commented 3 years ago

Noting in passing that on the latest run square-convection-parallel-trivial segfaulted on focal in:

https://github.com/FluidityProject/fluidity/runs/2291981287?check_suite_focus=true

and also on bionic:

https://github.com/FluidityProject/fluidity/runs/2291981207?check_suite_focus=true

but not on groovy for the same workflow. I haven't seen this failure before.

stephankramer commented 3 years ago

Having tried that in our containers, it's definitely not being passed through correctly by testharness: if I set it by hand in the environment, the oversubscription is allowed. I've set it explicitly in the base containers and will force another build to see if that fixes things.

Ah, that's probably because, as it appears, it's not actually set anymore in testharness! It seems fedbbde297e3ecc84cd36cf895f5959b85e3cabd has undone the changes in 45276a25805d19095edb8244806e15f267068480. @jrper: is there any reason we shouldn't add that back in?

jrper commented 3 years ago

I believe the two change sets should have been mutually orthogonal (although my memory of incremental changes made two and a half years ago isn't perfect), so if it works in the test containers, then go for it.

jrper commented 3 years ago

@tmbgreaves for the check_use_statements test, just reordering the uses in synthetic_bc.F90 in the way suggested (i.e. moving the use of mpi_interfaces up above the reference to parallel_tools) will fix it. I'm not sure whether you want me to make the change in a branch, and if so, which.

tmbgreaves commented 3 years ago

Thank you @jrper - I think @stephankramer has beaten us to it in 61e0cdd . If you happen to have any thoughts on the strangeness in test_invert though .... ?

stephankramer commented 3 years ago

Hopefully fixed in b76f1e02013c55cbf34ec60edc17bf75787b543e. The test tried to invert random matrices, which could have arbitrary condition numbers. Not entirely sure why this started happening now: it seems the random seed now actually gets initialised pseudo-randomly, so this might lead to more intermittent failures with badly designed tests.
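
For the record, a minimal sketch (an assumed illustration, not the actual femtools test or fix) of why a fixed tolerance on inverses of random_number() matrices is fragile, and how pinning the seed makes the test reproducible:

program random_invert_sketch
  implicit none
  integer :: seed_size
  integer, allocatable :: seed(:)
  real :: a(2, 2), det

  ! Pin the random seed so every run draws the same matrices. With the seed
  ! initialised pseudo-randomly instead, each run gets different matrices,
  ! some with very large condition numbers, and a fixed tolerance on the
  ! inversion error then fails intermittently.
  call random_seed(size=seed_size)
  allocate(seed(seed_size))
  seed = 42
  call random_seed(put=seed)

  call random_number(a)
  det = a(1, 1) * a(2, 2) - a(1, 2) * a(2, 1)
  ! A determinant near zero means the matrix is close to singular, so its
  ! inverse is dominated by round-off; a robust test either rejects such
  ! samples or constructs matrices with a bounded condition number.
  print *, 'determinant of the random 2x2 matrix:', det
end program random_invert_sketch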

gnikit commented 3 years ago

Not entirely sure why this started happening now

Hi, if I may put my 2 cents in: I think it is mostly related to switching to JUnit reports for the unittests, so all that has changed is that we now actually get notified of the failure. I went for a bit of dumpster diving on Jenkins and managed to find builds from before we switched to JUnit for unittesting that still failed; see master(46) and master(42).

tmbgreaves commented 3 years ago

Brilliant, thanks @stephankramer for the fix and @gnikit for the extra info :-)

In an attempt to avoid being tied to ageing hardware in ESE for medium testing, I did some timed testing of all the individual tests and discovered that we had a medium-length example running for three hours, which I've moved to long; that fixed a lot of our medium-test time bloat. It wasn't enough to get us safely inside the 6h time limit for github-hosted runners, so looking again I noticed that a significant number of the mediumtests now run quickly enough (sub-300s) to be reasonably reassigned as short tests. I've moved those across, which, with THREADS=2, leaves short tests completing in ~2h and mediumtests completing in ~4h30.

Test lengths reassigned in 022919a (long example) and 756adac (short-runtime tests to short) plus f8a5877 which removes the dependency on self-hosted runners and sets both test runs to THREADS=2.

tmbgreaves commented 3 years ago

Brief note that I'm not going to be looking at this for the next week but will attempt to curate through PR to merge after that.

tmbgreaves commented 3 years ago

@Cmath2 @drhodrid I think we're getting close to the gcc10-fixes being ready for PR - might you be able to take a look at the test failures in your tests flagged above, please? I can relax the tolerances if you're happy for me to. I think these are all failures which were already present in master and exposed by additional testing, not new failures from new code.

drhodrid commented 3 years ago

Hi Tim,

Sorry for being slow - I had a week off last week. I will get to this in the next few days.

Note that Chris has now left us for a position at the BOM, so I'll likely need to take ownership of his tests and look into the issue.

Best wishes,

Rhod

tmbgreaves commented 3 years ago

Thanks @drhodrid ! Hope you had a good break :-)

That would be great if you could cast an eye over Chris' tests, thank you. Would you like me to bulk-reassign tests owned by him to you?

drhodrid commented 3 years ago

Hi Tim,

Yep - may as well - it will save me the job ;-)

R

tmbgreaves commented 3 years ago

Done - c82076710. Give me a shout if I've missed any other blocks of tests.

tmbgreaves commented 3 years ago

We're now down to one bug other than Rhod's tests: square-convection-parallel-trivial is segfaulting, but only on bionic, and only on Actions. The closest I can get to the same run on Jenkins passes, and the test also passes when run locally in the same container used on Actions, so it looks like something subtle to do with the resources Actions gives to the container. Still trying to figure out how to debug this one.