FluidityProject / fluidity

Proposed re-write of tools/testharness.py #326

Open Patol75 opened 3 years ago

Patol75 commented 3 years ago

With this PR, I would like to propose a re-write of Fluidity's test harness. The main ideas are to potentially speed up the execution of the test suites and to provide a complete, well-formatted XML test report that includes everything redirected to stdout and stderr during test execution, thereby easing debugging when necessary. I say potentially with regard to the speed-up because I am unsure how GitHub Actions allocates resources, and this could require some changes. In the proposed implementation, a target number of logical cores is given and tests run concurrently without exceeding that target, except for tests that require more cores than the target, for which oversubscription is allowed (one oversubscribed test at a time).
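
A minimal sketch of the scheduling idea, assuming a fixed core budget; the names (TARGET_CORES, run_test) are illustrative and this is not the actual implementation:

import threading
from concurrent.futures import ThreadPoolExecutor

TARGET_CORES = 8  # illustrative target for the number of logical cores

def schedule(tests, run_test):
    """Run a list of (name, ncores) pairs concurrently within TARGET_CORES.

    A test needing more cores than the target claims the whole budget, so
    only one oversubscribed test runs at a time.
    """
    lock = threading.Condition()
    in_use = 0

    def worker(name, ncores):
        nonlocal in_use
        cost = min(ncores, TARGET_CORES)
        with lock:
            # Wait until enough of the core budget is free.
            lock.wait_for(lambda: in_use + cost <= TARGET_CORES)
            in_use += cost
        try:
            run_test(name, ncores)
        finally:
            with lock:
                in_use -= cost
                lock.notify_all()

    with ThreadPoolExecutor(max_workers=max(len(tests), 1)) as pool:
        for name, ncores in tests:
            pool.submit(worker, name, ncores)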

I have combined tools/testharness.py and python/fluidity/regressiontest.py into a single tools/testharness.py that handles the whole execution. I believe the file is clearly written, and I have added comments to improve readability; nevertheless, more may be needed.

I have modified a few tests which reported failures that are, I believe, mostly linked to a forgotten clean-up.

I have modified Makefile.in and .github/workflows/ubuntu.yml to account for the changes made to tools/testharness.py - hopefully, nothing problematic there.

I attach here an example XML output produced by running tools/testharness.py -n 8 -e exodusii -x tests_results.xml. Two test failures were reported, but I believe they are platform-dependent, as they do not occur on Actions, though I have not investigated.
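
Assuming a JUnit-style layout (which is what most CI tools consume), the report might look something like the following; the element names are illustrative and the exact schema produced by the script may differ:

<testsuite name="short" tests="2" failures="1">
  <testcase name="example_passing_test" time="12.3">
    <system-out>captured stdout goes here</system-out>
    <system-err>captured stderr goes here</system-err>
  </testcase>
  <testcase name="example_failing_test" time="4.5">
    <failure message="assertion failed">details of the failing Pass test</failure>
  </testcase>
</testsuite>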


From Actions results:

The above should yield a green run on Actions, except if something arises in longtests (untested on my side). However, I am not committing yet, to avoid another Actions run.

Unfortunately, there was no speed-up. @tmbgreaves any idea whether there is a way to request a given number of logical cores from Actions and use them within a single job? The short suite completes in less than an hour locally for me.


Having a further look at Actions, it seems to me that concurrency is only possible through parallel jobs, not through parallel tasks within a job. If this is correct, then the only way to speed up the test suites is to execute a single test per job, as is done for longtests.


This actually might be possible (see here). Say, in tools/testharness.py, I add --github as a command-line argument and use it in the following way:

if args.just_list:
    print("*** Found tests that match the input criteria")
    for test_xml in sorted(tests.keys()):
        print(f"\t-> {test_xml.stem}")
    if args.github:
        import json

        # Dump the matching test names so an Actions job matrix can be
        # generated from them.
        with open("github_tests.json", "w") as fid:
            json.dump([test.name for test in tests.keys()], fid)
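
For completeness, the flag itself could be registered with argparse along these lines (the help text is illustrative):

parser.add_argument(
    "--github",
    action="store_true",
    help="also dump the matching test names to github_tests.json "
         "so an Actions job matrix can be built from them",
)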

Then I have a JSON file at Fluidity's root that contains all the tests that need to be run. In the YAML, we could have the following jobs: building, unittest, reading-JSON, test (both short and medium), and longtest (both long and vlong). The idea is that, in "reading-JSON", we use the outputs capability from Actions:

outputs:
  matrix: ${{ steps.set-matrix.outputs.matrix }}
steps:
  - name: Set matrix for build
    id: set-matrix
    run: |
      bin/testharness --just-list --github
      echo "::set-output name=matrix::$(cat github_tests.json)"
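
(As an aside, GitHub has since deprecated the ::set-output workflow command; an equivalent step nowadays writes to the GITHUB_OUTPUT file instead, for example:)

run: |
  bin/testharness --just-list --github
  echo "matrix=$(cat github_tests.json)" >> "$GITHUB_OUTPUT"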

And in "test":

needs: reading-JSON
strategy:
  matrix: ${{fromJson(needs.reading-JSON.outputs.matrix)}}

which automatically generates the matrix, as is currently done manually for longtests. I would be happy to try this and see whether it gives a speed-up. However, I am very unfamiliar with Docker and unsure how to proceed. @tmbgreaves do you think that sounds possible?
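
To make the idea concrete, the consuming "test" job might look something like the sketch below, written with an explicit test key since github_tests.json is a plain list; the -f flag for pointing testharness at a single test file is an assumption, not confirmed:

test:
  needs: reading-JSON
  runs-on: ubuntu-latest
  strategy:
    fail-fast: false
    matrix:
      test: ${{ fromJson(needs.reading-JSON.outputs.matrix) }}
  steps:
    - name: Run one test
      # "-f <file>" meaning "run this single test file" is assumed here
      run: bin/testharness -f ${{ matrix.test }}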


As a proof of concept, the latest Actions workflow successfully passed Building, Unittesting and Mediumtesting on Bionic, Focal and Groovy in just over two hours. In comparison, the same load took about five hours in the previous successful Actions workflow. However, it is not clear to me whether the same benefit can be obtained for Shorttesting, as the overhead before each test actually executes would be too large (for example, see manual.xml in the latest workflow). Additionally, there are over 300 short tests, and the maximum number of jobs allowed in a given matrix is 256. It is therefore potentially better to execute them sequentially through a single call to testharness (passing -n 0 should do the trick). For Longtesting, the strategy should stay the same, except that we avoid explicitly writing the tests one by one in the YAML. That said, it looks like there are a few failures (e.g. top hat), so I will need to have a look. Finally, the XML reports could potentially be merged to avoid having so many of them if the proposed strategy is kept, but I do not think that is a requirement.


Based on the latest workflow, I have compiled a text file that lists all tests by how long they took to run. For each test, I have included the folder where the test is located and the length currently assigned to it. Potentially, we would want to re-assign some tests, for example so that no medium test is faster than a short test (and likewise for the other length pairs). Something to keep in mind is that execution is fast on Actions: a given test can take about three times longer to run on my desktop machine (for example, popbal_nonhomog2_2d_dg_p2 takes 6127 seconds on my machine versus 2170 seconds on Actions). As a suggestion, we could have:

Additionally, lock_exchange_3d_parallel and restratification_after_oodc exit with code 137 (SIGKILL, often an out-of-memory kill), which I have not looked at yet. The lock_exchange example is failing because of le_tools.py; I have fixed it, and it should take about 2600 seconds to run. saltfinger2d_adaptive fails while meshing; I cannot reproduce the failure locally (gmsh 3.0.6 and 4.7.1). particle_stratified_stable_layer does not fail (see the mistake mentioned above); from a previous run, it completes in 11708.586653981 seconds and should therefore be moved to vlong according to the above.

Finally, GitHub says we are running out of disk space as, I believe, the Actions artifacts accumulate; older ones should be deleted. Also, on an unrelated note regarding Ubuntu releases, do we plan on replacing groovy with hirsute?
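
On the disk-space point, if the artifacts are uploaded with actions/upload-artifact, one option is to cap how long they are kept, for example:

- uses: actions/upload-artifact@v2
  with:
    name: test-results
    path: tests_results.xml
    retention-days: 5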

tmbgreaves commented 3 years ago

Many thanks for this and apologies for the slow response - I'll do my best to look through it and comment by the end of the week!

Patol75 commented 3 years ago

@tmbgreaves Thank you very much for your review, and thank you for explaining what special tests actually are - I had no idea. If these particular tests are to be removed, I will be happy to make the changes you highlighted. Thank you as well for pointing out the change of strategy regarding failures in the building jobs. I feel it is preferable to fail entirely if any build fails because, in any case, we would have to re-run everything after the build is fixed. But, indeed, that is to be discussed.

I have cancelled the Actions workflow you restarted as it will not be useful in the current circumstances - sorry about the confusion. To understand why, please have a look at the latest update of the OP in this PR. I have kept updating it after getting results from Actions (potentially not best practice, but I did not want to flood the PR with lots of comments). I would actually be grateful to have your thoughts on what I am proposing there.

There are currently no intentionally failing tests. I have had a play locally with all the various kinds of failures I could think of, and everything seemed to behave properly. I could add some failures to illustrate the results, but it would potentially be preferable for a reviewer to do it, as they could come up with failure modes I may have overlooked. Additionally, some Actions workflows returned failures and thereby provide examples of the current strategy used to deal with them.

Thank you again for your review.

tmbgreaves commented 3 years ago

Thanks @Patol75 - now gone back and reread the OP :-)

Comments on thoughts there:

Patol75 commented 3 years ago

Thanks again :)

Additionally, I have noticed that, following the cancelled Actions workflow, the output archive has been overwritten. Fortunately, I had already downloaded it. Please access it here.

tmbgreaves commented 3 years ago

Thank you @Patol75 :-)

One more thought on the back of your reassigning of lengths: the 'longtests' repo was initially conceived to try to reduce the size of the Fluidity repo a bit by moving very large tests (which generally corresponded to very long tests) into a separate repo. However, there are now a reasonable number of 'long' tests which are easily small enough to live in the main repo.

Rather than moving newly-long tests from main repo to longtests repo, might it be worthwhile leaving them where they are, and potentially even moving some of the smaller longtests back into the main repo?

Thoughts very much welcomed on this :-)

drhodrid commented 3 years ago

Hi Tim,

I don’t see any harm in moving some of the long tests back into the main repo, provided they remain tagged as “long” - it’s always nice to be able to run the “short” and “medium” test suites on local systems.

Out of interest, which long tests are excessively large? Are there obvious reasons (for example meshes have been committed rather than just .geo files)? How many of these can be trimmed?

R

tmbgreaves commented 3 years ago

Agreed, @drhodrid - definitely keen to keep the 'long' tags to ensure short/medium runs are manageable.

Here are the >1MB results from tests/ and a size-sorted list from longtests: https://pastebin.com/gmapxhMX

I'd propose at minimum moving all the <1MB tests from longtests/ to tests/, and possibly renaming the 'longtests' repo to 'extratests' or something like that, to avoid confusion?

Patol75 commented 3 years ago

@tmbgreaves @drhodrid Thank you for this additional information. I will keep/put all vshort, short and medium tests in /tests, and do the same for small long tests - 1 MB seems like a reasonable threshold. On second thoughts, what about putting all tests larger than 500 KB in the to-be-renamed longtests repo? Judging from the output of du -sh * | sort -h in /tests and /longtests, this corresponds to 74 tests, as opposed to 52 tests larger than 1 MB. I agree that would take away some shorter-length tests, but if this additional repository is here to reduce the size of the main repository, that would do it. It is worth noting that /tests/data is 34 MB, but I guess we have to live with that one.

Regarding the length re-assigning I did yesterday, it triggered a few interesting events. First, as I did not use lxml, quite a few comments got lost, which likely means I will have to revert the commit and redo the re-assigning without losing the comments this time. Then, following the commit, the Short tests jobs would not complete on Actions. The reason is that testharness was hanging in the pattern-matching strategy included in the process_error function. What happened is that diamond_validation reported failures following the length re-assigning. Failure processing in the proposed implementation involves parsing stdout using a regex; however, the stdout yielded by diamond_validation is lengthy (I will include a re-write of the XML for this test to limit the generated output while keeping the relevant information). It turned out that the regex I had designed was poorly equipped to deal with lengthy outputs, which is why testharness was hanging. I have therefore updated the regex to make it more efficient.

That seems to have worked, as the current Actions workflow completed all three Short tests jobs and reported each time that diamond_validation failed (I will look into what causes the Annotations entries to display only diamond_validation). Looking at the output, validation failures are indeed reported for a number of XML files. However, what leaves me puzzled is that the tests for which XML validation fails actually succeed when executed through testharness. Is that the expected behaviour? Or is there a further problem in diamond_validation apart from the lengthy output?
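
On the regex hang, a minimal illustration of the failure mode (this is not the actual testharness regex, just an example of the kind of pattern that misbehaves on long input):

import re

# Nested, ambiguous quantifiers such as (\s*\S+)* can backtrack exponentially
# when the overall match ultimately fails on a long line of captured stdout:
pathological = re.compile(r"error:(\s*\S+)*!")

# Anchoring the pattern and avoiding nested quantifiers keeps matching cheap
# even on very large outputs:
robust = re.compile(r"^error:.*$", re.MULTILINE | re.IGNORECASE)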


To be more precise about the validation failures, the actual reason for failure is the valid attribute of the first item in the list returned by the read method of the Schema class in libspud/diamond/diamond/schema.py. The attribute is False for all failing XML files.

Patol75 commented 3 years ago

I think @drhodrid is potentially right about why some tests have a large directory size. Many of the tests in longtests that I moved to tests for a test run on Actions failed because some of their input files had been cleaned up. Additionally, python_validation reported quite a few failures for some of them too. From my understanding, all of these are tests we intend to keep in Fluidity and, therefore, they should be fixed.

To make things easier, I have pushed another branch called testharnessClean where I have put only the essential changes made so far. This is much easier to look at than having to go through all the commits of this PR. Again, from my understanding, there are still some steps to undertake:

Depending on how much the tests can be trimmed, the longtests repository might not be needed anymore. If it is still needed, then @tmbgreaves idea of renaming it makes sense: extratests, largetests, or anything else that is appropriate.

Additionally, there is still the question: is the fact that some XML files fail in diamond_validation but execute properly through testharness the expected behaviour?


On testharnessClean, I have copied all tests from longtests into tests and correctly re-assigned all test lengths (at least, this time, I hope so!). This is currently running on Actions. If the results are good, the first bullet point above will be done.

Patol75 commented 3 years ago

I have made further progress on the testharnessClean branch. Excluding the now 35 vlong tests, the whole test suite takes about 6 hours to run on Actions. I have improved the way errors generated by Python tests are dealt with in testharness, and it looks robust now. Nonetheless, the short-tests job on bionic yielded an error in testharness which did not occur in the focal and groovy jobs and which I did not manage to reproduce locally. I guess I will have to trigger a specific Actions run with some print statements to figure out what is going on there. Other than that, many tests actually fail, as summarised in the Annotations panel of the workflow linked above. I think these are legitimate failures that need to be addressed.

Regarding the bullet points I mentioned in the message above, length re-assigning is done but might need further tuning once all tests actually pass. I believe no comments have been lost in the XML files this time, but I have removed the DOCTYPE entries as they were pointing to non-existing files and were likely unnecessary. Additionally, I have removed some committed MSH files and modified the associated Makefiles to generate the meshes during test execution; this came about because copying all longtests directories into tests and committing them excludes files with a .msh extension by default. Nonetheless, quite a few tests remain large and need attention. To know which ones, simply check out testharnessClean, cd into tests and execute du -sh * | sort -h.
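
As an illustration of the kind of Makefile change involved (the target and file names are placeholders, not taken from a specific test):

# Generate the mesh at test time instead of committing the .msh file.
input: src/mesh.msh

src/mesh.msh: src/mesh.geo
	gmsh -2 -o $@ $<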

Finally, I will probably not be able to figure out all the failures by myself, so help would be appreciated.

Patol75 commented 3 years ago

An additional note on the current failures reported by Actions: most, if not all, of the failing tests are in directories for which I removed the committed MSH files and included gmsh instructions in the Makefile. I have played around with mms_tracer_P1dg_cdg_diff_steady, for which the first element of ab_convergence_gal_stat must be between 1.8 and 2.2 for the first Pass test to succeed. When using the already committed MSH files, the value is 1.88. When generating the MSH files from the committed GEO files, the value is always lower than 1.8 (e.g. 1.75 or 1.68, depending on the GMSH version used). Visualising the MSH files, the only obvious difference I could spot is that all the committed files have fewer nodes than the generated ones. To give another example, the committed MSH file of viscosity_2d_p0_adaptive_parallel has a 20x11 regular grid node structure, while generating the MSH file through GMSH yields a 21x11 regular grid node structure. Am I missing something here in how to generate the MSH files?

tmbgreaves commented 3 years ago

Apologies for being silent on this one for so long, @Patol75 - chaos this end on multiple levels has pulled me away from Fluidity work. Currently aiming to refocus on Fluidity mid-September, assuming other work packages go as expected.

adigitoleo commented 3 years ago

An additional note on the current failures reported by Actions: most, if not all, of the failing tests are in directories for which I removed the committed MSH files and included gmsh instructions in the Makefile. [...]

A small comment on this: I've experienced significantly different mesh outputs for the same instructions across gmsh "minor" version changes, e.g. 4.4 vs 4.8. In other words, gmsh doesn't really use semantic versioning. Maybe the .msh files were created using a slightly older gmsh version? Another reason against committing .msh files.

Patol75 commented 2 years ago

Thank you for looking into these proposed changes to the testing script, @stephankramer.

Unfortunately, the changes here are rather messy, and I had kept this PR open mainly for discussion. I had created another branch that would likely be a better candidate for a merge, even though it is now quite outdated. The main problem left in that other branch was deciding what to do with the tests that have their MSH file(s) committed; most of them were hosted in the longtests repository. I guess there are three options: (i) keep the committed MSH files and potentially rename longtests to largetests, given that it would be hosting all these tests that result in a large directory size; (ii) generate GEO files that can reproduce the meshes described by the MSH files; (iii) drop the tests that have MSH files committed. A combination of (i) and (ii) is probably what we should aim for, but I am happy to have your thoughts about it.

Regarding testharness.py, given that it is basically the same file in both branches, I will address your review comments here and include some changes I made for the CMake branch. It will be easy to move it to any branch afterwards.

stephankramer commented 2 years ago

Yeah, I agree that we should probably go for a combination of (i) and (ii) (and possibly (iii) :-)). This also ties into the fact that we need to upgrade to some newer version of gmsh (the current base images in which the tests are run all have gmsh 2.6 hard-coded). This is an old discussion and comes back to the question of how exactly a test should be reproducible/deterministic. On the one hand, you would like to see any unintended consequence of a commit, so by that school of thought the test should fail even if an answer changes just a little bit. This can only be achieved by having exactly the same mesh as input (i.e. committing it), as the meshing process in general is not reproducible. On the other hand, we also want to test the meshing step as part of the CI (in an integrated sense!), so there you would have tests fail only if the answer changes significantly. The trouble is that such tests can be hard to write. So, as a sort of compromise, we have tended to go for a mix of committed and non-committed meshes. But yes, the size of these meshes is another practical concern.

I think what you and @tmbgreaves suggested - integrating most of the long tests back into the main repo so that they can be scheduled in the most efficient way, and keeping only some exceptional ones that are either very long or have large meshes (we could also run these less frequently, as we currently do with vlong, and/or drop some of them) - sounds like a good plan. One thing to keep in mind: if you do (accidentally) end up putting some very large files back in, even on a branch, and want to undo that later, make sure to rewrite history; otherwise everyone will still download them when they clone.
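
On the history point, one way to strip accidentally committed large files from a branch before others clone it is git-filter-repo (the path below is a placeholder):

# Rewrite history so the large directory never appears in any commit.
# git-filter-repo removes the 'origin' remote as a safety measure, so it has
# to be re-added before force-pushing the cleaned branch.
git filter-repo --invert-paths --path tests/some_huge_test/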

Patol75 commented 2 years ago

Thanks again for all the reviews; the code looks better now. Happy to get more feedback if you have time.

Regarding the version of GMSH used in the CI, I have actually used the latest GMSH in the CMake branch for all tests, and it is true that a large number of tests reported failures. Some of them behaved as expected when -format msh2 was additionally supplied on the GMSH command line, but only a few. At that point, from an outsider's perspective, it is hard to judge whether the Python variable values obtained through a test are still appropriate, in which case the thresholds can be updated, or whether they are not. In the latter situation, once again, it is not straightforward to determine whether the test was not robust enough against mesh changes or whether something failed silently in the building/execution process.
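
For reference, the msh2 fallback mentioned above amounts to something like the following (file names are placeholders):

# Ask a recent gmsh to write the legacy MSH 2 file format that older readers expect.
gmsh -2 -format msh2 -o mesh.msh mesh.geo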

Ideally, I think we would want to have only GEO files and accept that answers can change, whilst also having tests sufficiently robust that answers do not change too much. The problem is that this could mean quite a lot of work for the tests that currently fail with newer versions of GMSH. I say could because a lot of those tests look very similar to me; there could be a common strategy to make them all more robust.

My experience is that only very few tests currently lack a GEO file, and a further few fail to run on Actions because of resource limits (time and/or number of CPU cores). These would indeed be good candidates for the alternative repository, although some committed MSH files are not that big. Moreover, other tests have committed files other than MSH that are quite big, so they could also be legitimate candidates.

Finally, regarding your comment about pushing large files, I think there are three branches where it most likely happened: testharness, testharnessClean and cmake_build_2022. Would deleting the branches be sufficient once the changes are incorporated into main (or discarded)?