FluidityProject / fluidity

Proposed re-write of tools/testharness.py #326

Open Patol75 opened 3 years ago

Patol75 commented 3 years ago

With this PR, I would like to propose a re-write of Fluidity's test harness. The main ideas are to potentially speed up the execution of the test suites and to provide a complete, well-formatted XML test report that includes everything redirected to stdout and stderr during test execution, thereby easing debugging when necessary. I say potentially with regard to the speed-up because I am unsure how GitHub Actions allocates resources, and this could require some changes. In the proposed implementation, a target number of logical cores is given and tests run concurrently without exceeding that target, except for tests that require more cores than the target, for which oversubscription is allowed (one oversubscribed test at a time).
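
A minimal sketch of the scheduling idea, assuming a fixed core budget; the names (TARGET_CORES, run_test) are illustrative and this is not the actual implementation:

import threading
from concurrent.futures import ThreadPoolExecutor

TARGET_CORES = 8  # illustrative target for the number of logical cores

def schedule(tests, run_test):
    """Run a list of (name, ncores) pairs concurrently within TARGET_CORES.

    A test needing more cores than the target claims the whole budget, so
    only one oversubscribed test runs at a time.
    """
    lock = threading.Condition()
    in_use = 0

    def worker(name, ncores):
        nonlocal in_use
        cost = min(ncores, TARGET_CORES)
        with lock:
            # Wait until enough of the core budget is free.
            lock.wait_for(lambda: in_use + cost <= TARGET_CORES)
            in_use += cost
        try:
            run_test(name, ncores)
        finally:
            with lock:
                in_use -= cost
                lock.notify_all()

    with ThreadPoolExecutor(max_workers=max(len(tests), 1)) as pool:
        for name, ncores in tests:
            pool.submit(worker, name, ncores)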

I have combined tools/testharness.py and python/fluidity/regressiontest.py into a single tools/testharness.py that handles the whole execution. I believe the file is clearly written, and I have added comments to improve readability; nevertheless, more may be needed.

I have modified a few tests which reported failures that are, I believe, mostly linked to a forgotten clean-up.

I have modified Makefile.in and .github/workflows/ubuntu.yml to account for the changes made to tools/testharness.py - hopefully, nothing problematic there.

I attach here an example XML output produced by running tools/testharness.py -n 8 -e exodusii -x tests_results.xml. Two test failures were reported, but I believe they are platform-dependent, as they do not occur on Actions, though I have not investigated.
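
Assuming a JUnit-style layout (which is what most CI tools consume), the report might look something like the following; the element names are illustrative and the exact schema produced by the script may differ:

<testsuite name="short" tests="2" failures="1">
  <testcase name="example_passing_test" time="12.3">
    <system-out>captured stdout goes here</system-out>
    <system-err>captured stderr goes here</system-err>
  </testcase>
  <testcase name="example_failing_test" time="4.5">
    <failure message="assertion failed">details of the failing Pass test</failure>
  </testcase>
</testsuite>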


From Actions results:

The above should yield a green run on Actions, except if something arises in longtests (untested on my side). However, I am not committing yet, to avoid another Actions run.

Unfortunately, there was no speed-up. @tmbgreaves any idea whether there is a way to request a given number of logical cores from Actions and use them within a single job? The short suite completes in less than an hour locally for me.


Having a further look at Actions, it seems to me that concurrency is only possible through parallel jobs, not through parallel tasks within a job. If this is correct, then the only way to speed up the test suites is to execute a single test per job, as is done for longtests.


This actually might be possible (see here). Say, in tools/testharness.py, I add --github as a command-line argument and use it in the following way:

if args.just_list:
    print("*** Found tests that match the input criteria")
    for test_xml in sorted(tests.keys()):
        print(f"\t-> {test_xml.stem}")
    if args.github:
        import json

        # Dump the matching test names so an Actions job matrix can be
        # generated from them.
        with open("github_tests.json", "w") as fid:
            json.dump([test.name for test in tests.keys()], fid)
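
For completeness, the flag itself could be registered with argparse along these lines (the help text is illustrative):

parser.add_argument(
    "--github",
    action="store_true",
    help="also dump the matching test names to github_tests.json "
         "so an Actions job matrix can be built from them",
)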

Then I have a JSON file at Fluidity's root that contains all the tests that need to be run. In the YAML, we could have the following jobs: building, unittest, reading-JSON, test (both short and medium), and longtest (both long and vlong). The idea is that, in "reading-JSON", we use the outputs capability from Actions:

outputs:
  matrix: ${{ steps.set-matrix.outputs.matrix }}
steps:
  - name: Set matrix for build
    id: set-matrix
    run: |
      bin/testharness --just-list --github
      echo "::set-output name=matrix::$(cat github_tests.json)"
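
(As an aside, GitHub has since deprecated the ::set-output workflow command; an equivalent step nowadays writes to the GITHUB_OUTPUT file instead, for example:)

run: |
  bin/testharness --just-list --github
  echo "matrix=$(cat github_tests.json)" >> "$GITHUB_OUTPUT"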

And in "test":

needs: reading-JSON
strategy:
  matrix: ${{fromJson(needs.reading-JSON.outputs.matrix)}}

which automatically generates the matrix, as is currently done manually for longtests. I would be happy to try this and see whether it gives a speed-up. However, I am very unfamiliar with Docker and unsure how to proceed. @tmbgreaves do you think that sounds possible?
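
To make the idea concrete, the consuming "test" job might look something like the sketch below, written with an explicit test key since github_tests.json is a plain list; the -f flag for pointing testharness at a single test file is an assumption, not confirmed:

test:
  needs: reading-JSON
  runs-on: ubuntu-latest
  strategy:
    fail-fast: false
    matrix:
      test: ${{ fromJson(needs.reading-JSON.outputs.matrix) }}
  steps:
    - name: Run one test
      # "-f <file>" meaning "run this single test file" is assumed here
      run: bin/testharness -f ${{ matrix.test }}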


As a proof of concept, the latest Actions workflow successfully passed Building, Unittesting and Mediumtesting on Bionic, Focal and Groovy in just over two hours. In comparison, the same load took about five hours in the previous successful Actions workflow. However, it is not clear to me whether the same benefit can be obtained for Shorttesting, as the overhead before each test actually executes would be too large (for example, see manual.xml in the latest workflow). Additionally, there are over 300 short tests, and the maximum number of jobs allowed in a given matrix is 256. It is therefore potentially better to execute them sequentially through a single call to testharness (passing -n 0 should do the trick). For Longtesting, the strategy should stay the same, except that we avoid explicitly writing the tests one by one in the YAML. That said, it looks like there are a few failures (e.g. top hat), so I will need to have a look. Finally, the XML reports could potentially be merged to avoid having so many of them if the proposed strategy is kept, but I do not think that is a requirement.


Based on the latest workflow, I have compiled a text file that lists all tests by how long they took to run. For each test, I have included the folder where the test is located and the length currently assigned to it. Potentially, we would want to re-assign some tests, for example so that no medium test is faster than a short test (and likewise for the other length pairs). Something to keep in mind is that execution is fast on Actions: a given test can take about three times longer to run on my desktop machine (for example, popbal_nonhomog2_2d_dg_p2 takes 6127 seconds on my machine versus 2170 seconds on Actions). As a suggestion, we could have:

Additionally, lock_exchange_3d_parallel and restratification_after_oodc exit with code 137 (SIGKILL, often an out-of-memory kill), which I have not looked at yet. The lock_exchange example is failing because of le_tools.py; I have fixed it, and it should take about 2600 seconds to run. saltfinger2d_adaptive fails while meshing; I cannot reproduce the failure locally (gmsh 3.0.6 and 4.7.1). particle_stratified_stable_layer does not fail (see the mistake mentioned above); from a previous run, it completes in 11708.586653981 seconds and should therefore be moved to vlong according to the above.

Finally, GitHub says we are running out of disk space as, I believe, the Actions artifacts accumulate; older ones should be deleted. Also, on an unrelated note regarding Ubuntu releases, do we plan on replacing groovy with hirsute?
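
On the disk-space point, if the artifacts are uploaded with actions/upload-artifact, one option is to cap how long they are kept, for example:

- uses: actions/upload-artifact@v2
  with:
    name: test-results
    path: tests_results.xml
    retention-days: 5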

tmbgreaves commented 3 years ago

Many thanks for this and apologies for the slow response - I'll do my best to look through it and comment by the end of the week!

Patol75 commented 3 years ago

@tmbgreaves Thank you very much for your review, and thank you for explaining what special tests actually are - I had no idea. If these particular tests are to be removed, I will be happy to make the changes you highlighted. Thank you as well for pointing out the change of strategy regarding failures in the building jobs. I feel it is preferable to fail entirely if any build fails because, in any case, we would have to re-run everything after the build is fixed. But, indeed, that is to be discussed.

I have cancelled the Actions workflow you restarted as it will not be useful in the current circumstances - sorry about the confusion. To understand why, please have a look at the latest update of the OP in this PR. I have kept updating it after getting results from Actions (potentially not best practice, but I did not want to flood the PR with lots of comments). I would actually be grateful to have your thoughts on what I am proposing there.

There are currently no intentionally failing tests. I have had a play locally with all the various kinds of failures I could think of, and everything seemed to behave properly. I could add some failures to illustrate the results, but it would potentially be preferable for a reviewer to do it, as they could come up with failure modes I may have overlooked. Additionally, some Actions workflows returned failures and thereby provide examples of the current strategy used to deal with them.

Thank you again for your review.

tmbgreaves commented 3 years ago

Thanks @Patol75 - now gone back and reread the OP :-)

Comments on thoughts there:

Patol75 commented 3 years ago

Thanks again :)

Additionally, I have noticed that, following the cancelled Actions workflow, the output archive has been overwritten. Fortunately, I had already downloaded it. Please access it here.

tmbgreaves commented 3 years ago

Thank you @Patol75 :-)

One more thought on the back of your reassigning of lengths: the 'longtests' repo was initially conceived to try to reduce the size of the Fluidity repo a bit by moving very large tests (which generally corresponded to very long tests) into a separate repo. However, there are now a reasonable number of 'long' tests which are easily small enough to live in the main repo.

Rather than moving newly-long tests from main repo to longtests repo, might it be worthwhile leaving them where they are, and potentially even moving some of the smaller longtests back into the main repo?

Thoughts very much welcomed on this :-)

drhodrid commented 3 years ago

Hi Tim,

I don’t see any harm in moving some of the long tests back into the main repo, provided they remain tagged as “long” - it’s always nice to be able to run the “short” and “medium” test suites on local systems.

Out of interest, which long tests are excessively large? Are there obvious reasons (for example meshes have been committed rather than just .geo files)? How many of these can be trimmed?

R

tmbgreaves commented 3 years ago

Agreed, @drhodrid - definitely keen to keep the 'long' tags to ensure short/medium runs are manageable.

Here are the >1MB results from tests/ and a size-sorted list from longtests: https://pastebin.com/gmapxhMX

I'd propose at minimum moving all the <1MB tests from longtests/ to tests/, and possibly renaming the 'longtests' repo to 'extratests' or something like that, to avoid confusion?

Patol75 commented 3 years ago

@tmbgreaves @drhodrid Thank you for this additional information. I will keep/put all vshort, short and medium tests in /tests, and do the same for small long tests - 1 MB seems like a reasonable threshold. On second thoughts, what about putting all tests larger than 500 KB in the to-be-renamed longtests repo? Judging from the output of du -sh * | sort -h in /tests and /longtests, this corresponds to 74 tests, as opposed to 52 tests larger than 1 MB. I agree that would take away some shorter-length tests, but if this additional repository is here to reduce the size of the main repository, that would do it. It is worth noting that /tests/data is 34 MB, but I guess we have to live with that one.

Regarding the length re-assigning I did yesterday, it triggered a few interesting events. First, as I did not use lxml, quite a few comments got lost, which likely means I will have to revert the commit and redo the re-assigning without losing the comments this time. Then, following the commit, the Short tests jobs would not complete on Actions. The reason is that testharness was hanging in the pattern-matching strategy included in the process_error function. What happened is that diamond_validation reported failures following the length re-assigning. Failure processing in the proposed implementation involves parsing stdout using a regex; however, the stdout yielded by diamond_validation is lengthy (I will include a re-write of the XML for this test to limit the generated output while keeping the relevant information). It turned out that the regex I had designed was poorly equipped to deal with lengthy outputs, which is why testharness was hanging. I have therefore updated the regex to make it more efficient.

That seems to have worked, as the current Actions workflow completed all three Short tests jobs and reported each time that diamond_validation failed (I will look into what causes the Annotations entries to display only diamond_validation). Looking at the output, validation failures are indeed reported for a number of XML files. However, what leaves me puzzled is that the tests for which XML validation fails actually succeed when executed through testharness. Is that the expected behaviour? Or is there a further problem in diamond_validation apart from the lengthy output?
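
On the regex hang, a minimal illustration of the failure mode (this is not the actual testharness regex, just an example of the kind of pattern that misbehaves on long input):

import re

# Nested, ambiguous quantifiers such as (\s*\S+)* can backtrack exponentially
# when the overall match ultimately fails on a long line of captured stdout:
pathological = re.compile(r"error:(\s*\S+)*!")

# Anchoring the pattern and avoiding nested quantifiers keeps matching cheap
# even on very large outputs:
robust = re.compile(r"^error:.*$", re.MULTILINE | re.IGNORECASE)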


To be more precise about the validation failures, the actual reason for failure is the valid attribute of the first item in the list returned by the read method of the Schema class in libspud/diamond/diamond/schema.py. The attribute is False for all failing XML files.

Patol75 commented 3 years ago

I think @drhodrid is potentially right about why some tests have a large directory size. Many of the tests in longtests that I moved to tests for a test run on Actions failed because some of their input files had been cleaned up. Additionally, python_validation reported quite a few failures for some of them too. From my understanding, all of these are tests we intend to keep in Fluidity and, therefore, they should be fixed.

To make things easier, I have pushed another branch called testharnessClean where I have put only the essential changes made so far. This is much easier to look at than having to go through all the commits of this PR. Again, from my understanding, there are still some steps to undertake:

Depending on how much the tests can be trimmed, the longtests repository might not be needed anymore. If it is still needed, then @tmbgreaves idea of renaming it makes sense: extratests, largetests, or anything else that is appropriate.

Additionally, there is still the question: is the fact that some XML files fail in diamond_validation but execute properly through testharness the expected behaviour?


On testharnessClean, I have copied all tests from longtests into tests and correctly re-assigned all test lengths (at least, this time, I hope so!). This is currently running on Actions. If the results are good, the first bullet point above will be done.

Patol75 commented 3 years ago

I have made further progress on the testharnessClean branch. Excluding the now 35 vlong tests, the whole test suite takes about 6 hours to run on Actions. I have improved the way errors generated by Python tests are dealt with in testharness, and it looks robust now. Nonetheless, the short-tests job on bionic yielded an error in testharness which did not occur in the focal and groovy jobs and which I did not manage to reproduce locally. I guess I will have to trigger a specific Actions run with some print statements to figure out what is going on there. Other than that, many tests actually fail, as summarised in the Annotations panel of the workflow linked above. I think these are legitimate failures that need to be addressed.

Regarding the bullet points I mentioned in the message above, length re-assigning is done but might need further tuning once all tests actually pass. I believe no comments have been lost in the XML files this time, but I have removed the DOCTYPE entries as they were pointing to non-existing files and were likely unnecessary. Additionally, I have removed some committed MSH files and modified the associated Makefiles to generate the meshes during test execution; this came about because copying all longtests directories into tests and committing them excludes files with a .msh extension by default. Nonetheless, quite a few tests remain large and need attention. To know which ones, simply check out testharnessClean, cd into tests and execute du -sh * | sort -h.
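
As an illustration of the kind of Makefile change involved (the target and file names are placeholders, not taken from a specific test):

# Generate the mesh at test time instead of committing the .msh file.
input: src/mesh.msh

src/mesh.msh: src/mesh.geo
	gmsh -2 -o $@ $<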

Finally, I will probably not be able to figure out all the failures by myself, so help would be appreciated.

Patol75 commented 3 years ago

An additional note on the current failures reported by Actions: most, if not all, of the failing tests are in directories for which I removed the committed MSH files and included gmsh instructions in the Makefile. I have played around with mms_tracer_P1dg_cdg_diff_steady, for which the first element of ab_convergence_gal_stat must be between 1.8 and 2.2 for the first Pass test to succeed. When using the already committed MSH files, the value is 1.88. When generating the MSH files from the committed GEO files, the value is always lower than 1.8 (e.g. 1.75 or 1.68, depending on the GMSH version used). Visualising the MSH files, the only obvious difference I could spot is that all the committed files have fewer nodes than the generated ones. To give another example, the committed MSH file of viscosity_2d_p0_adaptive_parallel has a 20x11 regular grid node structure, while generating the MSH file through GMSH yields a 21x11 regular grid node structure. Am I missing something here in how to generate the MSH files?

tmbgreaves commented 3 years ago

Apologies for being silent on this one for so long, @Patol75 - chaos this end on multiple levels has pulled me away from Fluidity work. Currently aiming to refocus on Fluidity mid-September, assuming other work packages go as expected.

adigitoleo commented 3 years ago

An additional note on the current failures reported by Actions: most, if not all, of the failing tests are in directories for which I removed the committed MSH files and included gmsh instructions in the Makefile. [...]

A small comment on this: I've experienced significantly different mesh outputs for the same instructions across gmsh "minor" version changes, e.g. 4.4 vs 4.8. In other words, gmsh doesn't really use semantic versioning. Maybe the .msh files were created using a slightly older gmsh version? Another reason against committing .msh files.

Patol75 commented 2 years ago

Thank you for looking into these proposed changes to the testing script, @stephankramer.

Unfortunately, the changes here are rather messy, and I had kept this PR open mainly for discussion. I had created another branch that would likely be a better candidate for a merge, even though it is now quite outdated. The main problem left in that other branch was deciding what to do with the tests that have their MSH file(s) committed; most of them were hosted in the longtests repository. I guess there are three options: (i) keep the committed MSH files and potentially rename longtests to largetests, given that it would be hosting all these tests that result in a large directory size; (ii) generate GEO files that can reproduce the meshes described by the MSH files; (iii) drop the tests that have MSH files committed. A combination of (i) and (ii) is probably what we should aim for, but I am happy to have your thoughts about it.

Regarding testharness.py, given that it is basically the same file in both branches, I will address your review comments here and include some changes I made for the CMake branch. It will be easy to move it to any branch afterwards.

stephankramer commented 2 years ago

Yeah, I agree that we should probably go for a combination of (i) and (ii) (and possibly (iii) :-)). This also ties into the fact that we need to upgrade to some newer version of gmsh (the current base images in which the tests are run all have gmsh 2.6 hard-coded). This is an old discussion and comes back to the question of how exactly a test should be reproducible/deterministic. On the one hand, you would like to see any unintended consequence of a commit, so by that school of thought the test should fail even if an answer changes just a little bit. This can only be achieved by having exactly the same mesh as input (i.e. committing it), as the meshing process in general is not reproducible. On the other hand, we also want to test the meshing step as part of the CI (in an integrated sense!), so there you would have tests fail only if the answer changes significantly. The trouble is that such tests can be hard to write. So, as a sort of compromise, we have tended to go for a mix of committed and non-committed meshes. But yes, the size of these meshes is another practical concern.

I think what you and @tmbgreaves suggested - integrating most of the long tests back into the main repo so that they can be scheduled in the most efficient way, and keeping only some exceptional ones that are either very long or have large meshes (we could also run these less frequently, as we currently do with vlong, and/or drop some of them) - sounds like a good plan. One thing to keep in mind: if you do (accidentally) end up putting some very large files back in, even on a branch, and want to undo that later, make sure to rewrite history; otherwise everyone will still download them when they clone.
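
On the history point, one way to strip accidentally committed large files from a branch before others clone it is git-filter-repo (the path below is a placeholder):

# Rewrite history so the large directory never appears in any commit.
# git-filter-repo removes the 'origin' remote as a safety measure, so it has
# to be re-added before force-pushing the cleaned branch.
git filter-repo --invert-paths --path tests/some_huge_test/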

Patol75 commented 2 years ago

Thanks again for all the reviews; the code looks better now. Happy to get more feedback if you have time.

Regarding the version of GMSH used in the CI, I have actually used the latest GMSH in the CMake branch for all tests, and it is true that a large number of tests reported failures. Some of them behaved as expected when -format msh2 was additionally supplied on the GMSH command line, but only a few. At that point, from an outsider's perspective, it is hard to judge whether the Python variable values obtained through a test are still appropriate, in which case the thresholds can be updated, or whether they are not. In the latter situation, once again, it is not straightforward to determine whether the test was not robust enough against mesh changes or whether something failed silently in the building/execution process.
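
For reference, the msh2 fallback mentioned above amounts to something like the following (file names are placeholders):

# Ask a recent gmsh to write the legacy MSH 2 file format that older readers expect.
gmsh -2 -format msh2 -o mesh.msh mesh.geo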

Ideally, I think we would want to have only GEO files and accept that answers can change, whilst also having tests sufficiently robust that answers do not change too much. The problem is that this could mean quite a lot of work for the tests that currently fail with newer versions of GMSH. I say could because a lot of those tests look very similar to me; there could be a common strategy to make them all more robust.

My experience is that only very few tests currently lack a GEO file, and a further few fail to run on Actions because of resource limits (time and/or number of CPU cores). These would indeed be good candidates for the alternative repository, although some committed MSH files are not that big. Moreover, other tests have committed files other than MSH that are quite big, so they could also be legitimate candidates.

Finally, regarding your comment about pushing large files, I think there are three branches where it most likely happened: testharness, testharnessClean and cmake_build_2022. Would deleting the branches be sufficient once the changes are incorporated into main (or discarded)?