Closed: ecoon closed this issue 7 months ago
If I may chime in on this discussion: while working on the pipe model, I had to change the way the boundary conditions are handled to ensure mass conservation. With that change, the lake-at-rest test is now failing during continuous integration. @nvohra0016, we can talk about it at our meeting Thursday, but I thought it was relevant to mention here.
The tests are actually fast. On OSX, I am getting this:
1/6 Test #1: shallow_water_1D ................. Passed 1.93 sec
2/6 Test #2: shallow_water_2D_smooth .......... Passed 0.79 sec
3/6 Test #3: shallow_water_bathymetry ......... Passed 0.55 sec
4/6 Test #4: shallow_water_lake_at_rest ....... Passed 1.48 sec
5/6 Test #5: shallow_water_Thacker ............ Passed 6.61 sec
6/6 Test #6: shallow_water_dry_bed ............ Passed 2.91 sec
@gcapodag Thank you for the heads up!
@stokareva @lipnikov The tests that take the longest are lake_at_rest and dry_bed, both of which use two meshes: hexagonal and triangular/rectangular. One way around the issue is to simply run both tests on hexagonal meshes alone (which covers the general polygonal case).
It's interesting they are so much faster on OSX. @nvohra0016, how many cores are these tests using? (Note: for CI on GitHub it's bad to use more than two cores with our current docker builds.) Also, could we use a coarser mesh, cut the run off at an earlier time, or start from a checkpoint file so that running to the end is shorter?
@jd-moulton All the SW tests are using 3 processors currently. I can reduce that to 2, and also work on using a coarser mesh.
So the numbers I was quoting were from CI. Are they parallel? I know we've seen slow MPI problems before.
(Edit: just saw David's comment after replying... what he said!)
Running on only two cores (instead of three) and with a coarser mesh should fix things. Let's see how much that helps at least.
Running with only 2 cores is a bad strategy, since it eliminates complex mesh partitions. We need coarser meshes for some tests, but we should run them with more cores.
GitHub does have some 3 CPU runners in the free tier, but they run macOS: https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners
Performance of flow_darcy_mics gives no hope: 0.29 s on 2 cores and 121 s on 4 cores.
Oversubscribing cores in GitHub Actions seems to work when it is outside docker. I think it's worth trying to figure out what's going wrong with our docker container in this case; maybe there's a way to make it work. Or maybe we can switch GitHub Actions over to a native build (which would need caching for the TPLs).
Recognizing that this is treating the symptom of "CI takes too long" rather than solving the underlying problem here, what about splitting the test suite across multiple runners?
I tried a quick version of this in GHA #1792 and it cut CI time by 20 minutes; with tests better balanced across the two (or more) runners, we could probably reduce it by 30-40 minutes overall.
Anything to speed up testing is helpful. I would appreciate you giving it a try. As the CI stands now it just runs ctest with no options. It would be easy to add options using our labels or test numbers etc., to control which tests run in which runner.
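To illustrate the kind of option mentioned above, here is a minimal sketch of slicing the suite between runners with ctest's stride selector (-I Start,End,Stride); the variables RUNNER_INDEX and NUM_RUNNERS are hypothetical matrix inputs, not our actual workflow, and -L with test labels would work just as well:

```shell
#!/bin/sh
# Hypothetical sketch: split the ctest suite across CI runners using
# ctest's stride selection. RUNNER_INDEX and NUM_RUNNERS are assumed
# to come from the GitHub Actions matrix.
RUNNER_INDEX=${RUNNER_INDEX:-1}   # 1-based index of this runner
NUM_RUNNERS=${NUM_RUNNERS:-2}     # total number of runners in the matrix

# With -I Start,,Stride, runner 1 gets tests 1,3,5,... and
# runner 2 gets tests 2,4,6,...
CTEST_ARGS="-I ${RUNNER_INDEX},,${NUM_RUNNERS} --output-on-failure"
echo "would run: ctest $CTEST_ARGS"
```

A label-based split (ctest -L) has the advantage that long-running tests can be grouped explicitly, at the cost of maintaining the labels.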
ctest -j 2 requires only 4 minutes and 20 seconds on a modern laptop, and that is with all 266 tests. CI is way behind.
@lipnikov are you compiling on macOS with openmpi or mpich?
Found this today in the mpich FAQ: https://github.com/pmodels/mpich/blob/main/doc/wiki/faq/Frequently_Asked_Questions.md#q-why-does-my-mpi-program-run-much-slower-when-i-use-more-processes, which seems to suggest CI might run faster if we: a) compiled mpich in the TPLs container with the --with-device=ch3:sock option, or b) tried openmpi.
(As an update: compiling TPLs/amanzi with mpich configured with --with-device=ch3:sock reduces test time to 7 minutes; the most recent CI run compiled the TPLs and amanzi and ran the tests faster than the current average CI run.)
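For reference, a hedged sketch of what the mpich build step in the TPLs container might look like with the FAQ's suggestion applied; the install prefix and make parallelism here are illustrative, not the actual Dockerfile contents:

```shell
# Illustrative only: build mpich with the ch3:sock (sockets) device,
# which the mpich FAQ recommends when ranks oversubscribe cores.
./configure --prefix=/usr/local/mpich --with-device=ch3:sock
make -j2 && make install
```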
I was using openmpi in the CI initially. I switched to mpich because it initially seemed to solve the oversubscription problem, but sadly it didn't. If compiling our own mpich with special options works, that's great.
We could also try openmpi again. The only catch there is it explicitly needs the --oversubscribe option passed to mpirun/mpiexec.
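If we do revisit openmpi, a small sketch of the two equivalent ways to allow oversubscription (the test binary name is a placeholder):

```shell
#!/bin/sh
# Option 1: pass the flag explicitly on each launch, e.g.
#   mpirun --oversubscribe -np 4 ./shallow_water_test   (placeholder binary)
# Option 2: export the equivalent MCA parameter once, so every
# mpirun/mpiexec invocation that ctest spawns inherits it:
export OMPI_MCA_rmaps_base_oversubscribe=1
echo "MCA oversubscribe: $OMPI_MCA_rmaps_base_oversubscribe"
```

The environment-variable form is convenient for CI, since it avoids editing every ADD_PARALLEL-style test invocation.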
I think compiling our own mpich should work and resolve this issue - I'll try it with a new TPL build this evening. #781 has an updated Dockerfile that should still allow precompiled openmpi to be used, but will compile mpich if chosen as the MPI library.
@jd-moulton @ecoon Updated TPLs with our own version of mpich now being available on dockerhub. I have tested a few times without any problems, but the previous version is available as metsi/amanzi-tpls:0.98.7-mpich-bak if there are any issues.
Seems to cut CI time roughly in half for both amanzi and ats in the handful of times the CI action has run. Shallow water tests now take about a minute.
@rfiorella This looks great. A quick manual comparison of a couple of tests looks like parallel execution time is reduced by about a factor of 100. For example, recent runs look like:
So about 100 times faster. If we look at the summary information for the parallel label, we see
This is about a factor of 23, and if you considered a 4-processor test, then this too is about 100. So I think we can consider the MPI communication bottleneck solved. Nice work.
Lastly, this is very close to what we need to run on NERSC under their Shifter tool. At least it starts to address the mpich build; we just need to figure out the LD_LIBRARY_PATH part. It would be great if we could do that with a couple of arguments passed into the Dockerfile, rather than having to maintain multiple Dockerfiles.
@rfiorella @jd-moulton Since performance is no longer a bottleneck, can we build and run more Amanzi unit tests, 266 instead of 244 now?
@lipnikov, I agree we should add the full set of regression tests back in and see how long it takes.
@jd-moulton I was talking about unit tests. Regression tests may take 1-2 hours to run.
@lipnikov My mistake; I guess I didn't realize we were dropping some unit/integration tests. Yes, we should definitely add those back in. What tests have we been dropping?
For the regression tests, it depends on how many cores and runners, but yes, it could add an hour or more, so maybe that's something we run separately each night?
I think most of the missing tests are provided by the structured component of Amanzi. Change --disable-structured to --enable-structured in Dockerfile-Amanzi.
We could enable structured now that the builds are working well under GitHub actions. Structured was disabled because of problems with the builds under Travis CI.
Change has been made on rfiorella/ctest-split. I'll take a look at the balance between runners after it's done and update the PR as needed.
@jd-moulton @lipnikov Using --enable-structured adds 2.5 GB to the compressed docker image, and the uncompressed image size grows from 3 to 18.1 GB (!). Is this expected?
Yikes, not really expected. The libraries it uses are already there, so I am not sure why it is that large. Does each test have its own binary? Could it be linking statically? Is it related to the Fortran *.mod files? If you can check the image to see where the bloat lives, maybe we can control it. Alternatively, we could exclude it from the build we save on Dockerhub and just build it as part of running the structured tests (we could give them a label).
The size of the structured build directory is just 32M on my system with all tests included.
The size jump in the CI image does appear to be related to the -g compiler flag that gets appended to CMAKE_CXX_FLAGS, I think through tools/cmake/CCSEOptions.cmake, since commenting out L. 66-68 in that file resolves the issue.
Is this compiler option still necessary? (Or was it maybe added for a past TPL build?)
We could make this an option, but maybe it's covered by CMake build types. Specifically, I think -g gets added automatically for the CMake build type "release with debug info" (note the build types: Debug, Release, RelWithDebInfo, and MinSizeRel). So if we can verify that RelWithDebInfo does add -g, we could drop forcing it in the release case for structured.
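For reference, these are the per-build-type defaults CMake uses with GCC/Clang, which is why RelWithDebInfo already carries -g on its own (flags shown are the stock compiler defaults, not anything Amanzi-specific):

```shell
# Default C++ flags CMake applies per build type (GCC/Clang):
#   CMAKE_CXX_FLAGS_DEBUG          = -g
#   CMAKE_CXX_FLAGS_RELEASE        = -O3 -DNDEBUG
#   CMAKE_CXX_FLAGS_RELWITHDEBINFO = -O2 -g -DNDEBUG
#   CMAKE_CXX_FLAGS_MINSIZEREL     = -Os -DNDEBUG
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..   # compiles with -O2 -g -DNDEBUG
```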
RelWithDebInfo does add -g correctly. Commenting out those lines in tools/cmake/CCSEOptions.cmake and building the image with RelWithDebInfo yields a build with -g, and the compressed image size is 3 GB. Building Release (without RelWithDebInfo) removes the -g flag and yields an image closer to 1 GB compressed.
Shall we move ahead with no longer forcing it in release case for structured? If so, I'll clean up my PR and include this in the final PR to be merged.
@rfiorella, yes, this sounds good. It's better to use the mechanism provided by CMake than to make it a special feature of our build. I think we should move ahead, no longer force it in the release case for structured, and add a note in bootstrap about the behavior of this build type.
Please take a look at the shallow water tests: do we really need to run them for this long? Maybe restart files could be used if you need to capture key periods in time, or we could run a shorter time window or a smaller mesh? These tests take a very large fraction of our CI runtime: