Closed: ecoon closed this issue 7 months ago
If I may chime in on this discussion: while working on the pipe model, I had to change the way the boundary conditions are handled to ensure mass conservation. With that change, the lake-at-rest test is now failing during continuous integration. @nvohra0016, we can talk about it at our meeting Thursday, but I thought it was relevant to mention here.
The tests are actually fast. On OSX, I am getting this:
1/6 Test #1: shallow_water_1D ................. Passed 1.93 sec
2/6 Test #2: shallow_water_2D_smooth .......... Passed 0.79 sec
3/6 Test #3: shallow_water_bathymetry ......... Passed 0.55 sec
4/6 Test #4: shallow_water_lake_at_rest ....... Passed 1.48 sec
5/6 Test #5: shallow_water_Thacker ............ Passed 6.61 sec
6/6 Test #6: shallow_water_dry_bed ............ Passed 2.91 sec
@gcapodag Thank you for the heads up!
@stokareva @lipnikov The tests that take the longest are lake_at_rest and dry_bed, both of which use two meshes: hexagonal and triangular/rectangular. One way around the issue is to simply run both tests on hexagonal meshes alone (which covers the general polygonal case).
It's interesting they are so much faster on OSX. @nvohra0016, how many cores are these tests using? (Note: for CI on GitHub it's bad to use more than two cores with our current docker builds.) Also, could we use a coarser mesh, cut the run off at an earlier time, or start from a checkpoint file so that running to the end is shorter?
@jd-moulton All the SW tests are using 3 processors currently. I can reduce that to 2, and also work on using a coarser mesh.
So the numbers I was quoting were from CI. Are they parallel? I know we've seen slow MPI problems before.
(Edit: just saw David's comment after replying... what he said!)
Running on only two cores (instead of three) and with a coarser mesh should fix things. Let's see how much that helps at least.
Running with only 2 cores is a bad strategy, since it eliminates complex mesh partitions. We need coarser meshes for some tests, but we should run them with more cores.
GitHub does have some 3 CPU runners in the free tier, but they run macOS: https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners
Performance of flow_darcy_mics gives no hope: 0.29 s on 2 cores and 121 s on 4 cores.
Oversubscribing cores in GitHub Actions seems to work when it is outside docker. I think it's worth trying to figure out what's going wrong with our docker container in this case; maybe there's a way to make it work. Or maybe we can switch GitHub Actions over to a native build (which would need caching for the TPLs).
Recognizing that this is treating the symptom of "CI takes too long" rather than solving the underlying problem here, what about splitting the test suite across multiple runners?
I tried a quick version of this in GHA #1792 and it cut CI time by 20 minutes; with tests better balanced across the two (or more) runners, we could probably reduce it by 30-40 minutes overall.
Anything to speed up testing is helpful. I would appreciate you giving it a try. As the CI stands now it just runs ctest with no options. It would be easy to add options using our labels or test numbers etc., to control which tests run in which runner.
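To illustrate the kind of option mentioned above, here is a minimal sketch of slicing the suite between runners with ctest's stride selector (-I Start,End,Stride); the variables RUNNER_INDEX and NUM_RUNNERS are hypothetical matrix inputs, not our actual workflow, and -L with test labels would work just as well:

```shell
#!/bin/sh
# Hypothetical sketch: split the ctest suite across CI runners using
# ctest's stride selection. RUNNER_INDEX and NUM_RUNNERS are assumed
# to come from the GitHub Actions matrix.
RUNNER_INDEX=${RUNNER_INDEX:-1}   # 1-based index of this runner
NUM_RUNNERS=${NUM_RUNNERS:-2}     # total number of runners in the matrix

# With -I Start,,Stride, runner 1 gets tests 1,3,5,... and
# runner 2 gets tests 2,4,6,...
CTEST_ARGS="-I ${RUNNER_INDEX},,${NUM_RUNNERS} --output-on-failure"
echo "would run: ctest $CTEST_ARGS"
```

A label-based split (ctest -L) has the advantage that long-running tests can be grouped explicitly, at the cost of maintaining the labels.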
ctest -j 2 requires only 4 minutes and 20 seconds on a modern laptop, and that is with all 266 tests. CI is way behind.
@lipnikov are you compiling on macOS with openmpi or mpich?
Found this today in the mpich FAQ: https://github.com/pmodels/mpich/blob/main/doc/wiki/faq/Frequently_Asked_Questions.md#q-why-does-my-mpi-program-run-much-slower-when-i-use-more-processes, which seems to suggest CI might run faster if we: a) compiled mpich in the TPLs container with the --with-device=ch3:sock option, or b) tried openmpi.
(As an update: compiling TPLs/amanzi with mpich configured with --with-device=ch3:sock reduces test time to 7 minutes; the most recent CI run compiled the TPLs and amanzi and ran the tests faster than the current average CI run.)
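For reference, a hedged sketch of what the mpich build step in the TPLs container might look like with the FAQ's suggestion applied; the install prefix and make parallelism here are illustrative, not the actual Dockerfile contents:

```shell
# Illustrative only: build mpich with the ch3:sock (sockets) device,
# which the mpich FAQ recommends when ranks oversubscribe cores.
./configure --prefix=/usr/local/mpich --with-device=ch3:sock
make -j2 && make install
```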
I was using openmpi in the CI initially. I switched to mpich because it initially seemed to solve the oversubscription problem, but sadly it didn't. If compiling our own mpich with special options works, that's great.
We could also try openmpi again. The only catch there is it explicitly needs the --oversubscribe option passed to mpirun/mpiexec.
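If we do revisit openmpi, a small sketch of the two equivalent ways to allow oversubscription (the test binary name is a placeholder):

```shell
#!/bin/sh
# Option 1: pass the flag explicitly on each launch, e.g.
#   mpirun --oversubscribe -np 4 ./shallow_water_test   (placeholder binary)
# Option 2: export the equivalent MCA parameter once, so every
# mpirun/mpiexec invocation that ctest spawns inherits it:
export OMPI_MCA_rmaps_base_oversubscribe=1
echo "MCA oversubscribe: $OMPI_MCA_rmaps_base_oversubscribe"
```

The environment-variable form is convenient for CI, since it avoids editing every ADD_PARALLEL-style test invocation.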
I think compiling our own mpich should work and resolve this issue - I'll try it with a new TPL build this evening. #781 has an updated Dockerfile that should still allow precompiled openmpi to be used, but will compile mpich if chosen as the MPI library.
@jd-moulton @ecoon Updated TPLs with our own version of mpich now being available on dockerhub. I have tested a few times without any problems, but the previous version is available as metsi/amanzi-tpls:0.98.7-mpich-bak if there are any issues.
Seems to cut CI time roughly in half for both amanzi and ats in the handful of times the CI action has run. Shallow water tests now take about a minute.
@rfiorella This looks great. A quick manual comparison of a couple of tests looks like parallel execution time is reduced by about a factor of 100. For example, recent runs look like:
So about 100 times faster. If we look at the summary information for the parallel label, we see
This is about a factor of 23, and if you considered a 4-processor test, then this too is about 100. So I think we can consider the MPI communication bottleneck solved. Nice work.
Lastly, this is very close to what we need to run on NERSC under their Shifter tool. At least it starts to address the mpich build; we just need to figure out the LD_LIBRARY_PATH part. It would be great if we could do that with a couple of arguments passed into the Dockerfile, rather than having to maintain multiple Dockerfiles.
@rfiorella @jd-moulton Since performance is no longer a bottleneck, can we build and run more Amanzi unit tests, 266 instead of 244 now?
@lipnikov, I agree we should add the full set of regression tests back in and see how long it takes.
@jd-moulton I was talking about unit tests. Regression tests may take 1-2 hours to run.
@lipnikov My mistake; I guess I didn't realize we were dropping some unit/integration tests. Yes, we should definitely add those back in. What tests have we been dropping?
For the regression tests, it depends on how many cores and runners, but yes, it could add an hour or more, so maybe that's something we run separately each night?
I think most of the missing tests are provided by the structured component of Amanzi. Change --disable-structured to --enable-structured in Dockerfile-Amanzi.
We could enable structured now that the builds are working well under GitHub actions. Structured was disabled because of problems with the builds under Travis CI.
Change has been made on rfiorella/ctest-split. I'll take a look at the balance between runners after it's done and update the PR as needed.
@jd-moulton @lipnikov Using --enable-structured adds 2.5 GB to the compressed docker image, and the uncompressed image size grows from 3 to 18.1 GB (!). Is this expected?
Yikes, not really expected. The libraries it uses are already there, so I am not sure why it is that large. Does each test have its own binary? Could it be linking statically? Is it related to the Fortran *.mod files? If you can check the image to see where the bloat lives, maybe we can control it. Alternatively, we could exclude it from the build we save on Dockerhub and just build it as part of running the structured tests (we could give them a label).
The size of the structured build directory is just 32M on my system with all tests included.
The size jump in the CI image does appear to be related to the -g compiler flag that gets appended to CMAKE_CXX_FLAGS, I think through tools/cmake/CCSEOptions.cmake, since commenting out L. 66-68 in that file resolves the issue.
Is this compiler option still necessary? (Or was it maybe added for a past TPL build?)
We could make this an option, but maybe it's covered by CMake build types. Specifically, I think -g gets added automatically for the CMake build type "release with debug info" (note the build types: Debug, Release, RelWithDebInfo, and MinSizeRel). So if we can verify that RelWithDebInfo does add -g, we could drop forcing it in the release case for structured.
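For reference, these are the per-build-type defaults CMake uses with GCC/Clang, which is why RelWithDebInfo already carries -g on its own (flags shown are the stock compiler defaults, not anything Amanzi-specific):

```shell
# Default C++ flags CMake applies per build type (GCC/Clang):
#   CMAKE_CXX_FLAGS_DEBUG          = -g
#   CMAKE_CXX_FLAGS_RELEASE        = -O3 -DNDEBUG
#   CMAKE_CXX_FLAGS_RELWITHDEBINFO = -O2 -g -DNDEBUG
#   CMAKE_CXX_FLAGS_MINSIZEREL     = -Os -DNDEBUG
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..   # compiles with -O2 -g -DNDEBUG
```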
RelWithDebInfo does add -g correctly. Commenting out those lines in tools/cmake/CCSEOptions.cmake and building the image with RelWithDebInfo yields a build with -g, and the compressed image size is 3 GB. Building Release (without RelWithDebInfo) removes the -g flag and yields an image closer to 1 GB compressed.
Shall we move ahead with no longer forcing it in release case for structured? If so, I'll clean up my PR and include this in the final PR to be merged.
@rfiorella, yes, this sounds good. It's better to use the mechanism provided by CMake than to make it a special feature of our build. I think we should move ahead, no longer force it in the release case for structured, and add a note in bootstrap about the behavior of this build type.
Please take a look at the shallow water tests: do we really need to run them for this long? Maybe restart files could be used if you need to capture key periods in time, or we could run a shorter time window or a smaller mesh? These tests take a very large fraction of our CI runtime: