jeff-cohere closed this 3 years ago
Okay, I've reproduced the internal compiler error we were seeing earlier in the Docker container. Will experiment with some other base Docker images to get other compilers.
Okay @bishtgautam. Things are building and running quickly now (at last!). However, I think we need to regenerate some of our testing baselines, because we're now using older compilers that don't exhibit the internal compiler error for the Fortran files (as discussed in #145). These older compilers produce different results from the newer ones.
How would you like to proceed?
The `make test-mpfao` target fails, but we can't see the log files to tell how large the difference is compared to the default tolerance. It appears that the first failure in the sequence of tests causes the whole thing to abort. Is there a way to see how big the difference is between the current regression file and the baseline?
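For reference, a regression check of this kind typically compares each value in the new `.regression` file against the gold file with a relative tolerance. Here's a minimal sketch of such a comparison; the function name and the in-memory representation are illustrative, not the actual `regression_tests.py` implementation:

```python
def compare_values(gold, current, rel_tol=1e-5):
    """Compare two sequences of floats; report the largest relative
    difference and every entry that exceeds the tolerance."""
    worst = 0.0
    failures = []
    for i, (g, c) in enumerate(zip(gold, current)):
        denom = abs(g) if g != 0.0 else 1.0  # guard against division by zero
        diff = abs(g - c) / denom
        worst = max(worst, diff)
        if diff > rel_tol:
            failures.append((i, g, c, diff))
    return worst, failures

worst, failures = compare_values([1.0, 2.0, 3.0], [1.0, 2.0000001, 3.1])
print(len(failures))  # 1 value outside tolerance
```

Reporting `worst` alongside the pass/fail verdict would answer exactly the question above: how far outside the tolerance a failing test actually is.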
I added an option to allow tests to run even if one fails. The logs should appear now.
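A keep-going behavior like this usually amounts to collecting failures instead of aborting on the first one. A hypothetical sketch (not the actual `regression_tests.py` code):

```python
def run_suite(tests, run_test, keep_going=True):
    """Run every test, optionally continuing past failures.

    run_test(test) should return True on success, False on failure.
    Returns the list of failed tests.
    """
    failed = []
    for test in tests:
        if not run_test(test):
            failed.append(test)
            if not keep_going:
                break  # old behavior: abort on first failure
    return failed

# With keep_going=True, all tests run and every failure gets logged.
failures = run_suite(["a", "b", "c"], lambda t: t != "b")
print(failures)  # ['b']
```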
I believe all the failures are for parallel tests, e.g. `steady-wy-np3`. I think `mpiexec` is not getting picked up correctly, as the test is being run as `/bin/false -n 3 ...`
Ah, yep. I think I can fix that.
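For what it's worth, the `/bin/false -n 3` symptom is typical of a configure-style check that substitutes `false` when no MPI launcher is found. A defensive lookup in a test harness might look like this hypothetical helper (not the project's actual code):

```python
import shutil

def find_mpiexec(candidates=("mpiexec", "mpirun", "srun")):
    """Return the first MPI launcher found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        # A build system that failed to find a launcher may have
        # substituted /bin/false; treat that as "not found".
        if path and not path.endswith("/false"):
            return path
    return None

launcher = find_mpiexec()
if launcher is None:
    print("no MPI launcher found; skipping parallel tests")
```

Failing loudly (or skipping parallel tests) when no launcher is found is friendlier than silently running `false` and reporting a cryptic failure.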
Well now the tests seem to run, but a few of them are timing out. Strange.
Okay, here's the test log for the Richards driver test that's timing out:
/usr/bin/python3 regression_tests.py -e ../demo/richards/richards_driver --mpiexec "mpiexec" \
--suite standard standard_parallel standard_exodus standard_parallel_exodus \
--config-files ../demo/richards/richards.cfg \
--logfile_prefix richards
Test log file : test-richards-2021-02-26_23-44-45.testlog
Running regression tests :
...F
----------------------------------------------------------------------
Regression test summary:
Total run time: 345.445 [s]
Total tests : 4
Tests run : 4
Failed : 1
make: *** [makefile:92: test-richards] Error 1
Regression Test Log
Date : 2021-02-26_23-44-45
System Info :
platform : linux
Test directory :
/__w/TDycore/TDycore/regression_tests
Repository status :
----------------------------
$ git log -1 HEAD
commit c41f61ece496fb32a1a58851c08ef738717ae245
Author: Jeffrey N. Johnson <jeff@cohere-llc.com>
Date: Fri Feb 26 23:42:46 2021 +0000
Merge 183a5e2ef17e10c5b3748cd085ea4ca7fcc9f62e into ee7e9574904cb2b2e0378d0be913ad5e45becc9f
$ git status -u no
HEAD detached at pull/152/merge
nothing to commit, working tree clean
PETSc information :
-------------------
* WARNING * This information may be incorrect if you have more than one version of petsc installed.
PETSC_DIR : /usr/local/petsc/mpich-int32-real-opt
petsc repository status :
No git or hg directory was found in your PETSC_DIR
MPI information :
-----------------
$ mpiexec --version
HYDRA build details:
Version: 3.3.2
Release Date: Tue Nov 12 21:23:16 CST 2019
CC: gcc
CXX: g++
F77: gfortran
F90: gfortran
Configure options: '--disable-option-checking' '--prefix=NONE' '--disable-wrapper-rpath' '--with-device=ch3:nemesis' '--cache-file=/dev/null' '--srcdir=.' 'CC=gcc' 'CFLAGS= -O2' 'LDFLAGS=' 'LIBS=' 'CPPFLAGS= -I/build/mpich-3.3.2/src/mpl/include -I/build/mpich-3.3.2/src/mpl/include -I/build/mpich-3.3.2/src/openpa/src -I/build/mpich-3.3.2/src/openpa/src -D_REENTRANT -I/build/mpich-3.3.2/src/mpi/romio/include' 'MPLLIBNAME=mpl'
Process Manager: pmi
Launchers available: ssh rsh fork slurm ll lsf sge manual persist
Topology libraries available: hwloc
Resource management kernels available: user slurm ll lsf sge pbs cobalt
Checkpointing libraries available:
Demux engines available: poll select
================================================================================
WARNING : richards.cfg : Skipping requested suite 'standard_exodus' (not present, misspelled or empty).
WARNING : richards.cfg : Skipping requested suite 'standard_parallel_exodus' (not present, misspelled or empty).
Running tests from 'richards.cfg':
------------------------------------------------------------------------------------------
richards-driver-snes-prob1...
cd /__w/TDycore/TDycore/demo/richards
/__w/TDycore/TDycore/demo/richards/richards_driver -malloc 0 -successful_exit_code 0 -dim 3 -Nx 2 -Ny 2 -Nz 2 -tdy_water_density exponential -tdy_regression_test -tdy_regression_test_num_cells_per_process 2 -tdy_regression_test_filename richards-driver-snes-prob1 -final_time 3.1536e3 -dt_max 600. -dt_growth_factor 1.5 -tdy_timers -tdy_init_with_random_field -time_integration_method SNES
# richards-driver-snes-prob1 : run time : 0.10 seconds
diff richards-driver-snes-prob1.regression.gold richards-driver-snes-prob1.regression
richards-driver-snes-prob1... passed.
----------------------------------------
richards-driver-ts-prob1...
cd /__w/TDycore/TDycore/demo/richards
/__w/TDycore/TDycore/demo/richards/richards_driver -malloc 0 -successful_exit_code 0 -dim 3 -Nx 2 -Ny 2 -Nz 2 -tdy_water_density exponential -tdy_regression_test -tdy_regression_test_num_cells_per_process 2 -tdy_regression_test_filename richards-driver-ts-prob1 -final_time 3.1536e3 -dt_max 600. -dt_growth_factor 1.5 -tdy_init_with_random_field -time_integration_method TS
# richards-driver-ts-prob1 : run time : 0.70 seconds
diff richards-driver-ts-prob1.regression.gold richards-driver-ts-prob1.regression
richards-driver-ts-prob1... passed.
----------------------------------------
richards-driver-snes-prob1-np4...
cd /__w/TDycore/TDycore/demo/richards
mpiexec -n 4 /__w/TDycore/TDycore/demo/richards/richards_driver -malloc 0 -successful_exit_code 0 -dim 3 -Nx 2 -Ny 2 -Nz 2 -tdy_water_density exponential -tdy_regression_test -tdy_regression_test_num_cells_per_process 1 -tdy_regression_test_filename richards-driver-snes-prob1-np4 -final_time 3.1536e3 -dt_max 600. -dt_growth_factor 1.5 -tdy_init_with_random_field -time_integration_method SNES
# richards-driver-snes-prob1-np4 : run time : 44.47 seconds
diff richards-driver-snes-prob1-np4.regression.gold richards-driver-snes-prob1-np4.regression
richards-driver-snes-prob1-np4... passed.
----------------------------------------
richards-driver-ts-prob1-np4...
cd /__w/TDycore/TDycore/demo/richards
mpiexec -n 4 /__w/TDycore/TDycore/demo/richards/richards_driver -malloc 0 -successful_exit_code 0 -dim 3 -Nx 2 -Ny 2 -Nz 2 -tdy_water_density exponential -tdy_regression_test -tdy_regression_test_num_cells_per_process 1 -tdy_regression_test_filename richards-driver-ts-prob1-np4 -final_time 3.1536e3 -dt_max 600. -dt_growth_factor 1.5 -tdy_timers -tdy_init_with_random_field -time_integration_method TS
ERROR: job 'richards-driver-ts-prob1-np4' has exceeded timeout limit of 300.0 seconds.
# richards-driver-ts-prob1-np4 : run time : 300.16 seconds
FAIL: could not find regression test file 'richards-driver-ts-prob1-np4.regression'. Please check the standard output file for errors.
richards-driver-ts-prob1-np4... failed.
--------------------------------------------------
richards.cfg : 4 tests : 1 tests failed, 3 tests passed
----------------------------------------------------------------------
Regression test file summary:
/__w/TDycore/TDycore/demo/richards/richards.cfg... 4 tests : 1 tests failed, 3 passed.
----------------------------------------------------------------------
Regression test summary:
Total run time: 345.445 [s]
Total tests : 4
Tests run : 4
Failed : 1
The `richards_driver` test above (`richards-driver-ts-prob1-np4`) seems to be running fine, but it's so slow that it's timing out.
It seems like this might occur because it's a 4-processor problem running on pipsqueak hardware (it really seems like the jobs are limited to one or two cores). The `th-driver-ts-prob1-np4` test also runs out of time, for example.
On the other hand, the `richards-driver-snes-prob1-np4` test seems to finish in a timely fashion. But that's a much shorter-running test, and even its execution time is pretty pokey.
Maybe we should scale back to 2-processor tests and/or limit the number of time steps to prevent this from happening.
@jeff-cohere We can try increasing the max time for tests. The `_timeout` is set to 60 sec in regression_tests.py#L113, and I can't figure out where the limit is getting set to 300 sec.
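If the 300-second limit isn't the `_timeout` default in the script, it may be coming from a per-test or per-suite setting in the `.cfg` files, or from a command-line option. Test runners usually resolve such settings with a precedence of command line over config file over built-in default; a sketch of that pattern, with hypothetical names:

```python
def effective_timeout(default=60.0, config_timeout=None, cli_timeout=None):
    """Resolve a test timeout: command line beats config file beats default."""
    if cli_timeout is not None:
        return float(cli_timeout)
    if config_timeout is not None:
        return float(config_timeout)
    return float(default)

print(effective_timeout())                    # 60.0 (built-in default)
print(effective_timeout(config_timeout=300))  # 300.0 (a cfg setting could explain the limit)
```

Grepping the `.cfg` files and the argument parser for a timeout keyword would confirm which layer is supplying the 300.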
Hey Gautam!
I'm more concerned about the fact that these tests seem to run a lot slower (or even hang, maybe) in GitHub's CI environment but not in our own. These timeouts/hangs kind of defeat the purpose of this PR, which is to accelerate the build/test process.
The biggest difference between our and GitHub's environment is that we're enabling code coverage on GitHub. Recent versions of GFortran have a bug that causes an internal compiler error when code coverage is enabled (see #145), which forced us to use older compiler versions. But now it seems like these older compilers produce executables that run tests noticeably slower. What a pain!
One solution might be to wait for the Ubuntu 21.04 release next month, which will likely ship GCC/GFortran 11 and, presumably, a fix for this internal compiler error. If so, we can use the latest compilers and see whether they fix these hangs. The question is whether we can wait until next month with the build/test times we're seeing. There's not a ton of PR traffic right now, so maybe this is the easiest way to go.
In that case, let's wait for Ubuntu 21.04.
Unless you have inside info, it seems all their testing is still on gcc-10 and it's getting late for an upgrade. https://packages.ubuntu.com/search?suite=hirsute&searchon=names&keywords=gcc I think the gcc docker images will ship gcc-11 as soon as it's released.
Good point. No insider info, just wishful thinking. My point is mostly that this PR is turning into a game of "find the working Fortran compiler," so it may be best to wait.
Okay, finally. With @bishtgautam's PR in, we're no longer measuring code coverage on the problematic Fortran tests, and I've worked through the remaining issues with the Docker image.
This PR intends to speed up our builds by using a Docker image that is tailor-made for the project. The tools used to generate this Docker image live in the `tools/` directory. We were discussing a workflow that would let us automatically generate a Docker image on request (somehow), but this gets us most of the way there. We can automate further if the tools here aren't convenient enough.
For now, I've created a Docker image and stuck it in my own DockerHub account (`coherellc`). It's very easy to change where we store this, and I think we could also use GitHub's Docker registry, though it's very new and still in beta.

Closes #142