TDycores-Project / TDycore

BSD 2-Clause "Simplified" License

Adding tools for creating and pushing Docker images #152

Closed. jeff-cohere closed this 3 years ago.

jeff-cohere commented 3 years ago

This PR intends to speed up our builds by using a Docker image that is tailor-made for the project. The tools used to generate this Docker image live in the tools/ directory.

We were discussing a workflow that would let us automatically generate a Docker image on request (somehow), but this gets us most of the way there. We can automate it further if the tools here aren't convenient enough.

For now I've created a Docker image and stuck it in my own DockerHub account (coherellc). It's very easy to change where we store this, and I think we can also use GitHub's Docker registry, though it's very new and still in beta.
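For reference, here's a minimal sketch of building an image from the tools/ directory and pushing it to DockerHub; the repository name (coherellc/tdycore) and tag are illustrative assumptions, and the actual scripts in tools/ are authoritative:

```
# Build the image from the project's Docker tooling, authenticate, and push.
# The image name and tag here are assumptions for illustration.
docker build -t coherellc/tdycore:latest tools/
docker login
docker push coherellc/tdycore:latest
```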

Closes #142

jeff-cohere commented 3 years ago

Okay, I've reproduced the internal compiler error we were seeing earlier in the Docker container. Will experiment with some other base Docker images to get other compilers.

jeff-cohere commented 3 years ago

Okay @bishtgautam. Things are building and running quickly now (at last!). However, I think we need to regenerate some of our testing baselines, because we're now using older compilers that don't exhibit the internal compiler error for the Fortran files (as discussed in #145). These old compilers produce different results from the newer ones.
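In case it's useful, here's a low-tech sketch of refreshing one baseline by hand, using the gold/current file pair the harness diffs (file names taken from the test log further down this thread); a built-in update mode in the harness, if one exists, would be preferable:

```
# Overwrite the stored baseline with a vetted current result. The
# .regression/.regression.gold pairing follows the diffs in the test log.
cd demo/richards
cp richards-driver-ts-prob1.regression richards-driver-ts-prob1.regression.gold
```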

How would you like to proceed?

bishtgautam commented 3 years ago

The make test-mpfao target fails, but we can't see the log files to tell how large the difference is compared to the default tolerance. It appears that the first failure in the sequence of tests causes the whole thing to abort. Is there a way to see how big the difference is between the current regression file and the baseline?

jeff-cohere commented 3 years ago

I added an option to allow tests to run even if one fails. The logs should appear now.
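As a hedged aside, make itself offers similar keep-going behavior, which is one way to get the remaining logs even when an early suite fails; this is an illustration of the general idea, not necessarily how the new option is wired in:

```
# make's -k ("keep going") flag continues with the remaining targets after a
# recipe fails, so later test suites still run and write their logs.
make -k test-mpfao
```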

bishtgautam commented 3 years ago

I believe all the failures are for parallel tests, e.g. steady-wy-np3. I think mpiexec is not getting picked up correctly, as the tests are being run as /bin/false -n 3 ...
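A hedged way to confirm where the bogus launcher comes from, assuming PETSc's standard petscvariables layout (the PETSC_DIR is the one reported in the test log below):

```
# PETSc records the launcher chosen at configure time; an install configured
# with --with-mpiexec=/bin/false (common for batch-only machines) shows up
# exactly like this.
grep MPIEXEC /usr/local/petsc/mpich-int32-real-opt/lib/petsc/conf/petscvariables
```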

jeff-cohere commented 3 years ago

Ah, yep. I think I can fix that.

jeff-cohere commented 3 years ago

Well, now the tests seem to run, but a few of them are timing out. Strange.

jeff-cohere commented 3 years ago

Okay, here's the test log for the Richards driver test that's timing out:

```
/usr/bin/python3 regression_tests.py -e ../demo/richards/richards_driver --mpiexec "mpiexec" \
    --suite standard standard_parallel standard_exodus standard_parallel_exodus \
    --config-files ../demo/richards/richards.cfg \
    --logfile_prefix richards
  Test log file : test-richards-2021-02-26_23-44-45.testlog
Running regression tests :
...F
----------------------------------------------------------------------
Regression test summary:
    Total run time: 345.445 [s]
    Total tests : 4
    Tests run : 4
    Failed : 1

make: *** [makefile:92: test-richards] Error 1
Regression Test Log
Date : 2021-02-26_23-44-45
System Info :
    platform : linux
Test directory : 
    /__w/TDycore/TDycore/regression_tests

Repository status :
----------------------------
$ git log -1 HEAD
commit c41f61ece496fb32a1a58851c08ef738717ae245
Author: Jeffrey N. Johnson <jeff@cohere-llc.com>
Date:   Fri Feb 26 23:42:46 2021 +0000
    Merge 183a5e2ef17e10c5b3748cd085ea4ca7fcc9f62e into ee7e9574904cb2b2e0378d0be913ad5e45becc9f
$ git status -u no
HEAD detached at pull/152/merge
nothing to commit, working tree clean

PETSc information :
-------------------
* WARNING * This information may be incorrect if you have more than one version of petsc installed.
    PETSC_DIR : /usr/local/petsc/mpich-int32-real-opt
    petsc repository status :
    No git or hg directory was found in your PETSC_DIR

MPI information :
-----------------
$ mpiexec --version
HYDRA build details:
    Version:                                 3.3.2
    Release Date:                            Tue Nov 12 21:23:16 CST 2019
    CC:                              gcc    
    CXX:                             g++    
    F77:                             gfortran   
    F90:                             gfortran   
    Configure options:                       '--disable-option-checking' '--prefix=NONE' '--disable-wrapper-rpath' '--with-device=ch3:nemesis' '--cache-file=/dev/null' '--srcdir=.' 'CC=gcc' 'CFLAGS= -O2' 'LDFLAGS=' 'LIBS=' 'CPPFLAGS= -I/build/mpich-3.3.2/src/mpl/include -I/build/mpich-3.3.2/src/mpl/include -I/build/mpich-3.3.2/src/openpa/src -I/build/mpich-3.3.2/src/openpa/src -D_REENTRANT -I/build/mpich-3.3.2/src/mpi/romio/include' 'MPLLIBNAME=mpl'
    Process Manager:                         pmi
    Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
    Topology libraries available:            hwloc
    Resource management kernels available:   user slurm ll lsf sge pbs cobalt
    Checkpointing libraries available:       
    Demux engines available:                 poll select

================================================================================
WARNING : richards.cfg : Skipping requested suite 'standard_exodus' (not present, misspelled or empty).
WARNING : richards.cfg : Skipping requested suite 'standard_parallel_exodus' (not present, misspelled or empty).
Running tests from 'richards.cfg':
------------------------------------------------------------------------------------------
richards-driver-snes-prob1... 
    cd /__w/TDycore/TDycore/demo/richards
    /__w/TDycore/TDycore/demo/richards/richards_driver -malloc 0 -successful_exit_code 0 -dim 3 -Nx 2 -Ny 2 -Nz 2 -tdy_water_density exponential -tdy_regression_test -tdy_regression_test_num_cells_per_process 2 -tdy_regression_test_filename richards-driver-snes-prob1 -final_time 3.1536e3 -dt_max 600. -dt_growth_factor 1.5 -tdy_timers -tdy_init_with_random_field -time_integration_method SNES
    # richards-driver-snes-prob1 : run time : 0.10 seconds
    diff richards-driver-snes-prob1.regression.gold richards-driver-snes-prob1.regression
richards-driver-snes-prob1... passed.
----------------------------------------
richards-driver-ts-prob1... 
    cd /__w/TDycore/TDycore/demo/richards
    /__w/TDycore/TDycore/demo/richards/richards_driver -malloc 0 -successful_exit_code 0 -dim 3 -Nx 2 -Ny 2 -Nz 2 -tdy_water_density exponential -tdy_regression_test -tdy_regression_test_num_cells_per_process 2 -tdy_regression_test_filename richards-driver-ts-prob1 -final_time 3.1536e3 -dt_max 600. -dt_growth_factor 1.5 -tdy_init_with_random_field -time_integration_method TS
    # richards-driver-ts-prob1 : run time : 0.70 seconds
    diff richards-driver-ts-prob1.regression.gold richards-driver-ts-prob1.regression
richards-driver-ts-prob1... passed.
----------------------------------------
richards-driver-snes-prob1-np4... 
    cd /__w/TDycore/TDycore/demo/richards
    mpiexec -n 4 /__w/TDycore/TDycore/demo/richards/richards_driver -malloc 0 -successful_exit_code 0 -dim 3 -Nx 2 -Ny 2 -Nz 2 -tdy_water_density exponential -tdy_regression_test -tdy_regression_test_num_cells_per_process 1 -tdy_regression_test_filename richards-driver-snes-prob1-np4 -final_time 3.1536e3 -dt_max 600. -dt_growth_factor 1.5 -tdy_init_with_random_field -time_integration_method SNES
    # richards-driver-snes-prob1-np4 : run time : 44.47 seconds
    diff richards-driver-snes-prob1-np4.regression.gold richards-driver-snes-prob1-np4.regression
richards-driver-snes-prob1-np4... passed.
----------------------------------------
richards-driver-ts-prob1-np4... 
    cd /__w/TDycore/TDycore/demo/richards
    mpiexec -n 4 /__w/TDycore/TDycore/demo/richards/richards_driver -malloc 0 -successful_exit_code 0 -dim 3 -Nx 2 -Ny 2 -Nz 2 -tdy_water_density exponential -tdy_regression_test -tdy_regression_test_num_cells_per_process 1 -tdy_regression_test_filename richards-driver-ts-prob1-np4 -final_time 3.1536e3 -dt_max 600. -dt_growth_factor 1.5 -tdy_timers -tdy_init_with_random_field -time_integration_method TS

ERROR: job 'richards-driver-ts-prob1-np4' has exceeded timeout limit of 300.0 seconds.
    # richards-driver-ts-prob1-np4 : run time : 300.16 seconds

FAIL: could not find regression test file 'richards-driver-ts-prob1-np4.regression'. Please check the standard output file for errors.

richards-driver-ts-prob1-np4... failed.
--------------------------------------------------
richards.cfg : 4 tests :  1 tests failed,  3 tests passed
----------------------------------------------------------------------
Regression test file summary:
    /__w/TDycore/TDycore/demo/richards/richards.cfg... 4 tests :  1 tests failed,  3 passed.

----------------------------------------------------------------------
Regression test summary:
    Total run time: 345.445 [s]
    Total tests : 4
    Tests run : 4
    Failed : 1
```

jeff-cohere commented 3 years ago

The richards_driver test above (richards-driver-ts-prob1-np4) seems to be running fine, but it's so slow that it's timing out.

This might be happening because it's a 4-process problem running on pipsqueak hardware (the CI jobs really do seem to be limited to one or two cores). The th-driver-ts-prob1-np4 test also runs out of time, for example.

On the other hand, the richards-driver-snes-prob1-np4 test does finish in time, but that's a much shorter-running test, and even its execution time is pretty pokey.

Maybe we should scale back to 2-processor tests and/or limit the number of time steps to prevent this from happening.
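For illustration, a scaled-back variant of the timing-out test might look like this; every flag except -n 2, the test filename, and -final_time is copied from the np4 run in the log above, the per-process cell count is adjusted for 8 cells over 2 ranks, and the shortened final time is an arbitrary assumption:

```
# Run the TS Richards test on 2 processes over a shorter simulated interval.
# The np2 test name and the reduced -final_time are illustrative assumptions.
mpiexec -n 2 ./richards_driver -malloc 0 -successful_exit_code 0 \
    -dim 3 -Nx 2 -Ny 2 -Nz 2 -tdy_water_density exponential \
    -tdy_regression_test -tdy_regression_test_num_cells_per_process 2 \
    -tdy_regression_test_filename richards-driver-ts-prob1-np2 \
    -final_time 1.0e3 -dt_max 600. -dt_growth_factor 1.5 \
    -tdy_init_with_random_field -time_integration_method TS
```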

bishtgautam commented 3 years ago

@jeff-cohere We can try increasing the max time for tests. The _timeout is set to 60 sec in regression_tests.py#L113, and I can't figure out where the limit is being set to 300 sec.
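A hedged way to hunt for the 300 sec value, assuming it's an override somewhere in the harness or in the per-suite .cfg files (paths follow the layout in the test log above):

```
# Search the harness and the test configuration files for timeout settings;
# presumably one of these overrides the 60 sec default.
grep -n timeout regression_tests/regression_tests.py demo/richards/richards.cfg
```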

jeff-cohere commented 3 years ago

> @jeff-cohere We can try increasing the max time for tests. The _timeout is set to 60 sec in regression_tests.py#L113, and I can't figure out where the limit is being set to 300 sec.

Hey Gautam!

I'm more concerned about the fact that these tests seem to run a lot slower (or maybe even hang) in GitHub's CI environment than in our own. These timeouts/hangs kind of defeat the purpose of this PR, which is to accelerate the build/test process.

The biggest difference between our environment and GitHub's is that we're enabling code coverage on GitHub. Recent versions of GFortran have a bug that causes an internal compiler error when code coverage is enabled (see #145), which forced us to use older compiler versions. But now it seems like these older compilers produce executables that run the tests noticeably slower. What a pain!
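For context, coverage builds add gcov instrumentation flags along these lines; this is a generic GCC illustration with a placeholder file name, not the project's actual build line:

```
# --coverage is shorthand for -fprofile-arcs -ftest-coverage; compiling
# Fortran sources with it is what trips the ICE in recent GFortran (#145).
# (some_module.f90 is a placeholder file name.)
gfortran --coverage -c some_module.f90
```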

One solution might be to wait for the Ubuntu 21.04 release next month, which will likely include GCC/GFortran 11 and probably fixes this internal compiler error. If so, we can use the latest compilers and see whether they fix these hangs. The question is whether we can live with the build/test times we're seeing until then. There's not a ton of PR traffic right now, so maybe this is the easiest way to go.

bishtgautam commented 3 years ago

In that case, let's wait for Ubuntu 21.04.

jedbrown commented 3 years ago

Unless you have inside info, it seems all their testing is still on gcc-10 and it's getting late for an upgrade. https://packages.ubuntu.com/search?suite=hirsute&searchon=names&keywords=gcc I think the gcc docker images will ship gcc-11 as soon as it's released.

jeff-cohere commented 3 years ago

Good point. No insider info, just wishful thinking. My point is mostly that this PR is turning into a game of "find the working Fortran compiler," so it may be best to wait.

jeff-cohere commented 3 years ago

Okay, finally. With @bishtgautam's PR in, we're no longer measuring code coverage on the problematic Fortran tests, and I've worked through the remaining issues with the Docker image.