NOAA-EMC / JEDI-T2O

JEDI Transition to Operations activities.
GNU Lesser General Public License v2.1

Add a custom CMakeLists.txt, module files, and build script #5

Closed CoryMartin-NOAA closed 2 years ago

CoryMartin-NOAA commented 2 years ago

This PR adds 4 new files and a set of directories within a UFS-DA subdirectory to mimic how the GSI repository is set up.

There are Lua modules for both Orion and Hera to load the JEDI stacks. There is a custom CMakeLists.txt so that there is no need to clone a 'bundle'. There is also a new ush/build_UFSDA.sh script that will do everything automatically for the user.

CoryMartin-NOAA commented 2 years ago
  • I suspect the failures are due to me not having my environment properly configured instead of flagging real problems with the checkout and install.

@RussTreadon-NOAA try:

module use /path/to/clone/modulefiles
module load UFSDA/hera

and then try ctest. The iodaconv tests will still fail because the JEDI Python environment is not yet adequate, so I will turn that build off in the next commit. Some CRTM tests also failed for me, which I think is due to an older tag; I will update that as well.

RussTreadon-NOAA commented 2 years ago

Thank you @CoryMartin-NOAA. I updated to 5caae07 and rebuilt. I then ran

[Russ.Treadon@hfe12 modulefiles]$ module use /scratch1/NCEPDEV/da/Russ.Treadon/git/JEDI-T2O/ufs-da-bundle/UFS-DA/modulefiles
[Russ.Treadon@hfe12 modulefiles]$ module load UFSDA/hera

followed by ctest. 353 out of 995 tests failed with the following modules loaded:

[Russ.Treadon@hfe12 build]$ module list

Currently Loaded Modules:
  1) rocoto/1.3.3           8) zlib/1.2.11       15) netcdf/4.7.4          22) ecbuild/ecmwf-3.6.1  29) json-schema-validator/2.1.0
  2) contrib                9) udunits/2.2.28    16) boost-headers/1.68.0  23) eckit/ecmwf-1.16.0   30) jedi/intel-impi/2020.2/2020.2
  3) git-lfs/2.11.0        10) gsl_lite/0.37.0   17) eigen/3.3.7           24) fckit/ecmwf-0.9.2    31) UFSDA/hera
  4) intelpython/2021.3.0  11) impi/2020.2       18) bufr/noaa-emc-11.5.0  25) atlas/ecmwf-0.24.1
  5) intel/2020.2          12) jedi-impi/2020.2  19) nccmp/1.8.7.0         26) nco/4.9.1
  6) jedi-intel/2020.2     13) hdf5/1.12.0       20) pio/2.5.1-debug       27) pybind11/2.7.0
  7) szip/2.1.1            14) pnetcdf/1.12.1    21) cmake/3.20.1          28) json/3.9.1

The ctest log file is /scratch1/NCEPDEV/da/Russ.Treadon/git/JEDI-T2O/ufs-da-bundle/UFS-DA/build/Testing/Temporary/LastTest.log

CatherineThomas-NOAA commented 2 years ago

I attempted running earlier in the day on Hera and ran into issues, but after git pulling some updates, I was able to build successfully on both Hera and Orion.

I ran ctest on both machines, loading UFSDA/{machine} first as suggested. 351 tests failed on Orion and 484 tests failed on Hera.

CoryMartin-NOAA commented 2 years ago

Thank you @RussTreadon-NOAA and @CatherineThomas-NOAA ! My first guess (but I'll check tomorrow) is that the tests that failed are the MPI tests.

The build script has:

    export SLURM_ACCOUNT=${SLURM_ACCOUNT:-"da-cpu"}
    export SALLOC_ACCOUNT=${SALLOC_ACCOUNT:-$SLURM_ACCOUNT}
    export SBATCH_ACCOUNT=${SBATCH_ACCOUNT:-$SLURM_ACCOUNT}
    export SLURM_QOS=${SLURM_QOS:-"debug"}

which tells SLURM what account, etc. to use. My guess is that those are not in your env and thus srun cannot submit properly.
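For anyone unfamiliar with the `${VAR:-default}` form, here is a minimal sketch of how those four lines behave. The wrapper function `set_slurm_defaults` and the account name `my-project` are hypothetical and only serve the illustration; the defaults are the ones quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the four export lines quoted above.
set_slurm_defaults() {
    # ${VAR:-default} keeps an existing non-empty value, otherwise uses the default.
    export SLURM_ACCOUNT=${SLURM_ACCOUNT:-"da-cpu"}
    export SALLOC_ACCOUNT=${SALLOC_ACCOUNT:-$SLURM_ACCOUNT}
    export SBATCH_ACCOUNT=${SBATCH_ACCOUNT:-$SLURM_ACCOUNT}
    export SLURM_QOS=${SLURM_QOS:-"debug"}
}

# Case 1: nothing is set beforehand, so every default applies.
unset SLURM_ACCOUNT SALLOC_ACCOUNT SBATCH_ACCOUNT SLURM_QOS
set_slurm_defaults
echo "account=$SLURM_ACCOUNT qos=$SLURM_QOS"

# Case 2: a user-supplied account (hypothetical name) is kept and
# propagates to the salloc/sbatch variables that are still unset.
unset SALLOC_ACCOUNT SBATCH_ACCOUNT
export SLURM_ACCOUNT=my-project
set_slurm_defaults
echo "account=$SLURM_ACCOUNT salloc=$SALLOC_ACCOUNT sbatch=$SBATCH_ACCOUNT"
```

If these variables are absent from the shell that launches ctest by hand, none of the defaults are applied, which would be consistent with the guess above about srun.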

The build script now has an option to run the tests immediately after building if you call it like so: run_tests='YES' ./build_UFSDA.sh (thanks to @aerorahul for that suggestion)

RussTreadon-NOAA commented 2 years ago

Thanks @CoryMartin-NOAA (and @aerorahul!) for the option. Execution of run_tests='YES' ./build_UFSDA.sh in my Hera working directory resulted in 992 out of 995 tests passing. The following tests failed

Total Test time (real) = 2347.85 sec

The following tests FAILED:
        730 - test_ufo_obserror_assign_unittests (Failed)
        795 - test_ufo_function_drawvaluefromfile (Failed)
        809 - test_ufo_function_satwind_indiv_errors (Failed)
Errors while running CTest
Output from these tests are in: /scratch1/NCEPDEV/da/Russ.Treadon/git/JEDI-T2O/ufs-da-bundle/UFS-DA/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.

I will examine the FAILED cases more closely tomorrow.
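To compare failures between machines, the test names can be pulled out of a ctest summary like the one above. This is a minimal sketch, assuming the "NNN - name (Failed)" line format shown in this thread; the helper name `failed_tests` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a ctest summary on stdin and print the
# names of the tests marked "(Failed)", one per line.
failed_tests() {
    # Summary lines look like: "        730 - test_name (Failed)"
    sed -n 's/^[[:space:]]*[0-9][0-9]* - \([^[:space:]]*\) (Failed).*$/\1/p'
}

# Sample input copied from the Hera summary in this thread.
summary='The following tests FAILED:
        730 - test_ufo_obserror_assign_unittests (Failed)
        795 - test_ufo_function_drawvaluefromfile (Failed)
        809 - test_ufo_function_satwind_indiv_errors (Failed)
Errors while running CTest'

printf '%s\n' "$summary" | failed_tests
```

The same filter could be applied to saved terminal output from each machine, and the resulting lists sorted and diffed.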

CoryMartin-NOAA commented 2 years ago

I have the same 3 tests fail for me on Hera that @RussTreadon-NOAA noted. Given that it is only 3 tests, I am inclined to say that it is code- or data-related and we need not worry about it at this time. It is curious that these fail, though, because my cron job has not reported any test failures on Hera. The only difference I can think of is the JCSDA 'public' repositories vs the JCSDA-internal 'private' ones.

CatherineThomas-NOAA commented 2 years ago

I reran on Orion with the inline testing option.

99% tests passed, 7 tests failed out of 995

The following tests FAILED:
        562 - test_ufo_opr_atminterplay_ompsnp_npp_flipz (Failed)
        577 - test_ufo_opr_vertinterp_radiosonde (Failed)
        586 - test_ufo_opr_scatwindNeutralMetOffice (Failed)
        588 - test_ufo_linopr_timeoper (Failed)
        911 - test_femps_csgrid (Failed)
        915 - fv3jedi_test_tier1_geometry_gfs (Failed)
        919 - fv3jedi_test_tier1_geometry_2d (Failed)

As in the Hera case, I'm inclined to think that this is due not to our setup but to the current repos themselves.

CoryMartin-NOAA commented 2 years ago

Thanks @CatherineThomas-NOAA I only had one test fail on Orion... 911 - test_femps_csgrid (Failed)

It had a segmentation fault when running it interactively. By rerunning it with ulimit -s unlimited interactively, it passed.

@aerorahul should I add the ulimit -s unlimited to the build/test script? I assume that is where that should go?

RussTreadon-NOAA commented 2 years ago

I observed the same behavior as @CoryMartin-NOAA did on Orion

Total Test time (real) = 3330.70 sec

The following tests FAILED:
        911 - test_femps_csgrid (Failed)
Errors while running CTest
Orion-login-2[35] rtreadon$ ulimit -s unlimited
Orion-login-2[36] rtreadon$ ctest -R test_femps_csgrid
Test project /work/noaa/da/Russ.Treadon/git/JEDI-T2O/ufs-da-bundle/UFS-DA/build
    Start 911: test_femps_csgrid
1/1 Test #911: test_femps_csgrid ................   Passed    7.99 sec

100% tests passed, 0 tests failed out of 1

Label Time Summary:
executable    =   7.99 sec*proc (1 test)
femps         =   7.99 sec*proc (1 test)

Total Test time (real) =   8.49 sec
Orion-login-2[37] rtreadon$

aerorahul commented 2 years ago

  • @aerorahul should I add the ulimit -s unlimited to the build/test script? I assume that is where that should go?

Probably needed in the if-block for hera and orion.
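A minimal sketch of what such an if-block might look like, not the actual build script. The predicate `needs_unlimited_stack` and the variable `target` are hypothetical names; only the machine names and the `ulimit -s unlimited` command come from this thread.

```shell
#!/usr/bin/env bash
# Hypothetical predicate: true for the two machines discussed in this thread.
needs_unlimited_stack() {
    [ "$1" = "hera" ] || [ "$1" = "orion" ]
}

# Hypothetical: the machine name arrives as the first argument, default orion.
target=${1:-orion}

if needs_unlimited_stack "$target"; then
    # Raise the soft stack limit for this shell and its children.
    # This can fail if the hard limit is lower, so warn instead of aborting.
    ulimit -s unlimited || echo "warning: could not raise stack limit" >&2
fi

# Any ctest invocation placed after this point would inherit the new limit.
echo "stack limit for $target: $(ulimit -s)"
```

Because `ulimit` only affects the current shell and its children, the call has to run in the same script (or shell) that later launches the tests.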

Thinking more about this, why are we running tests in the build script? And if we indeed do want to, why do we care about running all the tests from eckit down to fv3-jedi? When a tag/branch is added to the workflow, we should ensure that it has been tested offline. JEDI allows us to do that.

I would remove the ctest option completely from this. If someone is doing development on the fv3-bundle, they can do these tests there.

CoryMartin-NOAA commented 2 years ago

@aerorahul well the default is to build without testing and it only runs the tests if the user explicitly tells it to. I see no harm in this as-is but will defer to what others think.
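For reference, an opt-in gate of this kind can be sketched as follows. This is an illustration, not the contents of build_UFSDA.sh: the function name `should_run_tests` and the default value NO are assumptions; only the `run_tests='YES'` usage comes from this thread.

```shell
#!/usr/bin/env bash
# Hypothetical predicate: tests run only when run_tests is exactly YES.
# An unset or empty run_tests is treated as NO (assumed default).
should_run_tests() {
    [ "${run_tests:-NO}" = "YES" ]
}

if should_run_tests; then
    echo "running tests"      # a real script would invoke ctest here
else
    echo "skipping tests"     # default path: build only
fi
```

With this pattern, a plain ./build_UFSDA.sh call never touches the tests, and only an explicit run_tests='YES' in the environment turns them on.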

RussTreadon-NOAA commented 2 years ago

Since the default is to not run ctest upon installation, I'm fine with leaving the option. It's something I would use as part of installing the package on a new machine. That said, I get @aerorahul's point. The name ./build_UFSDA.sh indicates that the script builds (installs) the package. One can manually execute all or specific ctests after installation. Sorry, this isn't a clear "keep it" or "remove it" comment.

CoryMartin-NOAA commented 2 years ago

With the most recent commit, to build please run either BUILD_TARGET=orion ./build_UFSDA.sh or BUILD_TARGET=hera ./build_UFSDA.sh, depending on platform.