Closed CoryMartin-NOAA closed 2 years ago
I suspect the failures are due to my environment not being configured properly rather than real problems with the checkout and install.
@RussTreadon-NOAA try:
module use /path/to/clone/modulefiles
module load UFSDA/hera
and then try ctest. The iodaconv tests will still fail because the JEDI python environment is not good enough (yet), so I will turn that build off in the next commit. Also, some CRTM tests failed for me, which I think is due to an older tag; I will update that as well.
Thank you @CoryMartin-NOAA . I updated to 5caae07 & rebuilt. After this
[Russ.Treadon@hfe12 modulefiles]$ module use /scratch1/NCEPDEV/da/Russ.Treadon/git/JEDI-T2O/ufs-da-bundle/UFS-DA/modulefiles
[Russ.Treadon@hfe12 modulefiles]$ module load UFSDA/hera
followed by ctest, 353 out of 995 tests failed with the following modules loaded:
[Russ.Treadon@hfe12 build]$ module list
Currently Loaded Modules:
1) rocoto/1.3.3 8) zlib/1.2.11 15) netcdf/4.7.4 22) ecbuild/ecmwf-3.6.1 29) json-schema-validator/2.1.0
2) contrib 9) udunits/2.2.28 16) boost-headers/1.68.0 23) eckit/ecmwf-1.16.0 30) jedi/intel-impi/2020.2/2020.2
3) git-lfs/2.11.0 10) gsl_lite/0.37.0 17) eigen/3.3.7 24) fckit/ecmwf-0.9.2 31) UFSDA/hera
4) intelpython/2021.3.0 11) impi/2020.2 18) bufr/noaa-emc-11.5.0 25) atlas/ecmwf-0.24.1
5) intel/2020.2 12) jedi-impi/2020.2 19) nccmp/1.8.7.0 26) nco/4.9.1
6) jedi-intel/2020.2 13) hdf5/1.12.0 20) pio/2.5.1-debug 27) pybind11/2.7.0
7) szip/2.1.1 14) pnetcdf/1.12.1 21) cmake/3.20.1 28) json/3.9.1
The ctest log file is /scratch1/NCEPDEV/da/Russ.Treadon/git/JEDI-T2O/ufs-da-bundle/UFS-DA/build/Testing/Temporary/LastTest.log
I attempted running earlier in the day on Hera and ran into issues, but after git pulling some updates, I was able to build successfully on both Hera and Orion.
I ran ctest on both machines, loading UFSDA/{machine} first as suggested. 351 tests failed on Orion and 484 tests failed on Hera.
Thank you @RussTreadon-NOAA and @CatherineThomas-NOAA ! My first guess (but I'll check tomorrow) is that the tests that failed are the MPI tests.
The build script has:
export SLURM_ACCOUNT=${SLURM_ACCOUNT:-"da-cpu"}
export SALLOC_ACCOUNT=${SALLOC_ACCOUNT:-$SLURM_ACCOUNT}
export SBATCH_ACCOUNT=${SBATCH_ACCOUNT:-$SLURM_ACCOUNT}
export SLURM_QOS=${SLURM_QOS:-"debug"}
which tells SLURM what account, etc. to use. My guess is that those are not in your env and thus srun cannot submit properly.
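As a minimal sketch of how those defaults behave: the `${VAR:-default}` expansion only fills in a value when the variable is unset or empty, so anything already exported in the user's environment wins. The function name `apply_slurm_defaults` is hypothetical and used here only for illustration; the four variables and their defaults are the ones quoted from the build script.

```shell
# Hypothetical wrapper around the four exports quoted above.
# ${VAR:-default} keeps an existing value and only falls back to the
# default when VAR is unset or empty.
apply_slurm_defaults() {
  export SLURM_ACCOUNT="${SLURM_ACCOUNT:-da-cpu}"
  export SALLOC_ACCOUNT="${SALLOC_ACCOUNT:-$SLURM_ACCOUNT}"
  export SBATCH_ACCOUNT="${SBATCH_ACCOUNT:-$SLURM_ACCOUNT}"
  export SLURM_QOS="${SLURM_QOS:-debug}"
}

apply_slurm_defaults

# Show what srun/salloc/sbatch would pick up from the environment.
printf 'SLURM_ACCOUNT=%s\n'  "$SLURM_ACCOUNT"
printf 'SALLOC_ACCOUNT=%s\n' "$SALLOC_ACCOUNT"
printf 'SBATCH_ACCOUNT=%s\n' "$SBATCH_ACCOUNT"
printf 'SLURM_QOS=%s\n'      "$SLURM_QOS"
```

If ctest is run by hand in a fresh shell, exporting the same variables first should give srun the same account and QOS that the build script would have used.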
The build script now has an option to run the tests immediately after building if you call it like so:
run_tests='YES' ./build_UFSDA.sh
(thanks to @aerorahul for that suggestion)
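For readers unfamiliar with the `run_tests='YES' ./build_UFSDA.sh` form: an assignment placed in front of a command is exported to that one command only and does not persist in the calling shell. The sketch below illustrates this with a stand-in child process instead of the real build script; the fallback value `NO` is an assumption made for the illustration, not something taken from `build_UFSDA.sh`.

```shell
# Stand-in for the build script: report the value of run_tests that the
# child process sees, falling back to NO (assumed default) when unset.
child='echo "run_tests=${run_tests:-NO}"'

unset run_tests                  # start from a clean calling shell

sh -c "$child"                   # no prefix: the child sees the fallback
run_tests='YES' sh -c "$child"   # prefix: the child sees YES

# The prefix assignment did not leak into the calling shell.
echo "caller: run_tests=${run_tests:-unset}"
```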
Thanks @CoryMartin-NOAA (and @aerorahul!) for the option. Execution of run_tests='YES' ./build_UFSDA.sh in my Hera working directory resulted in 992 out of 995 tests passing. The following tests failed:
Total Test time (real) = 2347.85 sec
The following tests FAILED:
730 - test_ufo_obserror_assign_unittests (Failed)
795 - test_ufo_function_drawvaluefromfile (Failed)
809 - test_ufo_function_satwind_indiv_errors (Failed)
Errors while running CTest
Output from these tests are in: /scratch1/NCEPDEV/da/Russ.Treadon/git/JEDI-T2O/ufs-da-bundle/UFS-DA/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
I will examine the FAILED cases more closely tomorrow.
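Besides `--rerun-failed`, a pasted summary like the one above can be turned into a name filter for `ctest -R`, which is handy when the summary comes from someone else's build directory. This is only a sketch: the helper name `failed_test_regex` is hypothetical, and it assumes summary lines of the form `<number> - <name> (Failed)` as shown in this thread.

```shell
# Read a ctest summary on stdin and print an anchored regular expression
# that matches exactly the tests marked "(Failed)".
# Assumes lines shaped like: "730 - test_name (Failed)".
failed_test_regex() {
  awk '
    /\(Failed\)/ { names = (names == "" ? $3 : names "|" $3) }
    END          { if (names != "") print "^(" names ")$" }
  '
}

# Example input copied from the Hera summary above.
summary='730 - test_ufo_obserror_assign_unittests (Failed)
795 - test_ufo_function_drawvaluefromfile (Failed)
809 - test_ufo_function_satwind_indiv_errors (Failed)'

regex=$(printf '%s\n' "$summary" | failed_test_regex)

# Print the command instead of running it, since ctest needs a build tree.
echo "ctest -R '$regex' --output-on-failure"
```

Anchoring the expression with `^` and `$` keeps `ctest -R` from also selecting tests whose names merely contain one of the failed names.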
I have the same 3 tests fail for me on Hera that @RussTreadon-NOAA noted. Given that it is only 3 tests, I am inclined to say that it is something code/data related and we need not worry about it at this time. It is curious that these fail, though, because my cron job has not reported any test failures on Hera. The only difference I can think of is the use of the JCSDA 'public' repositories versus the JCSDA-internal 'private' ones.
I reran on Orion with the inline testing option.
99% tests passed, 7 tests failed out of 995
The following tests FAILED:
562 - test_ufo_opr_atminterplay_ompsnp_npp_flipz (Failed)
577 - test_ufo_opr_vertinterp_radiosonde (Failed)
586 - test_ufo_opr_scatwindNeutralMetOffice (Failed)
588 - test_ufo_linopr_timeoper (Failed)
911 - test_femps_csgrid (Failed)
915 - fv3jedi_test_tier1_geometry_gfs (Failed)
919 - fv3jedi_test_tier1_geometry_2d (Failed)
Like the Hera case, I'm also inclined to think that this is due not to our setup but to the current repos themselves.
Thanks @CatherineThomas-NOAA. I only had one test fail on Orion:
911 - test_femps_csgrid (Failed)
It had a segmentation fault when I ran it interactively. After rerunning it interactively with ulimit -s unlimited, it passed.
@aerorahul should I add the ulimit -s unlimited to the build/test script? I assume that is where that should go?
I observed the same behavior as @CoryMartin-NOAA did on Orion
Total Test time (real) = 3330.70 sec
The following tests FAILED:
911 - test_femps_csgrid (Failed)
Errors while running CTest
Orion-login-2[35] rtreadon$ ulimit -s unlimited
Orion-login-2[36] rtreadon$ ctest -R test_femps_csgrid
Test project /work/noaa/da/Russ.Treadon/git/JEDI-T2O/ufs-da-bundle/UFS-DA/build
Start 911: test_femps_csgrid
1/1 Test #911: test_femps_csgrid ................ Passed 7.99 sec
100% tests passed, 0 tests failed out of 1
Label Time Summary:
executable = 7.99 sec*proc (1 test)
femps = 7.99 sec*proc (1 test)
Total Test time (real) = 8.49 sec
Orion-login-2[37] rtreadon$
The ulimit -s unlimited is probably needed in the if-block for hera and orion.
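A rough sketch of what such a platform-specific block could look like follows. The actual layout of the if-block in `build_UFSDA.sh` is not shown in this thread, so the helper `needs_unlimited_stack` and the `case` structure are assumptions; only `BUILD_TARGET`, the two platform names, and `ulimit -s unlimited` come from the discussion.

```shell
# Fallback to hera is for this illustration only.
BUILD_TARGET="${BUILD_TARGET:-hera}"

# Hypothetical helper: succeed for platforms where the stack limit
# should be lifted before running the tests.
needs_unlimited_stack() {
  case "$1" in
    hera|orion) return 0 ;;
    *)          return 1 ;;
  esac
}

if needs_unlimited_stack "$BUILD_TARGET"; then
  # Raising the soft limit fails if the hard limit is lower, so warn
  # instead of aborting the build.
  ulimit -s unlimited 2>/dev/null ||
    echo "warning: could not raise the stack size limit" >&2
fi

# Processes started from this shell (ctest included) inherit the limit.
echo "stack size limit: $(ulimit -s)"
```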
Thinking more about this, why are we running tests in the build script? And if we indeed do want to, why do we care about running all the tests from eckit down to fv3-jedi? When a tag/branch is added to the workflow, we should ensure that it has been tested offline. JEDI allows us to do that.
I would remove the ctest option completely from this. If someone is doing development on the fv3-bundle, they can do these tests there.
@aerorahul well the default is to build without testing and it only runs the tests if the user explicitly tells it to. I see no harm in this as-is but will defer to what others think.
Since the default is to not run ctest upon installation, I'm fine with leaving the option. It's something I would use as part of installing the package on a new machine. That said, I get @aerorahul 's point. The name ./build_UFSDA.sh indicates that the script builds (installs) the package. One can manually execute all or specific ctests after installation. Sorry, this isn't a clear keep-it-or-remove-it comment.
With the most recent commit, to build please run either:
BUILD_TARGET=orion ./build_UFSDA.sh
or
BUILD_TARGET=hera ./build_UFSDA.sh
depending on platform.
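One way a script might turn `BUILD_TARGET` into the module to load is sketched below. This is an illustration under assumptions, not the contents of `build_UFSDA.sh`: the function `ufsda_module_for` is hypothetical, while the module names `UFSDA/hera` and `UFSDA/orion` follow the `UFSDA/{machine}` pattern mentioned earlier in the thread.

```shell
# Hypothetical helper: map a supported BUILD_TARGET to its module name,
# and fail with a message for anything else.
ufsda_module_for() {
  case "$1" in
    hera|orion)
      echo "UFSDA/$1"
      ;;
    *)
      echo "unsupported BUILD_TARGET: '$1' (expected hera or orion)" >&2
      return 1
      ;;
  esac
}

# Fallback to hera is for this illustration only.
target="${BUILD_TARGET:-hera}"

if mod=$(ufsda_module_for "$target"); then
  # Printed rather than executed, because the `module` command only
  # exists on systems with an environment-modules setup.
  echo "would run: module load $mod"
fi
```

Failing early on an unrecognized target gives a clearer message than a later `module load` error.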
This PR adds 4 new files and a set of directories within a UFS-DA subdirectory to mimic how the GSI repository is set up. There are Lua modules for both Orion and Hera to load the JEDI stacks. There is a custom CMakeLists.txt so that there is no need to clone a 'bundle'. There is also a new ush/build_UFSDA.sh script that will do everything automatically for the user.