NCAR / DART

Data Assimilation Research Testbed
https://dart.ucar.edu/
Apache License 2.0

Quickbuild tests #593

Closed hkershaw-brown closed 8 months ago

hkershaw-brown commented 8 months ago

Description:

Building on Ann's previous pull request https://github.com/NCAR/DART/pull/575. Improved build_everything: submits a job on Derecho for each compiler; each job runs every quickbuild.sh.

Time: cce ~12 minutes, gfortran ~2:30 minutes, nvhpc ~4:40 minutes, ifort ~4:40 minutes
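
The per-compiler job described above can be pictured roughly as follows. This is a minimal sketch, not the actual run_all_quickbuilds.sh: the function name run_quickbuilds, its arguments, the quickbuild.log file name, and the PASSED label are assumptions for illustration. Only the shape of the RESULT line is taken from the output quoted in this description.

```shell
#!/bin/sh
# Sketch only: run every quickbuild.sh under a checkout and print one
# RESULT line per work directory. All names here are hypothetical.

run_quickbuilds() {
    qb_compiler="$1"   # label for the output, e.g. intel, gcc, cce, nvhpc
    qb_root="$2"       # top of the DART checkout

    # Sort the list so the numbering is stable from run to run.
    find "$qb_root" -type f -name quickbuild.sh | sort | {
        qb_i=0
        while IFS= read -r qb_script; do
            qb_dir="$(dirname "$qb_script")/"
            # Build in a subshell so the cd does not leak out; stdin is
            # redirected so the build cannot consume the file list.
            if ( cd "$qb_dir" && ./quickbuild.sh > quickbuild.log 2>&1 < /dev/null ); then
                qb_status="PASSED"   # assumed label; only FAILED appears in this thread
            else
                qb_status="FAILED"
            fi
            echo "$qb_compiler RESULT: $qb_i $qb_dir $qb_status"
            qb_i=$((qb_i + 1))
        done
    }
}
```

Calling run_quickbuilds intel /path/to/DART would print one line per work directory; a submit script could then launch one such job per compiler.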

Note: several of the converters require external libraries or code to be added before compiling (rttov, hdfeos, wrf code, ncep prepbuf code). These failures are ignored for this pull request:

intel RESULT: 1 /glade/derecho/scratch/hkershaw/build_everything/intel/DART/developer_tests/forward_operators/work/ FAILED
intel RESULT: 10 /glade/derecho/scratch/hkershaw/build_everything/intel/DART/observations/obs_converters/GOES/work/ FAILED
intel RESULT: 11 /glade/derecho/scratch/hkershaw/build_everything/intel/DART/observations/obs_converters/gps/work/ FAILED
intel RESULT: 12 /glade/derecho/scratch/hkershaw/build_everything/intel/DART/observations/obs_converters/NSIDC/work/ FAILED
intel RESULT: 28 /glade/derecho/scratch/hkershaw/build_everything/intel/DART/observations/obs_converters/var/work/ FAILED
intel RESULT: 29 /glade/derecho/scratch/hkershaw/build_everything/intel/DART/observations/obs_converters/AIRS/work/ FAILED
intel RESULT: 31 /glade/derecho/scratch/hkershaw/build_everything/intel/DART/observations/obs_converters/quikscat/work/ FAILED
intel RESULT: 42 /glade/derecho/scratch/hkershaw/build_everything/intel/DART/observations/obs_converters/GMI/work/ FAILED

Other failures: #592, #352, #594

Types of changes

Documentation changes needed?

Tests

Derecho: build everything (that has a quickbuild.sh) in DART for cce, intel, gcc, nvhpc

Checklist for merging

Checklist for release

Testing Datasets

mjs2369 commented 8 months ago

The builds that use specific libraries such as rttov will be addressed in a future pull request

hkershaw-brown commented 8 months ago

> In order to add ifx to the list of compilers to build with, we need to update fixsystem to include ifx

will do, thanks Marlee!

hkershaw-brown commented 8 months ago

I've added ifx. To run, you'll need to run with the quickbuild_tests branch because main does not have the mkmf.template.ifx.linux

./submit_jobs quickbuild_tests

mjs2369 commented 8 months ago

> To run, you'll need to run with the quickbuild_tests branch because main does not have the mkmf.template.ifx.linux
>
> ./submit_jobs quickbuild_tests

^^^ This should be added to the README

And a follow up question @hkershaw-brown - if the submit_jobs.sh script submits a job for each compiler, are we expecting the ifx build to still run and just error out on all branches other than quickbuild_tests?

hkershaw-brown commented 8 months ago

> To run, you'll need to run with the quickbuild_tests branch because main does not have the mkmf.template.ifx.linux
>
> ./submit_jobs quickbuild_tests
>
> ^^^ This should be added to the README

Nope don't add to the README. When this pull request is merged into main, mkmf.template.ifx.linux will exist on main. This 'To run' note was just for you for this pull request.

> And a follow up question @hkershaw-brown - if the submit_jobs.sh script submits a job for each compiler, are we expecting the ifx build to still run and just error out on all branches other than quickbuild_tests?

Yes because mkmf.template.ifx.linux only exists on the quickbuild_tests branch.
Once quickbuild_tests is merged into main mkmf.template.ifx.linux will exist on main.

hkershaw-brown commented 8 months ago

> Should we add some code for teardown? Currently, consecutive runs will fail with Directory exists: /glade/derecho/scratch/masmith/build_everything/nvhpc

This is the teardown: https://github.com/NCAR/DART/blob/a20971ad106b382c02056c6d269f57d50cfb423d/developer_tests/build_everything/run_all_quickbuilds.sh#L143

If you don't have a compiler.dateTime directory then the job did not finish. So I think at that point it is worth manually investigating why the job has failed, rather than rm -rf directories.
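
The behaviour described here (a new run refuses to start when the compiler directory already exists, and a finished run is moved aside to a compiler.dateTime directory) could look roughly like the sketch below. It is an illustration only; the linked run_all_quickbuilds.sh is the authority. The function names, the timestamp format, and the return codes are assumptions.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers, not the code in run_all_quickbuilds.sh.

# start_run DIR: refuse to reuse a directory left behind by an unfinished job.
start_run() {
    run_dir="$1"
    if [ -d "$run_dir" ]; then
        echo "Directory exists: $run_dir" >&2
        return 1
    fi
    mkdir -p "$run_dir"
}

# finish_run DIR: after a completed job, move DIR aside to DIR.dateTime
# so the next run starts from a clean path.
finish_run() {
    run_dir="$1"
    stamp="$(date +%Y-%m-%dT%H-%M-%S)"   # assumed timestamp format
    mv "$run_dir" "$run_dir.$stamp"
}
```

With this shape, a leftover directory without a timestamp suffix is itself the signal that the previous job did not finish, which is why deleting it automatically would hide the failure.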

mjs2369 commented 8 months ago

> To run, you'll need to run with the quickbuild_tests branch because main does not have the mkmf.template.ifx.linux
>
> ./submit_jobs quickbuild_tests
>
> ^^^ This should be added to the README
>
> Nope don't add to the README. When this pull request is merged into main, mkmf.template.ifx.linux will exist on main. This 'To run' note was just for you for this pull request.
>
> And a follow up question @hkershaw-brown - if the submit_jobs.sh script submits a job for each compiler, are we expecting the ifx build to still run and just error out on all branches other than quickbuild_tests?
>
> Yes because mkmf.template.ifx.linux only exists on the quickbuild_tests branch. Once quickbuild_tests is merged into main mkmf.template.ifx.linux will exist on main.

That makes much more sense and is obvious in hindsight. For some reason, I thought you meant that mkmf.template.ifx.linux was not going to be added to main with this PR. Ignore this.

mjs2369 commented 8 months ago

Sometimes Derecho is unable to make a connection to the DART remote repository, causing some of the jobs to fail with this message:

Cloning into 'DART'...
fatal: unable to access 'https://github.com/NCAR/DART.git/': Failed to connect to github.com port 443 after 1 ms: Couldn't connect to server
./run_all_quickbuilds.sh: line 51: cd: DART: No such file or directory
fatal: not a git repository (or any parent up to mount point /glade/derecho)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: not a git repository (or any parent up to mount point /glade/derecho)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
unknown branch

I have submitted a support request with CISL Help to address this issue.
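
The log above shows the script carrying on after the clone failed, which is where the later cd and git errors and the "unknown branch" line come from. As a possible mitigation while the connectivity problem is investigated (this is not something the pull request does), a job could retry the clone a few times and stop if it never succeeds. The helper name retry, the attempt count, and the delay are assumptions.

```shell
#!/bin/sh
# Sketch only: a hypothetical retry helper, not part of the DART scripts.

# retry MAX DELAY CMD...: run CMD up to MAX times, sleeping DELAY seconds
# between attempts. Returns 0 on the first success, 1 if every attempt fails.
retry() {
    retry_max="$1"
    retry_delay="$2"
    shift 2
    retry_n=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$retry_n" -ge "$retry_max" ]; then
            echo "giving up after $retry_n attempts: $*" >&2
            return 1
        fi
        retry_n=$((retry_n + 1))
        sleep "$retry_delay"
    done
}

# Example use (assumed values, not run here):
#   retry 5 30 git clone https://github.com/NCAR/DART.git || exit 1
#   cd DART || exit 1
```

Stopping as soon as the clone is known to have failed keeps the job output down to the one error that matters.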

mjs2369 commented 8 months ago

Additional failures with cce:

Just making a note of these on this PR, as other failures are noted in the body, but we can create issues for these as well

masmith@derecho1:~/DART/developer_tests/build_everything> grep -a FAILED test-2/build-everything-cce.o2646766 
cce RESULT: 0 /glade/derecho/scratch/masmith/build_everything/cce/DART/models/clm/work/ FAILED
cce RESULT: 3 /glade/derecho/scratch/masmith/build_everything/cce/DART/models/noah/work/ FAILED
cce RESULT: 6 /glade/derecho/scratch/masmith/build_everything/cce/DART/models/wrf_hydro/work/ FAILED
cce RESULT: 12 /glade/derecho/scratch/masmith/build_everything/cce/DART/models/wrf/work/ FAILED

CLM:

ftn -O2  -I/glade/u/apps/derecho/23.06/spack/opt/spack/netcdf/4.9.2/cce/15.0.1/cuko/include  -c /glade/u/home/masmith/DART/models/clm/dart_to_clm.f90

ftn-1569 ftn: WARNING UPDATE_SNOW, File = ../../../../../../u/home/masmith/DART/models/clm/dart_to_clm.f90, Line = 612, Column = 19 
  A DO loop variable or expression of type default real or double precision real is a deleted feature of the Fortran standard.

ftn-1569 ftn: WARNING UPDATE_SNOW, File = ../../../../../../u/home/masmith/DART/models/clm/dart_to_clm.f90, Line = 628, Column = 19 
  A DO loop variable or expression of type default real or double precision real is a deleted feature of the Fortran standard.

ftn-319 ftn: ERROR UPDATE_SNOW, File = ../../../../../../u/home/masmith/DART/models/clm/dart_to_clm.f90, Line = 752, Column = 76 
  A subscript must be a scalar integer expression.

Cray Fortran : Version 15.0.1 (20230120205242_66f7391d6a03cf932f321b9f6b1d8612ef5f362c)

Line in question - https://github.com/NCAR/DART/blob/74b4221e5f4e41e4de2980fc9ff8697ba4540a8b/models/clm/dart_to_clm.f90#L752

snlsno(ncolumn), which is used as the subscript for gain_dzsno, is declared as real(r8). Changing the declaration to integer (line 418) lets it compile:
https://github.com/NCAR/DART/blob/74b4221e5f4e41e4de2980fc9ff8697ba4540a8b/models/clm/dart_to_clm.f90#L418

NOAH/WRF_HYDRO:


ftn -O2  -I/glade/u/apps/derecho/23.06/spack/opt/spack/netcdf/4.9.2/cce/15.0.1/cuko/include  -c /glade/u/home/masmith/DART/models/wrf_hydro/noah_hydro_mod.f90
   Error message      ::  _expr_type: Invalid table type
   Error detected     ::  File '/home/jenkins/crayftn/pdgcs/v_expr_utl.c', line 7360
   Initiated from     ::  Line 1280 (v_main.c)
   Optimizer built    ::  2023-01-20 (production)

   File               ::  /glade/u/home/masmith/DART/models/wrf_hydro/noah_hydro_mod.f90
   Function           ::  getchannelgridcoords
   at or near line    ::  660

   Compiler hash      ::  66f7391d6a03cf932f321b9f6b1d8612ef5f362c
   Target             ::  x86-milan

ftn-7991 ftn: INTERNAL GETCHANNELGRIDCOORDS, File = ../../../../../../u/home/masmith/DART/models/wrf_hydro/noah_hydro_mod.f90, Line = 660 
  INTERNAL COMPILER ERROR:  "_expr_type: Invalid table type" (/home/jenkins/crayftn/pdgcs/v_expr_utl.c, line 7360, version 66f7391d6a03cf932f321b9f6b1d8612ef5f362c)

Line in question: https://github.com/NCAR/DART/blob/74b4221e5f4e41e4de2980fc9ff8697ba4540a8b/models/wrf_hydro/noah_hydro_mod.f90#L660

WRF:

Building  WRF_DART_utilities/add_pert_where_high_refl  build  18  of  29
............................................................................................... Makefile is ready.
ftn -O2  -I/glade/u/apps/derecho/23.06/spack/opt/spack/netcdf/4.9.2/cce/15.0.1/cuko/include  -c /glade/u/home/masmith/DART/models/wrf/WRF_DART_utilities/add_pert_where_high_refl.f90

ftn-292 ftn: ERROR ADD_PERT_WHERE_HIGH_REFL, File = ../../../../../../u/home/masmith/DART/models/wrf/WRF_DART_utilities/add_pert_where_high_refl.f90, Line = 37, Column = 8 
  "F2KCLI" is specified as the module name on a USE statement, but the compiler cannot find it.

Cray Fortran : Version 15.0.1 (20230120205242_66f7391d6a03cf932f321b9f6b1d8612ef5f362c)

Solution is to remove the following line: use f2kcli
https://github.com/NCAR/DART/blob/74b4221e5f4e41e4de2980fc9ff8697ba4540a8b/models/wrf/WRF_DART_utilities/add_pert_where_high_refl.f90#L37
Compiles after this change.

hkershaw-brown commented 8 months ago

@mjs2369 https://github.com/NCAR/DART/issues/599 https://github.com/NCAR/DART/issues/598