model runs crash under PGI compiler with FATAL calc_mosaic_grid_area: invalid types given. Arguments must be all r4_kind or r8_kind

nikizadehgfdl commented 2 years ago

Describe the bug MOM6-SIS2 tests crash under PGI with:

FATAL from PE     1: calc_mosaic_grid_area: invalid types given. Arguments must be all r4_kind or r8_kind

To Reproduce Steps to reproduce the behavior

Expected behavior Run any MOM6-SIS2 tests with pgi20. E.g.,

cd  /ncrc/home2/Niki.Zadeh/frerts/FMS2022.03-alpha1_mom6_20220703_FMS2/_070822/
fremake --execute -x /ncrc/home2/Niki.Zadeh/frerts/FMS2022.03-alpha1_mom6_20220703_FMS2/_070822/OM4_extra.xml.20220712163000 -p ncrc4.pgi19 -t prod MOM6_SIS2_compile_FMS2

frerun -s -u -r basic --no-transfer -x /ncrc/home2/Niki.Zadeh/frerts/FMS2022.03-alpha1_mom6_20220703_FMS2/_070822/OM4_extra.xml.20220712163000 -p ncrc4.pgi19 -t prod MOM6_SIS2_bergs_cgrid

System Environment module unload cray-netcdf cray-hdf5 fre module unload PrgEnv-pgi PrgEnv-intel PrgEnv-gnu PrgEnv-cray module load PrgEnv-pgi/6.0.5 module unload pgi module load pgi/19.10.0 module load fre/bronx-19 module load cray-hdf5/1.10.5.2

Additional context Add any other context about the problem. If applicable, include where any files that help describe, or reproduce the problem exist.

nikizadehgfdl commented 2 years ago

Same crash for pgi/20.4.0 on c4.

rem1776 commented 2 years ago

That check was added with #828, it uses a select type to make sure the 3 arguments are of the same real precision. Maybe one of the arguments ends up with the wrong real size with pgi?

The emc mixed mode updates adds class(*) and overloads to a few areas so that could potentially cause something, but that wouldn't really explain why it only happens with this compiler.

bensonr commented 2 years ago

The compilers being tested are 2-4 year old technology. Implementation of class(*) is still relatively new and I'd feel more comfortable if this were tested with the latest NVHPC toolkit (22.5).

nikizadehgfdl commented 2 years ago

When I try compiling with nvhpc22.5 on gpubox I get the following error:

pgfortran -Duse_netCDF -DMAXFIELDMETHODS_=400 -Duse_LARGEFILE  -I/opt/openmpi/4.1.4/NVHPC/22.5/include -I/opt/netcdf/4.9.0/NVHPC/22.5/include -i4 -r8 -byteswapio -Mcray=pointer -Mcray=pointer -Mflushz -Mdaz -D_F2000 -DNO_QUAD_PRECISION -O3 -Mvect=nosse -Mnoscalarsse -Mallocatable=95   -c -I/home/Niki.Zadeh/platforms/mom6/builds/../src/FMS/fms2_io/include    /home/Niki.Zadeh/platforms/mom6/builds/../src/FMS/fms2_io/netcdf_io.F90
NVFORTRAN-S-0155-Same name common blocks with different sizes in same file not supported  (/home/Niki.Zadeh/platforms/mom6/builds/../src/FMS/fms2_io/netcdf_io.F90)
  0 inform,   0 warnings,   1 severes, 0 fatal for write_restart_bc

What common block?

nikizadehgfdl commented 2 years ago

I can build the same code with gcc11 on gpubox.

nikizadehgfdl commented 2 years ago

Curiously, I removed the write_restart_bc subroutine from netcdf_io.F90 abd the MOM6-SIS2 model built without an error! It seems that subroutine is used only in test_fms/fms2_io/test_bc_restart.F90, right? Is any other component (LAND or FV3) using it? Isn't write_restart_bc subroutine supposed to replace fms2_io::save_restart_border()?

bensonr commented 2 years ago

FV3 uses write_restart_bc to output the time 't' and 't-1' data in the halo regions of a nest. You are correct that write_restart_bc is the fms2_io equivalent for the fms_io::save_restart_border function.

nikizadehgfdl commented 2 years ago

With nvhpc22.5 I get another runtime error as follows. I have no idea if this is happening before or after the call that made the error above with pgi20( Arguments must be all r4_kind or r8_kind).

FATAL from PE     0: character buffer is too small; increase length: netcdf_read_data_0d: file:INPUT/grid_spec.nc- variable:lnd_mosaic

In file INPUT/grid_spec.nc we have

lnd_mosaic = "land_mosaic" ;

gnu executable of the same code runs fine on the same machine (lscgpu50).

nikizadehgfdl commented 2 years ago

I think this might be a variation of the original issue. Here's the call stack as far as I can tell:

FMS/mosaic2/grid2.F90:            
  character(len=256) :: mosaic_name
  call read_data(gridfileobj, 'lnd_mosaic', mosaic_name)

FMS/fms2_io/include/netcdf_read_data.inc: 
  subroutine netcdf_read_data_0d(fileobj, variable_name, buf, unlim_dim_level,corner, broadcast)
    class(FmsNetcdfFile_t), intent(in) :: fileobj !< File object.
    character(len=*), intent(in) :: variable_name !< Variable name.
    class(*),  intent(inout) :: buf !< Array that the data will be read into.
    ...
    if (len(buf) .lt. dimsizes(1))  call error("character buffer is too small; increase length: "//trim(append_error_msg))

I put a print statement before the call error which showed it gets len(buf) = 0 , but it is passed in as a string on length 256.

thomas-robinson commented 2 years ago

This is an issue we have been seeing with the nvhpc and pgi compilers. The string size needs to be passed with the string into a routine where it is a class(*) variable. https://github.com/NOAA-GFDL/FMS/blob/main/fms2_io/include/netcdf_read_data.inc#L27-28 needs to include an argument for a string size in order to work with nvhpc.

bensonr commented 2 years ago

Do we have an issue in with NVidia? If not, could we create one for NVHPC v22.5 and I'll pass it on to our representative.

nikizadehgfdl commented 2 years ago

It seems all these issues happen after updating land_null to use FMS2 (mosaic2 etc). When I revert back land_null to use FMS1 (commit 35654c7) all these issues disappear, even with PGI20 on gaea. So, in that sense switching from pgi20 to nvhpc22.5 does not change anything and I cannot tell if any of the class(*) issues are resolved in the newer compilers. Other than MOM6-SIS2 tests, we do not have a small test case that demonstrates the compiler issues.

Note that atmos_null, MOM6 and SIS2 still use FMS2 (mosaic2) without a problem. It's only land_null that acts weird and had to be reverted.

nikizadehgfdl commented 2 years ago

Same name common blocks with different sizes

For what it's worth, nvhpc problem above in compiling write_restart_bc subroutine is with the following calls on lines 2240 and 2244 of netcdf_io.F90 inside write_restart_bc subroutine:

call register_variable_attribute(fileobj, fileobj%restart_vars(i)%varname, "checksum", &
             fileobj%restart_vars(i)%bc_info%chksum, str_len=len(fileobj%restart_vars(i)%bc_info%chksum))

This may again have to do with class(*) in the 0d and 1d interfaces.

nikizadehgfdl commented 1 year ago

This is still an issue in FMS2023.01_mm with pgi20 on c4.

FATAL from PE     0: calc_mosaic_grid_area: invalid types given. Arguments must be all r4_kind or r8_kind

nvidia/21.3 on c4 is broken. The module loads OK but cannot compile

ftn -Duse_libMPI -Duse_netCDF -DINTERNAL_FILE_NML -g -traceback -DMAXFIELDMETHODS_=400 -Duse_LARGEFILE   -i4 -r8 -byteswapio -Mcray=pointer -Mcray=pointer -Mflushz -Mdaz -D_F2000 -O3 -Mvect=nosse -Mnoscalarsse -Mallocatable=95   -c -I/lustre/f2/dev/Niki.Zadeh/fre/FMS2023.01_mm_mom6_20230109//MOM6_SIS2_compile_FMS2/src/FMS/include /lustre/f2/dev/Niki.Zadeh/fre/FMS2023.01_mm_mom6_20230109//MOM6_SIS2_compile_FMS2/src/FMS/platform/platform.F90
Error:
Unable to determine compiler version.
Make sure that a pgi module is loaded and that PGI_VERS_STR is defined

Setting the variable manually is no help either

ftn -Duse_libMPI -Duse_netCDF -DINTERNAL_FILE_NML -g -traceback -DMAXFIELDMETHODS_=400 -Duse_LARGEFILE   -i4 -r8 -byteswapio -Mcray=pointer -Mcray=pointer -Mflushz -Mdaz -D_F2000 -O3 -Mvect=nosse -Mnoscalarsse -Mallocatable=95   -c -I/lustre/f2/dev/Niki.Zadeh/fre/FMS2023.01_mm_mom6_20230109//MOM6_SIS2_compile_FMS2/src/FMS/include /lustre/f2/dev/Niki.Zadeh/fre/FMS2023.01_mm_mom6_20230109//MOM6_SIS2_compile_FMS2/src/FMS/platform/platform.F90
Child failed to exec: No such file or directory

c5 is not available to test the latest nvhpc/22.3 which I think did not have this issue.

mlee03 commented 1 year ago

The mixedmode updates for mosaic2 is not included in 2023.01_mm but it will be in the next mixedmode testing tag. This problem should be fixed then since the class(*) variables will be removed/replaced.

nikizadehgfdl commented 1 year ago

@mlee03 I tried this with 2023.01_mm and unlike issue #1062 this one is still an issue. The class(*) constructs are still in fms2_io/include/netcdf_read_data.inc under 2023.01_mm! Are they hopefully going to be removed?

mlee03 commented 1 year ago

@nikizadehgfdl Unfortunately we were planning on leaving the classs(*) variables in fms2_io...

NOAA-GFDL / FMS

model runs crash under PGI compiler with FATAL calc_mosaic_grid_area: invalid types given. Arguments must be all r4_kind or r8_kind #1007