COSIMA / access-om2

ACCESS-OM2 global ocean - sea ice coupled model configurations.

ESMF_RegridWeightGen not working for conservative fields #216

Open aekiss opened 4 years ago

aekiss commented 4 years ago

We don't have a version of ESMF_RegridWeightGen on Gadi that can generate conservative remapping weights. This is needed for updating the 0.25deg land mask: https://github.com/COSIMA/access-om2/issues/210.

Gadi doesn't have the esmf/7.1.0r-intel module that was on raijin.

ESMF is available via

module use /g/data/hh5/public/modules
module load esmf-nuWRF/7.1.0

but the conserve remap calculation fails with

[gadi-cpu-clx-2710.gadi.nci.org.au:1054756] PMIX ERROR: UNREACHABLE in file ../../../../../../../opal/mca/pmix/pmix3x/pmix/src/server/pmix_server.c at line 2079

when I run versions of make_remap_weights.sh and make_remap_weights.py in /scratch/v45/aek156/bathymetry/tools/025_deg_test/access-om2/tools/ that use ESMF_RegridWeightGen from esmf-nuWRF/7.1.0. They generate the 0.25 patch file JRA55_MOM025_patch.nc but fail when doing the conserve version. I understand this is an MPI problem in /g/data/hh5/public/apps/esmf-nuWRF/7.1.0/bin/ESMF_RegridWeightGen. This was compiled with /apps/openmpi/4.0.2/lib/libmpi_cxx.so.40 (0x00007fe087a54000), which is the default openmpi/4.0.2 version picked up with module load openmpi.

I've also downloaded and built the latest (8.0.1) ESMF here: /home/156/aek156/github/esmf-org/esmf. I built this with

cd /home/156/aek156/github/esmf-org/esmf
export ESMF_DIR=`pwd`
export ESMF_NETCDF=nc-config
export ESMF_COMM=openmpi
module load gcc
module load netcdf
module load openmpi
gmake clobber
gmake
cd /home/156/aek156/github/esmf-org/esmf/src/apps/ESMF_RegridWeightGen
gmake

When I ran make_remap_weights.sh with /home/156/aek156/github/esmf-org/esmf/apps/appsO/Linux.gfortran.64.openmpi.default/ESMF_RegridWeightGen for 0.25 deg this worked for the patch weights but failed for the conserve weights with the same error as above. Building with

export ESMF_PIO=OFF

didn't help.

I have also tried following the instructions in https://github.com/COSIMA/access-om2/wiki/Technical-documentation#creating-remapping-weights with a new build script /home/156/aek156/github/COSIMA/access-om2/tools/contrib/build_esmf_on_gadi.sh based on build_esmf_on_raijin.sh, but I've been unable to get it to compile. It fails with multiple "‘ESMCI_FortranStrLenArg’ has not been declared" errors such as

/home/156/aek156/github/COSIMA/access-om2/tools/contrib/esmf/src/include/ESMCI_LogErr.h:146:29: error: ‘ESMCI_FortranStrLenArg’ has not been declared

I've tried gcc and different versions of the intel compilers to no avail.

aekiss commented 4 years ago

ESMF build notes (for v8.0.1) are here: https://esmf-org.github.io/801branch_docs/ESMF_usrdoc/node9.html

ccarouge commented 4 years ago

We can't upgrade the openmpi version in conda because VDI doesn't have 4.0.2. We're trying to see if we can repackage ESMPY to point to the ESMF version in /g/data/hh5 and see if this would solve the openmpi inconsistency error.

russfiedler commented 4 years ago

Shouldn't you be setting ESMF_COMPILER=intel so that it picks up the correct config files ESMC_Conf.h, ESMF_Conf.inc and build_rules.mk? I don't think you need to override the f90 compiler and linker.

ccarouge commented 4 years ago

You could also retry your own build of ESMF using openmpi/4.0.1 instead of 4.0.2.

aekiss commented 4 years ago

Thanks, I'll try those suggestions.

FYI I also tried putting

module use /g/data/hh5/public/modules
module load conda/analysis3
module load esmf-nuWRF/7.1.0
module unload openmpi
module load openmpi/4.0.2

in /scratch/v45/aek156/bathymetry/tools/025_deg_test/access-om2/tools/make_remap_weights.sh to force it to override openmpi/4.0.1 used in analysis3 and it failed differently (I guess that's progress?):

20200817 155945.165 INFO             PET234 Running with ESMF Version 7.1.0r
20200817 155945.625 ERROR            PET234 ESMF_Grid.F90:5222 ESMF_GridCreate Wrong argument specified  - - Bad corner array in SCRIP file
20200817 155945.625 ERROR            PET234 ESMF_Grid.F90:6480 ESMF_GridCreateFrmScrip Wrong argument specified  - Internal subroutine call returned Error
20200817 155945.625 ERROR            PET234 ESMF_Grid.F90:6141 ESMF_GridCreateFrmNCFile Wrong argument specified  - Internal subroutine call returned Error
20200817 155945.625 ERROR            PET234 ESMF_RegridWeightGen.F90:1298 ESMF_RegridWeightGenFile Wrong argument specified  - Internal subroutine call returned Error
20200817 155945.626 INFO             PET234 Finalizing ESMF
aidanheerdegen commented 4 years ago

Why do you need to load module load conda/analysis3?

aekiss commented 4 years ago

analysis3 is needed for import netCDF4 in make_remap_weights.py
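A minimal sketch of checking that requirement (the helper name `have_module` is mine, not part of make_remap_weights.py):

```python
# Check that the active environment can satisfy the `import netCDF4`
# that make_remap_weights.py performs, without actually importing it.
import importlib.util

def have_module(name):
    """True if `name` is importable in the current environment."""
    return importlib.util.find_spec(name) is not None

print(have_module("netCDF4"))
```

Under conda/analysis3 this should print True; under a bare environment it explains the ImportError.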

nichannah commented 4 years ago

I think the analysis3 would be needed for netcdf4.

I'm using these modules:

1) pbs
2) ncview/2.1.7
3) conda/analysis27-18.10(analysis27)
4) nco/4.9.2
5) openmpi/4.0.2(default)
6) esmf-nuWRF/7.1.0

And am getting an error like:

[gadi-cpu-clx-0550:1274464:0:1274464] mm_ep.c:168 Fatal: Failed to attach to remote mmid:5474103322476718. Shared memory error ==== backtrace (tid:1274464) ====

This may not be the root cause though because the log files are showing:

20200817 220019.692 ERROR PET40 ESMF_Grid.F90:5222 ESMF_GridCreate Wrong argument specified - - Bad corner array in SCRIP file

aekiss commented 4 years ago

I tried compiling ESMF 8.0.1 with openmpi 4.0.1 like so: /home/156/aek156/github/esmf-org/build_esmf.sh. I still get

[gadi-cpu-clx-2874.gadi.nci.org.au:904230] PMIX ERROR: UNREACHABLE in file ../../../../../../../opal/mca/pmix/pmix3x/pmix/src/server/pmix_server.c at line 2079
[gadi-cpu-clx-2873.gadi.nci.org.au:1114242] PMIX ERROR: UNREACHABLE in file ../../../../../../../opal/mca/pmix/pmix3x/pmix/src/server/pmix_server.c at line 2079
...
ERROR: Problem on processor          270
 . Please see the PET*.RegridWeightGen.Log files for a traceback.
ERROR: Problem on processor          234
 . Please see the PET*.RegridWeightGen.Log files for a traceback.
ERROR: Problem on processor          252
 . Please see the PET*.RegridWeightGen.Log files for a traceback.

and the relevant PET*.Log files contain

20200817 221425.005 ERROR            PET270 ESMF_Grid.F90:5303 ESMF_GridCreate Wrong argument specified  - - Bad corner array in SCRIP file
20200817 221425.005 ERROR            PET270 ESMF_Grid.F90:6800 ESMF_GridCreateFrmScrip Wrong argument specified  - Internal subroutine call returned Error
20200817 221425.005 ERROR            PET270 ESMF_Grid.F90:6395 ESMF_GridCreateFrmNCFile Wrong argument specified  - Internal subroutine call returned Error
20200817 221425.005 ERROR            PET270 ESMF_RegridWeightGen.F90:1345 ESMF_RegridWeightGenFile Wrong argument specified  - Internal subroutine call returned Error
russfiedler commented 4 years ago

I'm not sure exactly how the conversion to SCRIP format is done (or what is really going on in make_remap_weights.py), but the ocean_mosaic.nc file here /g/data/ik11/inputs/access-om2/input_20200530_CHUCKABLE/mom_025deg/ocean_mosaic.nc contains faulty contact information. It's suitable for 1 degree but not 0.25. All the processors that are crashing (234, 252, 270) look to be at the western end of the grid at the tripole if an 18x16 decomposition is being used.
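The arithmetic behind that observation can be checked directly, assuming a row-major rank layout with 18 PEs in x and 16 in y (rank = row*18 + column):

```python
# The failing PETs (234, 252, 270) all sit in the westernmost column
# (column 0) of the three topmost rows of an 18x16 (288-PE) decomposition,
# i.e. the domains that touch the tripole at the western end of the grid.
nx, ny = 18, 16
for rank in (234, 252, 270):
    col, row = rank % nx, rank // nx
    print(rank, "-> column", col, "row", row)
# 234 -> column 0 row 13; 252 -> column 0 row 14; 270 -> column 0 row 15
```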

aekiss commented 4 years ago

Well spotted @russfiedler, but I think the faulty mosaic info is an unrelated issue.

The three resolutions /g/data/ik11/inputs/access-om2/input_20200530/mom_*deg/ocean_mosaic.nc have the same ncdump output, but the binaries differ (I guess for unimportant reasons).
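That comparison can be sketched like this (the helper names are mine; the header check needs the `ncdump` utility on PATH):

```python
# Compare two netCDF files at two levels: identical ncdump text rendering
# (same headers and data) but different raw bytes ("unimportant reasons",
# e.g. creation metadata or internal layout).
import hashlib
import subprocess

def ncdump_text(path):
    """Textual rendering of a netCDF file via the ncdump utility."""
    return subprocess.run(["ncdump", path], capture_output=True,
                          text=True, check=True).stdout

def sha256_file(path):
    """Checksum of the raw bytes, to detect binary-level differences."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def same_text_different_bytes(a, b):
    """True when two files dump identically but differ byte-for-byte."""
    return ncdump_text(a) == ncdump_text(b) and sha256_file(a) != sha256_file(b)
```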

It seems this has been true for a long time, e.g. it is also the case for /g/data/ik11/inputs/access-om2/input_46fb3d3b/mom_*deg/ocean_mosaic.nc which date from 2018, and we were able to make weights back then.

Also make_remap_weights fails in the same way when I try it with 1 degree files.

nichannah commented 4 years ago

Interesting - I'm able to make both weights (patch and conserve) at 1deg and 0.25deg; it is just 0.1deg that has a problem.

ESMF is difficult to debug but I do have some code that writes out the weights files in a more friendly format so I'm going to use that to check things.

aekiss commented 4 years ago

@nichannah that's great news - can you show me how you got conserve at 0.25deg to work? That's the case that's urgent right now.

russfiedler commented 4 years ago

I'm pretty sure that the logic for detecting corners is completely wrong for the tripolar case when the corners are located at the tripole. In the 1 degree case each cell has 2 corners at (-280,65). That means there will always be a match between 2 corners for adjacent points in the j direction. Lines 5278 onward in the latest version. See the first and third blocks below.


! See if it matches nbr to the below
matches=.false.
do j=1,4
   if ((abs(cornerX2D(i,1)-cornerX2D(j,dim1+1))<tol) .and. &
       (abs(cornerY2D(i,1)-cornerY2D(j,dim1+1))<tol)) then
      matches=.true.
      exit
   endif
enddo
if (matches) cycle

         DATA SET: ./tmpquzxlnom.nc
         MOM tripolar
         X: 0.5 to 4.5
         Y: 0.5 to 2.5
         Z: 289.5 to 291.5

         Column 1: GLATW is GLAT[J=1:2]
         Column 2: GLONW is GLON[J=1:2]
                      GLATW             GLONW
          ---- K:290 Z: 290 ---- J:1 Y: 1
          1 / 1:  65.0000000000000  -280.000000000000
          2 / 2:  65.3886878432059  -279.623531881642
          3 / 3:  65.3938523137515  -279.655621885413
          4 / 4:  65.0000000000000  -280.000000000000
          ---- J:2 Y: 2
          1 / 1:  65.3886878432059  -279.623531881642
          2 / 2:  65.7717864192223  -279.246866829249
          3 / 3:  65.7819698972261  -279.311058765472
          4 / 4:  65.3938523137515  -279.655621885413
          ---- K:291 Z: 291 ---- J:1 Y: 1
          1 / 1:  65.0000000000000  -280.000000000000
          2 / 2:  65.3938523137515  -279.655621885413
          3 / 3:  65.3985780727892  -279.688301798966
          4 / 4:  65.0000000000000  -280.000000000000
          ---- J:2 Y: 2
          1 / 1:  65.3938523137515  -279.655621885413
          2 / 2:  65.7819698972261  -279.311058765472
          3 / 3:  65.7912867298377  -279.376432073229
          4 / 4:  65.3985780727892  -279.688301798966

If you decompose such that the southernmost row of the northern domains is south of the tripole then the problem is avoided, i.e. use a lot fewer processors, as I can't see an obvious way to set the layout. Alternatively, we could hack the code to move the checks one point to the east by adding 1 to the second index of all the 2D arrays.
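The failure mode can be rendered in Python using the corner values from the listing above (the match test mirrors the Fortran loop; `tol` and the helper name are mine):

```python
# Two i-adjacent cells in the J:1 row of the listing: both have corners
# 1 and 4 pinned at the tripole point, so they always share a corner.
tol = 1e-10
tripole = (65.0, -280.0)
cell_290 = [tripole, (65.3886878432059, -279.623531881642),
            (65.3938523137515, -279.655621885413), tripole]
cell_291 = [tripole, (65.3938523137515, -279.655621885413),
            (65.3985780727892, -279.688301798966), tripole]

def matches(corner, cell):
    """Python version of the Fortran corner-match loop: does `corner`
    coincide (within tol) with any corner of `cell`?"""
    return any(abs(corner[0] - c[0]) < tol and abs(corner[1] - c[1]) < tol
               for c in cell)

# Because the tripole corner is shared, a match is always found, so the
# orientation check can never use "no match" to locate the top corner:
print(matches(cell_290[0], cell_291))  # True
```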

aekiss commented 4 years ago

Hmmm, interesting. Was the logic the same in v7.1.0r? We were using v7.1.0r successfully on raijin for a couple of years: https://github.com/COSIMA/access-om2/blame/master/tools/contrib/build_esmf_on_raijin.sh so if that worked it seems odd that esmf-nuWRF/7.1.0 fails on Gadi....

@nichannah what version are you using?

russfiedler commented 4 years ago

Ok. As suspected, the problem was that the check on the orientation of corners fails on domains where the south-east corner lies on the tripole. The corners to the north also lie on the tripole, so there are always some matches and locating the TopCorner fails. The trick is to move the check one point to the east, or to make sure that the processor layout contains enough latitudes that the northern domains extend further south than the tripole. The layout looks to be chosen automatically, so the only real option is to reduce the number of PEs. This was verified by @aekiss. I guess the old version generated domains differently, or maybe the corner generation was different (not exact to 1e-15, for instance). Maybe the order was tested on PET=0 and then broadcast, rather than testing on mod(PetNo,regDecomp(1)) == 0 and broadcasting to the local PETs.

I developed a quick fix for v8.0.1 that looks to work for 1deg and 0.25deg but may not be entirely compatible with older versions and other grids. I'll clean it up so that it tries the default method first and then attempts the modified method on failure.

See /scratch/v45/raf599/esmf branch grid_gen
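The default-then-modified strategy can be sketched generically (illustrative only, not the actual ESMF patch; both callables are hypothetical stand-ins for the two corner-detection code paths):

```python
def with_fallback(default_method, modified_method):
    """Try the default corner-detection method first; if it raises,
    retry with the modified (shifted-one-point-east) method."""
    try:
        return default_method()
    except Exception:
        return modified_method()
```

Usage would look like `with_fallback(detect_corners_default, detect_corners_shifted)`, so grids unaffected by the tripole degeneracy keep the original behaviour.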

aekiss commented 4 years ago

Awesome sleuthing @russfiedler :-) So it appears this was a longstanding bug in ESMF's src/Infrastructure/Grid/interface/ESMF_Grid.F90 but we got lucky on raijin due to the number of cpus we happened to use. I hit this bug on gadi because I'd increased ncpus from 256 to 288 in make_remap_weights.sh to suit the increased cores per node.

ofa001 commented 4 years ago

Hi @russfiedler and @aekiss, this is presumably going to carry over to the CICE grids and also to the lat_bnds, where we have just had issues in CMIP6. Have you tried any runs yet with these new grids, or just testing with these new weights? Should I look elsewhere under GitHub for them?

nichannah commented 4 years ago

Hi @ofa001, this is for the atm -> cice coupling. AFAIK the ACCESS-CM model is still using SCRIP, rather than ESMF. If you would like to try ESMF we have tools that can help with that.

aekiss commented 3 years ago

For the record, Russ' tripole bug fix to ESMF_RegridWeightGen is in https://github.com/COSIMA/esmf/tree/f536c3e

aekiss commented 2 years ago

I've put an executable using Russ' tripole bug fix here: /g/data/ik11/inputs/access-om2/bin/ESMF_RegridWeightGen_f536c3e12d

This was built with https://github.com/COSIMA/access-om2/blob/eb2dcde1148b84ed7c8a2bc9a1539ec5a42270d1/tools/contrib/build_fixed_esmf_on_gadi.sh

aekiss commented 2 years ago

That build script makes an executable that doesn't link properly to libesmf.so, so I've replaced /g/data/ik11/inputs/access-om2/bin/ESMF_RegridWeightGen_f536c3e12d with a copy of /scratch/v45/raf599/esmf/apps/appsO/Linux.intel.64.openmpi.default/ESMF_RegridWeightGen

aekiss commented 1 year ago

a libesmf.so suitable for /g/data/ik11/inputs/access-om2/bin/ESMF_RegridWeightGen_f536c3e12d is here /g/data/ik11/inputs/access-om2/lib/libesmf.so