**aekiss** opened this issue 4 years ago (status: Open)
ESMF build notes (for v8.0.1) are here: https://esmf-org.github.io/801branch_docs/ESMF_usrdoc/node9.html
We can't upgrade the openmpi version in conda because VDI doesn't have 4.0.2. We're trying to see if we can repackage ESMPY to point to the ESMF version in /g/data/hh5 and see if this would solve the openmpi inconsistency error.
Shouldn't you be setting `ESMF_COMPILER=intel` so that it picks up the correct config (`ESMC_Conf.h`, `ESMF_Conf.inc` and `build_rules.mk`)? I don't think you need to override the f90 compiler and linker.
You could also retry your own build of ESMF using openmpi/4.0.1 instead of 4.0.2.
Thanks, I'll try those suggestions.
FYI I also tried putting

```
module use /g/data/hh5/public/modules
module load conda/analysis3
module load esmf-nuWRF/7.1.0
module unload openmpi
module load openmpi/4.0.2
```

in `/scratch/v45/aek156/bathymetry/tools/025_deg_test/access-om2/tools/make_remap_weights.sh` to force it to override the `openmpi/4.0.1` used in `analysis3`, and it failed differently (I guess that's progress?):
```
20200817 155945.165 INFO PET234 Running with ESMF Version 7.1.0r
20200817 155945.625 ERROR PET234 ESMF_Grid.F90:5222 ESMF_GridCreate Wrong argument specified - - Bad corner array in SCRIP file
20200817 155945.625 ERROR PET234 ESMF_Grid.F90:6480 ESMF_GridCreateFrmScrip Wrong argument specified - Internal subroutine call returned Error
20200817 155945.625 ERROR PET234 ESMF_Grid.F90:6141 ESMF_GridCreateFrmNCFile Wrong argument specified - Internal subroutine call returned Error
20200817 155945.625 ERROR PET234 ESMF_RegridWeightGen.F90:1298 ESMF_RegridWeightGenFile Wrong argument specified - Internal subroutine call returned Error
20200817 155945.626 INFO PET234 Finalizing ESMF
```
Why do you need to `module load conda/analysis3`?
`analysis3` is needed for `import netCDF4` in `make_remap_weights.py`.
I think `analysis3` would be needed for `netCDF4`.
I'm using these modules:
```
1) pbs   2) ncview/2.1.7   3) conda/analysis27-18.10(analysis27)   4) nco/4.9.2   5) openmpi/4.0.2(default)   6) esmf-nuWRF/7.1.0
```
And am getting an error like:
```
[gadi-cpu-clx-0550:1274464:0:1274464] mm_ep.c:168  Fatal: Failed to attach to remote mmid:5474103322476718. Shared memory error
==== backtrace (tid:1274464) ====
```
This may not be the root cause though because the log files are showing:
```
20200817 220019.692 ERROR PET40 ESMF_Grid.F90:5222 ESMF_GridCreate Wrong argument specified - - Bad corner array in SCRIP file
```
I tried compiling ESMF 8.0.1 with openmpi 4.0.1 like so: `/home/156/aek156/github/esmf-org/build_esmf.sh`. I still get
```
[gadi-cpu-clx-2874.gadi.nci.org.au:904230] PMIX ERROR: UNREACHABLE in file ../../../../../../../opal/mca/pmix/pmix3x/pmix/src/server/pmix_server.c at line 2079
[gadi-cpu-clx-2873.gadi.nci.org.au:1114242] PMIX ERROR: UNREACHABLE in file ../../../../../../../opal/mca/pmix/pmix3x/pmix/src/server/pmix_server.c at line 2079
...
ERROR: Problem on processor 270 . Please see the PET*.RegridWeightGen.Log files for a traceback.
ERROR: Problem on processor 234 . Please see the PET*.RegridWeightGen.Log files for a traceback.
ERROR: Problem on processor 252 . Please see the PET*.RegridWeightGen.Log files for a traceback.
```
and the relevant `PET*.Log` files contain
```
20200817 221425.005 ERROR PET270 ESMF_Grid.F90:5303 ESMF_GridCreate Wrong argument specified - - Bad corner array in SCRIP file
20200817 221425.005 ERROR PET270 ESMF_Grid.F90:6800 ESMF_GridCreateFrmScrip Wrong argument specified - Internal subroutine call returned Error
20200817 221425.005 ERROR PET270 ESMF_Grid.F90:6395 ESMF_GridCreateFrmNCFile Wrong argument specified - Internal subroutine call returned Error
20200817 221425.005 ERROR PET270 ESMF_RegridWeightGen.F90:1345 ESMF_RegridWeightGenFile Wrong argument specified - Internal subroutine call returned Error
```
I'm not sure fully how the conversion to SCRIP format is done (or what is really going on in `make_remap_weights.py`), but the `ocean_mosaic.nc` file here, `/g/data/ik11/inputs/access-om2/input_20200530_CHUCKABLE/mom_025deg/ocean_mosaic.nc`, contains faulty contact information. It's suitable for 1 degree but not 0.25. All the processors that are crashing (234, 252, 270) look to be at the western end of the grid at the tripole if an 18x16 decomposition is being used.
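As a quick sanity check on that claim, here is a small Python sketch (illustrative only; it assumes PETs are numbered west-to-east first, then south-to-north, in an 18x16 layout) mapping the crashing PET numbers back to their position in the decomposition:

```python
# Map a PET number to (column, row) in an assumed 18x16 domain decomposition,
# with PETs numbered west-to-east, then south-to-north (0-based).
NX, NY = 18, 16  # PEs in x (east-west) and y (south-north)

def pet_to_ij(pet, nx=NX):
    """Return (column, row) for a PET number, both 0-based."""
    return pet % nx, pet // nx

for pet in (234, 252, 270):
    i, j = pet_to_ij(pet)
    print(pet, "-> column", i, "row", j)

# All three crashing PETs land in column 0 (the western edge) and in the
# top rows (13, 14, 15) of the 16-row layout, i.e. adjacent to the tripole.
```

This is consistent with the failures being confined to domains whose western edge touches the tripole seam.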
Well spotted @russfiedler, but I think the faulty mosaic info is an unrelated issue.
The three resolutions `/g/data/ik11/inputs/access-om2/input_20200530/mom_*deg/ocean_mosaic.nc` have the same `ncdump` output, but the binaries differ (I guess for unimportant reasons). It seems this has been true for a long time, e.g. it is also the case for `/g/data/ik11/inputs/access-om2/input_46fb3d3b/mom_*deg/ocean_mosaic.nc`, which date from 2018, and we were able to make weights back then. Also `make_remap_weights` fails in the same way when I try it with 1 degree files.
Interesting - I'm able to make both weights (patch and conserve) at 1deg and 0.25deg; it is just 0.1deg that has a problem.
ESMF is difficult to debug but I do have some code that writes out the weights files in a more friendly format so I'm going to use that to check things.
@nichannah that's great news - can you show me how you got conserve at 0.25deg to work? That's the case that's urgent right now.
I'm pretty sure that the logic for detecting corners is completely wrong for the tripolar case when the corners are located at the tripole. In the 1 degree case each cell has 2 corners at (-280,65). That means there will always be a match between 2 corners for adjacent points in the j direction. Lines 5278 onward in the latest version. See the first and third block below.
```fortran
! See if it matches nbr to the below
matches=.false.
do j=1,4
   if ((abs(cornerX2D(i,1)-cornerX2D(j,dim1+1))<tol) .and. &
       (abs(cornerY2D(i,1)-cornerY2D(j,dim1+1))<tol)) then
      matches=.true.
      exit
   endif
enddo
if (matches) cycle
```
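To see why this check misfires at the tripole, here is a small Python analogue of the matching loop (an illustrative sketch, not ESMF code; the corner values are rounded from the 1 degree grid output quoted in this thread):

```python
# Python analogue of the Fortran corner-matching loop above (illustrative
# sketch only, not ESMF code). Each cell is a list of (lon, lat) corners;
# at the tripole, two corners of every cell collapse onto the same point.
TOL = 1e-15

def matches_neighbour(cell, nbr, tol=TOL):
    """True if any corner of `cell` coincides with any corner of `nbr`."""
    return any(abs(cx - nx) < tol and abs(cy - ny) < tol
               for (cx, cy) in cell for (nx, ny) in nbr)

# Two adjacent cells near the tripole: both have two corners pinned
# at (-280, 65), the tripole point.
cell_a = [(-280.0, 65.0), (-279.6235, 65.3887), (-279.6556, 65.3939), (-280.0, 65.0)]
cell_b = [(-280.0, 65.0), (-279.6556, 65.3939), (-279.6883, 65.3986), (-280.0, 65.0)]

# The shared degenerate corner guarantees a match, so any orientation
# test based on "no match found" can never trigger for these cells.
print(matches_neighbour(cell_a, cell_b))  # True
```

Because the match always succeeds along the tripole seam, the corner-orientation logic cannot discriminate between correctly and incorrectly oriented cells there.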
```
DATA SET: ./tmpquzxlnom.nc
MOM tripolar
X: 0.5 to 4.5
Y: 0.5 to 2.5
Z: 289.5 to 291.5
Column 1: GLATW is GLAT[J=1:2]
Column 2: GLONW is GLON[J=1:2]
                 GLATW              GLONW
 ---- K:290 Z:   290
 ---- J:1 Y:   1
 1 /  1:   65.0000000000000   -280.000000000000
 2 /  2:   65.3886878432059   -279.623531881642
 3 /  3:   65.3938523137515   -279.655621885413
 4 /  4:   65.0000000000000   -280.000000000000
 ---- J:2 Y:   2
 1 /  1:   65.3886878432059   -279.623531881642
 2 /  2:   65.7717864192223   -279.246866829249
 3 /  3:   65.7819698972261   -279.311058765472
 4 /  4:   65.3938523137515   -279.655621885413
 ---- K:291 Z:   291
 ---- J:1 Y:   1
 1 /  1:   65.0000000000000   -280.000000000000
 2 /  2:   65.3938523137515   -279.655621885413
 3 /  3:   65.3985780727892   -279.688301798966
 4 /  4:   65.0000000000000   -280.000000000000
 ---- J:2 Y:   2
 1 /  1:   65.3938523137515   -279.655621885413
 2 /  2:   65.7819698972261   -279.311058765472
 3 /  3:   65.7912867298377   -279.376432073229
 4 /  4:   65.3985780727892   -279.688301798966
```
If you decompose such that the southernmost row is south of the tripole then the problem is avoided, i.e. use a lot fewer processors, as I can't see an obvious way to set the layout. Alternatively, we could hack the code to move the checks to 1 point east by adding 1 to the second index of all the 2D arrays.
Hmmm, interesting. Was the logic the same in v7.1.0r? We were using v7.1.0r successfully on raijin for a couple of years:
https://github.com/COSIMA/access-om2/blame/master/tools/contrib/build_esmf_on_raijin.sh
so if that worked it seems odd that `esmf-nuWRF/7.1.0` fails on Gadi...
@nichannah what version are you using?
Ok. As suspected, the problem was that checking the orientation of corners failed on domains where the south-east corner lies on the tripole. The corners to the north also lie on the tripole, so there are always some matches and locating the `TopCorner` fails. The trick is to move the check one point to the east, or to make sure that the processor layout contains enough latitudes that the northern domain extends further south than the tripole. This looks to be done automatically, so the only real option is to reduce the number of PEs. This was verified by @aekiss. I guess the old version generated domains differently, or maybe the corner generation was different (not exact to 1e-15 for instance). Maybe the order was tested on `PET=0` and then broadcast, rather than testing on `mod(PetNo,regDecomp(1)) == 0` and broadcasting to the local `PET`s.
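The "move the check one point east" idea can be sketched in Python like this (an illustrative sketch only, assuming the degenerate corners occur only in the first column of the local domain; the function names are hypothetical, not ESMF's):

```python
# Sketch of the workaround: if the corner-orientation check would be
# ambiguous in column 1 (degenerate tripole corners), move the check
# one column east, where the cell has four distinct corners.
TOL = 1e-15

def corners_degenerate(corners, tol=TOL):
    """True if any two corners of a cell coincide (as happens at the tripole)."""
    for a in range(len(corners)):
        for b in range(a + 1, len(corners)):
            if (abs(corners[a][0] - corners[b][0]) < tol and
                    abs(corners[a][1] - corners[b][1]) < tol):
                return True
    return False

def pick_check_column(cells_by_column):
    """Return the first (1-based) column whose cell has distinct corners."""
    for col, corners in enumerate(cells_by_column, start=1):
        if not corners_degenerate(corners):
            return col
    raise ValueError("no non-degenerate column found")

# Column 1 sits on the tripole (two corners pinned at (-280, 65));
# column 2, one point east, is a normal quadrilateral.
tripole_cell = [(-280.0, 65.0), (-279.62, 65.39), (-279.66, 65.39), (-280.0, 65.0)]
normal_cell = [(0.0, 60.0), (1.0, 60.0), (1.0, 61.0), (0.0, 61.0)]

print(pick_check_column([tripole_cell, normal_cell]))  # 2, i.e. one point east
```

The same effect is what Russ describes as adding 1 to the second index of the 2D corner arrays: the orientation test is performed where it is unambiguous.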
I developed a quick fix for v8.0.1 that looks to work for 1deg and 0.25deg, but it may not be entirely compatible with older versions and other grids. I'll clean it up so that it tries the default method first and then attempts the modified method on failure. See `/scratch/v45/raf599/esmf`, branch `grid_gen`.
Awesome sleuthing @russfiedler :-) So it appears this was a longstanding bug in ESMF's `src/Infrastructure/Grid/interface/ESMF_Grid.F90`, but we got lucky on raijin due to the number of CPUs we happened to use. I hit this bug on gadi because I'd increased `ncpus` from 256 to 288 in `make_remap_weights.sh` to suit the increased cores per node.
Hi @russfiedler and @aekiss, this is presumably going to always carry over to the CICE grids and also to the `lat_bnds`, where we have just had issues in CMIP6. Have you tried any runs yet with these new grids, or are you just testing out with these new weights? Should I look elsewhere under GitHub for them?
Hi @ofa001, this is for the atm -> cice coupling. AFAIK the ACCESS-CM model is still using SCRIP, rather than ESMF. If you would like to try ESMF we have tools that can help with that.
For the record, Russ' tripole bug fix to `ESMF_RegridWeightGen` is in https://github.com/COSIMA/esmf/tree/f536c3e
I've put an executable using Russ' tripole bug fix here: `/g/data/ik11/inputs/access-om2/bin/ESMF_RegridWeightGen_f536c3e12d`
This was built with https://github.com/COSIMA/access-om2/blob/eb2dcde1148b84ed7c8a2bc9a1539ec5a42270d1/tools/contrib/build_fixed_esmf_on_gadi.sh
That build script makes an executable that doesn't link properly to `libesmf.so`, so I've replaced `/g/data/ik11/inputs/access-om2/bin/ESMF_RegridWeightGen_f536c3e12d` with a copy of `/scratch/v45/raf599/esmf/apps/appsO/Linux.intel.64.openmpi.default/ESMF_RegridWeightGen`.

A `libesmf.so` suitable for `/g/data/ik11/inputs/access-om2/bin/ESMF_RegridWeightGen_f536c3e12d` is here: `/g/data/ik11/inputs/access-om2/lib/libesmf.so`
We don't have a version of `ESMF_RegridWeightGen` on Gadi that can generate conservative remapping weights. This is needed for updating the 0.25deg land mask: https://github.com/COSIMA/access-om2/issues/210. Gadi doesn't have the `esmf/7.1.0r-intel` module that was on raijin.

ESMF is available via
but the conserve remap calculation fails with
when I run versions of `make_remap_weights.sh` and `make_remap_weights.py` in `/scratch/v45/aek156/bathymetry/tools/025_deg_test/access-om2/tools/` that use `ESMF_RegridWeightGen` from `esmf-nuWRF/7.1.0`. They generate the 0.25 patch file `JRA55_MOM025_patch.nc` but fail when doing the conserve version. I understand this is an MPI problem in `/g/data/hh5/public/apps/esmf-nuWRF/7.1.0/bin/ESMF_RegridWeightGen`. This was compiled with `/apps/openmpi/4.0.2/lib/libmpi_cxx.so.40 (0x00007fe087a54000)`, which is the default `openmpi/4.0.2` version picked up with `module load openmpi`.

I've also downloaded and built the latest (8.0.1) ESMF here: `/home/156/aek156/github/esmf-org/esmf`. I built this with
When I ran `make_remap_weights.sh` with `/home/156/aek156/github/esmf-org/esmf/apps/appsO/Linux.gfortran.64.openmpi.default/ESMF_RegridWeightGen` for 0.25 deg this worked for the patch weights but failed for the conserve weights with the same error as above. Building with
didn't help.

I have also tried following the instructions in https://github.com/COSIMA/access-om2/wiki/Technical-documentation#creating-remapping-weights with a new build script `/home/156/aek156/github/COSIMA/access-om2/tools/contrib/build_esmf_on_gadi.sh` based on `build_esmf_on_raijin.sh`, but I've been unable to get it to compile. It fails with multiple `'ESMCI_FortranStrLenArg' has not been declared` errors such as
I've tried gcc and different versions of the intel compilers to no avail.