COSIMA / libaccessom2

ACCESS-OM2 library

Tests working on gadi #36

Open nichannah opened 4 years ago

nichannah commented 4 years ago

Do https://github.com/COSIMA/access-om2/issues/182 for libaccessom2 tests

aekiss commented 4 years ago

@nichannah I get a segfault running tests with a9e2883

export LIBACCESSOM2_DIR=$(pwd)
module load openmpi
cd tests/
./copy_test_data_from_gadi.sh
cd JRA55_IAF
rm -rf log ; mkdir log ; rm -f accessom2_restart_datetime.nml ; cp ../test_data/i2o.nc ./ ; cp ../test_data/o2i.nc ./
mpirun -np 1 $LIBACCESSOM2_DIR/build/bin/yatm.exe : -np 1 $LIBACCESSOM2_DIR/build/bin/ice_stub.exe : -np 1 $LIBACCESSOM2_DIR/build/bin/ocean_stub.exe

yields

 YATM_COMMIT_HASH=575fb04771e5442e19654c27b183e92d8b205f3f
 OCEAN_STUB_COMMIT_HASH=575fb04771e5442e19654c27b183e92d8b205f3f
 ICE_STUB_COMMIT_HASH=575fb04771e5442e19654c27b183e92d8b205f3f
 mom5xx: LIBACCESSOM2_COMMIT_HASH=575fb04771e5442e19654c27b183e92d8b205f3f
 cicexx: LIBACCESSOM2_COMMIT_HASH=575fb04771e5442e19654c27b183e92d8b205f3f
 matmxx: LIBACCESSOM2_COMMIT_HASH=575fb04771e5442e19654c27b183e92d8b205f3f
[gadi-login-04:6441 :0:6441] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7ffdb447bbc0)
==== backtrace (tid:   6441) ====
 0 0x0000000000012d80 .annobin_sigaction.c()  sigaction.c:0
 1 0x000000000069f46b m_attrvect_mp_sort__.V()  /home/156/aek156/github/COSIMA/libaccessom2/build/oasis3-mct-prefix/src/oasis3-mct/Linux/build/lib/mctdir/mct/m_AttrVect.F90:3455
 2 0x000000000062f1f0 sort_()  /home/156/aek156/github/COSIMA/libaccessom2/build/oasis3-mct-prefix/src/oasis3-mct/Linux/build/lib/mctdir/mct/m_SparseMatrix.F90:2637
 3 0x000000000062f1f0 sortpermute_()  /home/156/aek156/github/COSIMA/libaccessom2/build/oasis3-mct-prefix/src/oasis3-mct/Linux/build/lib/mctdir/mct/m_SparseMatrix.F90:2750
 4 0x00000000006341b5 m_sparsematrixtomaps_mp_sparsematrixtoxglobalsegmap__()  /home/156/aek156/github/COSIMA/libaccessom2/build/oasis3-mct-prefix/src/oasis3-mct/Linux/build/lib/mctdir/mct/m_SparseMatrixToMaps.F90:150
 5 0x00000000006338ab m_sparsematrixplus_mp_initdistributed__()  /home/156/aek156/github/COSIMA/libaccessom2/build/oasis3-mct-prefix/src/oasis3-mct/Linux/build/lib/mctdir/mct/m_SparseMatrixPlus.F90:516
 6 0x00000000005956c0 mod_oasis_coupler_mp_oasis_coupler_setup_.V()  /home/156/aek156/github/COSIMA/libaccessom2/build/oasis3-mct-prefix/src/oasis3-mct/lib/psmile/src/mod_oasis_coupler.F90:943
 7 0x000000000044b023 mod_oasis_method_mp_oasis_enddef_.V()  /home/156/aek156/github/COSIMA/libaccessom2/build/oasis3-mct-prefix/src/oasis3-mct/lib/psmile/src/mod_oasis_method.F90:741
 8 0x000000000041f3d1 coupler_mod_mp_coupler_init_end_()  /home/156/aek156/github/COSIMA/libaccessom2/libcouple/src/coupler.F90:149
 9 0x000000000040e74c MAIN__.V()  /home/156/aek156/github/COSIMA/libaccessom2/ice_stub/src/ice.F90:109
10 0x000000000040ce22 main()  ???:0
11 0x0000000000023813 __libc_start_main()  ???:0
12 0x000000000040cd2e _start()  ???:0
=================================
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source             
yatm.exe           00000000007F8834  Unknown               Unknown  Unknown
libpthread-2.28.s  00007F363EAA5D80  Unknown               Unknown  Unknown
mca_pml_ucx.so     00007F36251EFE20  Unknown               Unknown  Unknown
mca_pml_ucx.so     00007F36251F10F3  mca_pml_ucx_recv      Unknown  Unknown
libmpi.so.40.20.2  00007F363F0A86CD  MPI_Recv              Unknown  Unknown
libmpi_mpifh.so    00007F363F386A10  pmpi_recv_            Unknown  Unknown
yatm.exe           00000000006F66FD  Unknown               Unknown  Unknown
yatm.exe           000000000066D8DA  Unknown               Unknown  Unknown
yatm.exe           00000000005E5AF2  mod_oasis_coupler        1055  mod_oasis_coupler.F90
yatm.exe           000000000049A8B3  mod_oasis_method_         741  mod_oasis_method.F90
yatm.exe           00000000004418E1  coupler_mod_mp_co         149  coupler.F90
yatm.exe           000000000040EF77  MAIN__.V                  108  atm.F90
yatm.exe           000000000040D5E2  Unknown               Unknown  Unknown
libc-2.28.so       00007F363E4EE813  __libc_start_main     Unknown  Unknown
yatm.exe           000000000040D4EE  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source             
ocean_stub.exe     0000000000797A24  Unknown               Unknown  Unknown
libpthread-2.28.s  00007F2CC137BD80  Unknown               Unknown  Unknown
mca_pml_ucx.so     00007F2CAC0DF0F3  mca_pml_ucx_recv      Unknown  Unknown
libmpi.so.40.20.2  00007F2CC197E6CD  MPI_Recv              Unknown  Unknown
libmpi_mpifh.so    00007F2CC1C5CA10  pmpi_recv_            Unknown  Unknown
ocean_stub.exe     00000000006A226D  Unknown               Unknown  Unknown
ocean_stub.exe     000000000061944A  Unknown               Unknown  Unknown
ocean_stub.exe     0000000000591662  mod_oasis_coupler        1055  mod_oasis_coupler.F90
ocean_stub.exe     0000000000446043  mod_oasis_method_         741  mod_oasis_method.F90
ocean_stub.exe     000000000041A3F1  coupler_mod_mp_co         149  coupler.F90
ocean_stub.exe     000000000040E539  MAIN__.V                   78  ocean.F90
ocean_stub.exe     000000000040CE22  Unknown               Unknown  Unknown
libc-2.28.so       00007F2CC0DC4813  __libc_start_main     Unknown  Unknown
ocean_stub.exe     000000000040CD2E  Unknown               Unknown  Unknown
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node gadi-login-04 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
russfiedler commented 4 years ago

This is the sort of message that I was getting in my ports of CM4 etc. My guess is that a temporary array is being created for aV%iAttr(iIndex(n),:). Try setting -heap-arrays or -heap-arrays 10 when compiling to put temporaries on the heap rather than the stack.
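
For what it's worth, here is roughly how I'd wire that in (a sketch only: it assumes the usual CMake-driven libaccessom2 build and the Intel Fortran compiler; the ulimit line is the no-recompile first check):

# quick check without recompiling: raise the stack limit for this shell
ulimit -s unlimited
# or rebuild with temporaries forced onto the heap (Intel Fortran flag)
cd $LIBACCESSOM2_DIR/build
cmake -DCMAKE_Fortran_FLAGS="-heap-arrays 10" ..
make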

aekiss commented 4 years ago

Thanks - I just tried -heap-arrays 10 but got the same error.

russfiedler commented 4 years ago

Try just -heap-arrays to put them all on the heap.

aekiss commented 4 years ago

I just tried -heap-arrays - still no luck

russfiedler commented 4 years ago

Seems to be working fine for me on the express queue without having to invoke -heap-arrays.

Hang on, it's just crashed at the end in the ocean stub with a heap of warnings like

[1580886787.462249] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x8433000 was not returned to mpool ucp_am_bufs
[1580886787.462270] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x8435080 was not returned to mpool ucp_am_bufs
[1580886787.462273] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x1460615b8040 was not returned to mpool ucp_am_bufs
[1580886787.462275] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x1460615ba0c0 was not returned to mpool ucp_am_bufs

 0 0x0000000000051959 ucs_fatal_error_message()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/ucs/../../../src/ucs/debug/assert.c:36
 1 0x0000000000051a36 ucs_fatal_error_format()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/ucs/../../../src/ucs/debug/assert.c:52
 2 0x00000000000562f0 ucs_mem_region_destroy_internal()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/ucs/../../../src/ucs/memory/rcache.c:200
 3 0x000000000005c6c6 ucs_class_call_cleanup_chain()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/ucs/../../../src/ucs/type/class.c:52
 4 0x0000000000056f38 ucs_rcache_destroy()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/ucs/../../../src/ucs/memory/rcache.c:729
 5 0x00000000000030f2 uct_knem_md_close()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/uct/sm/knem/../../../../../src/uct/sm/knem/knem_md.c:91
 6 0x000000000000f1c9 ucp_free_resources()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/ucp/../../../src/ucp/core/ucp_context.c:710
 7 0x000000000000f1c9 ucp_cleanup()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/ucp/../../../src/ucp/core/ucp_context.c:1266
 8 0x0000000000005bcc mca_pml_ucx_close()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/gcc-opt/ompi/mca/pml/ucx/../../../../../ompi/mca/pml/ucx/pml_ucx.c:247
 9 0x0000000000007909 mca_pml_ucx_component_close()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/gcc-opt/ompi/mca/pml/ucx/../../../../../ompi/mca/pml/ucx/pml_ucx_component.c:82
10 0x00000000000582b9 mca_base_component_close()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/gcc-opt/opal/mca/base/../../../../opal/mca/base/mca_base_components_close.c:53
11 0x0000000000058345 mca_base_components_close()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/gcc-opt/opal/mca/base/../../../../opal/mca/base/mca_base_components_close.c:85
12 0x0000000000058345 mca_base_components_close()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/gcc-opt/opal/mca/base/../../../../opal/mca/base/mca_base_components_close.c:86
13 0x00000000000621da mca_base_framework_close()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/gcc-opt/opal/mca/base/../../../../opal/mca/base/mca_base_framework.c:216
14 0x000000000004f479 ompi_mpi_finalize()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/gcc-opt/ompi/../../ompi/runtime/ompi_mpi_finalize.c:363
15 0x000000000004ac29 ompi_finalize_f()  /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/intel-opt/ompi/mpi/fortran/mpif-h/profile/pfinalize_f.c:71
16 0x0000000000418cb0 accessom2_mod_mp_accessom2deinit()  /scratch/p93/raf599/cosima/gaditest/libaccessom2/libcouple/src/accessom2.F90:839
17 0x000000000040ec0a MAIN__.V()  /scratch/p93/raf599/cosima/gaditest/libaccessom2/ocean_stub/src/ocean.F90:114
18 0x000000000040ce22 main()  ???:0
19 0x0000000000023813 __libc_start_main()  ???:0
20 0x000000000040cd2e _start()  ???:0

I also found this in the thousands of messages. A warning in rcache.c and a failed assertion which matches the trace.

[1580886787.458225] [gadi-cpu-clx-2901:94690:0] rcache.c:360 UCX WARN knem rcache device: destroying inuse region 0x1c85a20 [0x1d56c00..0x1e29b00] g- rw ref 1 cookie 10351893497382213308 addr 0x1d56c00
[1580886787.458245] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x887e080 was not returned to mpool ucp_am_bufs
[1580886787.458248] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x8880100 was not returned to mpool ucp_am_bufs
[1580886787.458250] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x8882180 was not returned to mpool ucp_am_bufs
[1580886787.458263] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x8884200 was not returned to mpool ucp_am_bufs
[1580886787.458267] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x8886280 was not returned to mpool ucp_am_bufs
[1580886787.458270] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x8888300 was not returned to mpool ucp_am_bufs
[gadi-cpu-clx-2901:94690:0:94690] rcache.c:200 Assertion `region->refcount == 0' failed
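
If that knem rcache teardown is the culprit, one quick experiment is to take knem out of the UCX transport list and rerun (a sketch: UCX_TLS is UCX's standard transport-selection variable, and a leading ^ excludes the transports listed after it):

# exclude the knem shared-memory transport for this run only
export UCX_TLS=^knem
# then rerun the mpirun command from the top of this issue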

aekiss commented 4 years ago

Interesting. Do you think that's a related problem or something else?

aidanheerdegen commented 4 years ago

The problem seems to be a missing remap weights file.

 oasis_coupler_setup DEBUG ci:read mapfile                                                                            
 ../test_data/rmp_jra55_cice_conserve.nc                                                                              
 oasis_coupler_setup DEBUG ci: inquire mapfile                                                                        
 ../test_data/rmp_jra55_cice_conserve.nc F                                                                            
 ----**** ENTER oasis_coupler_genmap                                                                                  
 ------**** ENTER oasis_io_read_field_fromroot                                                                        
 oasis_io_read_field_fromroot ERROR: in filename grids.nc                                                             
 oasis_io_read_field_fromroot abort by model :           2  proc :           0                                        

This file doesn't exist:

../test_data/rmp_jra55_cice_conserve.nc

It should really give a more informative error message than that.

I've tried a couple of other remapping files, but they appear to be the wrong size.

Anyone know where that file is, or how the namcouple should be altered to be consistent with the weights files that are there?
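
In the meantime, one way to narrow it down is to check the grid sizes recorded in each candidate weights file (a sketch: it assumes the rmp_*.nc files carry the usual SCRIP-style grid size/dims metadata):

# print the source/destination grid dimensions recorded in each weights file
for f in ../test_data/rmp_*.nc; do
    echo "$f"
    ncdump -h "$f" | grep -E 'grid_(size|dims)'
done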

russfiedler commented 4 years ago

Isn't the problem that the copy_test_data_from_gadi.sh script copies rmp_jrar_to_cict_CONSERV.nc into the test_data directory, but the namcouple file references rmp_jra55_cice_conserve.nc? I can't see a link or renaming being done.
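
If so, a quick workaround would be to link the fetched file to the expected name (a sketch; it assumes the test resolves the name relative to test_data, as in the log above):

cd tests/test_data
ln -s rmp_jrar_to_cict_CONSERV.nc rmp_jra55_cice_conserve.nc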

russfiedler commented 4 years ago

Here? /g/data/ik11/inputs/access-om2/input_08022019/common_1deg_jra55. These files map from a 640x320 grid to a 360x300 grid, and have both the conserve and smooth versions required by the namcouple file. They look like matches for the second-order and patch files in /g/data/ik11/inputs/access-om2/input_rc/common_1deg_jra55.

aekiss commented 4 years ago

Would it be a good idea to use the latest set of inputs, from /g/data/ik11/inputs/access-om2/input_20200530? Some of the weights files have been renamed, though.

aidanheerdegen commented 4 years ago

@russfiedler the file named rmp_jrar_to_cict_CONSERV.nc doesn't work either. I get this for all the remap files I have tried:

 MCT::m_ExchangeMaps::ExGSMapGSMap_:: MCTERROR, Grid Size mismatch                                                 
 LocalMap Gsize =       204800  RemoteMap Gsize =      1036800                                                     
MCT::m_ExchangeMaps::ExGSMapGSMap_: Map Grid Size mismatch error, stat =3                                          
 MCT::m_ExchangeMaps::ExGSMapGSMap_:: MCTERROR, Grid Size mismatch                                                 
 LocalMap Gsize =      1036800  RemoteMap Gsize =       204800                                                     
MCT::m_ExchangeMaps::ExGSMapGSMap_: Map Grid Size mismatch error, stat =3

I assume this is, as it says, a grid size mismatch (for reference, 204800 = 640 x 320, the JRA55 grid), and that this remap file is incompatible with the namcouple.

aidanheerdegen commented 4 years ago

Tried /g/data/ik11/inputs/access-om2/input_08022019/common_1deg_jra55/common_1deg_jra55/rmp_jra55_cice_conserve.nc. Also doesn't work. Same error as above.

russfiedler commented 4 years ago

Lysdexia rules! Try /g/data/ik11/inputs/access-om2/input_08022019/common_1deg_jra55/rmp_jra55_cice_smooth.nc; common_1deg_jra55 is repeated in your path above.

aidanheerdegen commented 4 years ago

Same problem:

 MCT::m_ExchangeMaps::ExGSMapGSMap_:: MCTERROR, Grid Size mismatch 
 LocalMap Gsize =      1036800  RemoteMap Gsize =       108000
MCT::m_ExchangeMaps::ExGSMapGSMap_: Map Grid Size mismatch error, stat =3
 MCT::m_ExchangeMaps::ExGSMapGSMap_:: MCTERROR, Grid Size mismatch 
 LocalMap Gsize =       108000  RemoteMap Gsize =      1036800
MCT::m_ExchangeMaps::ExGSMapGSMap_: Map Grid Size mismatch error, stat =3

Different numbers though (108000 = 360 x 300, the ice grid this time) ... progress?

russfiedler commented 4 years ago

Looks like it's back to front?

aidanheerdegen commented 4 years ago

It does, doesn't it. I removed all the (optional?) size stuff from the namcouple since it wasn't used in the current configs, so I thought that might be the issue.

aidanheerdegen commented 4 years ago

~Oh, there aren't remapping weight files specified for some of the fields ... but there are for all the fields in the model versions.~

I was looking only at the atmosphere -> ice fields, which do require a remapping file.

aidanheerdegen commented 4 years ago

I am getting the same error with the JRA55_IAF test.

I got JRA55_IAF_SINGLE_FIELD to run by copying the i2o.nc and o2i.nc files from /g/data/ik11/inputs/access-om2/input_20200530, so that is a plus, eh. If one works, we should be able to get the others going ...
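
For anyone reproducing this, something like the following should locate those files without knowing the exact layout of the input set (sketch only):

# search the whole input set for the coupling field files
find /g/data/ik11/inputs/access-om2/input_20200530 -name i2o.nc -o -name o2i.nc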

aekiss commented 4 years ago

If you're using input_20200530 you might need to use Nic's new namcouple in the ak-dev config branches; these are identical across the 6 configs.

aidanheerdegen commented 4 years ago

Well, I don't know if I changed something or just got it wrong, but even JRA55_IAF_SINGLE_FIELD isn't working, complaining about the grid size mismatch.

@nichannah I am working here:

/scratch/x77/aph502/scratch/libaccessom2/tests/JRA55_IAF_SINGLE_FIELD

Can you take a look and see if you can see the issue? I was about to dive into a debugger, but thought if you could see the problem easily then it would be a more productive approach.

nichannah commented 4 years ago

I have fixed some of the tests. There were a few problems, but the main one was that they did not use the new forcing.json field which I introduced to support JRA55 v1p4.

The FORCING_SCALING and JRA55_v1p4_IAF tests are still not working due to missing/wrong input files. I'll fix those.
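
For reference, a forcing.json entry has roughly this shape (illustrative only: the filename, fieldname and cname values here are made up, and the tests in the repo are the authoritative examples):

{
  "description": "JRA55 v1p4 IAF forcing (illustrative)",
  "inputs": [
    {
      "filename": "../test_data/rsds.nc",
      "fieldname": "rsds",
      "cname": "swfld_ai"
    }
  ]
}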

nichannah commented 4 years ago

I have merged a branch into master that fixes all the tests. However, I have not set things up to run on Jenkins yet, so I'm keeping this issue open until we do that.

aidanheerdegen commented 4 years ago

Awesome! I can do the Jenkins stuff if you don't have time, but I'm busy right now. Let me know if you do start working on it so we don't duplicate effort.

aekiss commented 2 years ago

Looks like the non-ERA5 tests need to be updated for compatibility with the changes to the forcing.json format, in particular https://github.com/COSIMA/libaccessom2/commit/a451a7f8d430077263575a0ea229f883b8d4a259 and https://github.com/COSIMA/libaccessom2/commit/467e3e2848ae50a770941325de0c2b2faa63d20d

aekiss commented 2 years ago

... or alternatively, libaccessom2 could be made back-compatible with the older forcing.json format - see https://github.com/COSIMA/libaccessom2/issues/75

aekiss commented 2 years ago

master and 242-era5-support are now separate, divergent branches - see https://github.com/COSIMA/libaccessom2/issues/75.

Non-ERA5 tests should now work with master, but ERA5 tests will fail.