nichannah opened this issue 4 years ago
@nichannah I get a segfault running the tests with `a9e2883`:
```
export LIBACCESSOM2_DIR=$(pwd)
module load openmpi
cd tests/
./copy_test_data_from_gadi.sh
cd JRA55_IAF
rm -rf log ; mkdir log ; rm -f accessom2_restart_datetime.nml ; cp ../test_data/i2o.nc ./ ; cp ../test_data/o2i.nc ./
mpirun -np 1 $LIBACCESSOM2_DIR/build/bin/yatm.exe : -np 1 $LIBACCESSOM2_DIR/build/bin/ice_stub.exe : -np 1 $LIBACCESSOM2_DIR/build/bin/ocean_stub.exe
```
yields
```
YATM_COMMIT_HASH=575fb04771e5442e19654c27b183e92d8b205f3f
OCEAN_STUB_COMMIT_HASH=575fb04771e5442e19654c27b183e92d8b205f3f
ICE_STUB_COMMIT_HASH=575fb04771e5442e19654c27b183e92d8b205f3f
mom5xx: LIBACCESSOM2_COMMIT_HASH=575fb04771e5442e19654c27b183e92d8b205f3f
cicexx: LIBACCESSOM2_COMMIT_HASH=575fb04771e5442e19654c27b183e92d8b205f3f
matmxx: LIBACCESSOM2_COMMIT_HASH=575fb04771e5442e19654c27b183e92d8b205f3f
[gadi-login-04:6441 :0:6441] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7ffdb447bbc0)
==== backtrace (tid: 6441) ====
0 0x0000000000012d80 .annobin_sigaction.c() sigaction.c:0
1 0x000000000069f46b m_attrvect_mp_sort__.V() /home/156/aek156/github/COSIMA/libaccessom2/build/oasis3-mct-prefix/src/oasis3-mct/Linux/build/lib/mctdir/mct/m_AttrVect.F90:3455
2 0x000000000062f1f0 sort_() /home/156/aek156/github/COSIMA/libaccessom2/build/oasis3-mct-prefix/src/oasis3-mct/Linux/build/lib/mctdir/mct/m_SparseMatrix.F90:2637
3 0x000000000062f1f0 sortpermute_() /home/156/aek156/github/COSIMA/libaccessom2/build/oasis3-mct-prefix/src/oasis3-mct/Linux/build/lib/mctdir/mct/m_SparseMatrix.F90:2750
4 0x00000000006341b5 m_sparsematrixtomaps_mp_sparsematrixtoxglobalsegmap__() /home/156/aek156/github/COSIMA/libaccessom2/build/oasis3-mct-prefix/src/oasis3-mct/Linux/build/lib/mctdir/mct/m_SparseMatrixToMaps.F90:150
5 0x00000000006338ab m_sparsematrixplus_mp_initdistributed__() /home/156/aek156/github/COSIMA/libaccessom2/build/oasis3-mct-prefix/src/oasis3-mct/Linux/build/lib/mctdir/mct/m_SparseMatrixPlus.F90:516
6 0x00000000005956c0 mod_oasis_coupler_mp_oasis_coupler_setup_.V() /home/156/aek156/github/COSIMA/libaccessom2/build/oasis3-mct-prefix/src/oasis3-mct/lib/psmile/src/mod_oasis_coupler.F90:943
7 0x000000000044b023 mod_oasis_method_mp_oasis_enddef_.V() /home/156/aek156/github/COSIMA/libaccessom2/build/oasis3-mct-prefix/src/oasis3-mct/lib/psmile/src/mod_oasis_method.F90:741
8 0x000000000041f3d1 coupler_mod_mp_coupler_init_end_() /home/156/aek156/github/COSIMA/libaccessom2/libcouple/src/coupler.F90:149
9 0x000000000040e74c MAIN__.V() /home/156/aek156/github/COSIMA/libaccessom2/ice_stub/src/ice.F90:109
10 0x000000000040ce22 main() ???:0
11 0x0000000000023813 __libc_start_main() ???:0
12 0x000000000040cd2e _start() ???:0
=================================
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
yatm.exe 00000000007F8834 Unknown Unknown Unknown
libpthread-2.28.s 00007F363EAA5D80 Unknown Unknown Unknown
mca_pml_ucx.so 00007F36251EFE20 Unknown Unknown Unknown
mca_pml_ucx.so 00007F36251F10F3 mca_pml_ucx_recv Unknown Unknown
libmpi.so.40.20.2 00007F363F0A86CD MPI_Recv Unknown Unknown
libmpi_mpifh.so 00007F363F386A10 pmpi_recv_ Unknown Unknown
yatm.exe 00000000006F66FD Unknown Unknown Unknown
yatm.exe 000000000066D8DA Unknown Unknown Unknown
yatm.exe 00000000005E5AF2 mod_oasis_coupler 1055 mod_oasis_coupler.F90
yatm.exe 000000000049A8B3 mod_oasis_method_ 741 mod_oasis_method.F90
yatm.exe 00000000004418E1 coupler_mod_mp_co 149 coupler.F90
yatm.exe 000000000040EF77 MAIN__.V 108 atm.F90
yatm.exe 000000000040D5E2 Unknown Unknown Unknown
libc-2.28.so 00007F363E4EE813 __libc_start_main Unknown Unknown
yatm.exe 000000000040D4EE Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
ocean_stub.exe 0000000000797A24 Unknown Unknown Unknown
libpthread-2.28.s 00007F2CC137BD80 Unknown Unknown Unknown
mca_pml_ucx.so 00007F2CAC0DF0F3 mca_pml_ucx_recv Unknown Unknown
libmpi.so.40.20.2 00007F2CC197E6CD MPI_Recv Unknown Unknown
libmpi_mpifh.so 00007F2CC1C5CA10 pmpi_recv_ Unknown Unknown
ocean_stub.exe 00000000006A226D Unknown Unknown Unknown
ocean_stub.exe 000000000061944A Unknown Unknown Unknown
ocean_stub.exe 0000000000591662 mod_oasis_coupler 1055 mod_oasis_coupler.F90
ocean_stub.exe 0000000000446043 mod_oasis_method_ 741 mod_oasis_method.F90
ocean_stub.exe 000000000041A3F1 coupler_mod_mp_co 149 coupler.F90
ocean_stub.exe 000000000040E539 MAIN__.V 78 ocean.F90
ocean_stub.exe 000000000040CE22 Unknown Unknown Unknown
libc-2.28.so 00007F2CC0DC4813 __libc_start_main Unknown Unknown
ocean_stub.exe 000000000040CD2E Unknown Unknown Unknown
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node gadi-login-04 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
```
This is the sort of message that I was getting in my ports of CM4 etc. My guess is that a temporary array is being created for `aV%iAttr(iIndex(n),:)`. Try setting `-heap-arrays` or `-heap-arrays 10` when compiling to put the temporaries on the heap rather than the stack.
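For what it's worth, a minimal sketch of how the flag could be passed through the build - assuming an Intel Fortran compiler and that the CMake configure picks up `FFLAGS` from the environment (both assumptions; adjust to however libaccessom2 is actually built):

```sh
# Hypothetical: inject -heap-arrays into the Fortran flags at configure time.
# CMake seeds CMAKE_Fortran_FLAGS from the FFLAGS environment variable on the
# first configure of a fresh build directory.
export FFLAGS="-heap-arrays"            # all temporaries on the heap
# or: export FFLAGS="-heap-arrays 10"   # only temporaries larger than 10 kB
rm -rf build && mkdir build && cd build
cmake .. && make
```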
Thanks - I just tried `-heap-arrays 10` but got the same error.
Try just `-heap-arrays` to put them all on the heap.
I just tried `-heap-arrays` - still no luck.
Seems to be working fine for me on the express queue without having to invoke `-heap-arrays`.
Hang on, it's just crashed at the end in the ocean stub with a heap of warnings like:
```
[1580886787.462249] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x8433000 was not returned to mpool ucp_am_bufs
[1580886787.462270] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x8435080 was not returned to mpool ucp_am_bufs
[1580886787.462273] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x1460615b8040 was not returned to mpool ucp_am_bufs
[1580886787.462275] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x1460615ba0c0 was not returned to mpool ucp_am_bufs
 0 0x0000000000051959 ucs_fatal_error_message() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/ucs/../../../src/ucs/debug/assert.c:36
 1 0x0000000000051a36 ucs_fatal_error_format() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/ucs/../../../src/ucs/debug/assert.c:52
 2 0x00000000000562f0 ucs_mem_region_destroy_internal() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/ucs/../../../src/ucs/memory/rcache.c:200
 3 0x000000000005c6c6 ucs_class_call_cleanup_chain() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/ucs/../../../src/ucs/type/class.c:52
 4 0x0000000000056f38 ucs_rcache_destroy() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/ucs/../../../src/ucs/memory/rcache.c:729
 5 0x00000000000030f2 uct_knem_md_close() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/uct/sm/knem/../../../../../src/uct/sm/knem/knem_md.c:91
 6 0x000000000000f1c9 ucp_free_resources() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/ucp/../../../src/ucp/core/ucp_context.c:710
 7 0x000000000000f1c9 ucp_cleanup() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/ucx/1.6.1/source/ucx-1.6.1/build/src/ucp/../../../src/ucp/core/ucp_context.c:1266
 8 0x0000000000005bcc mca_pml_ucx_close() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/gcc-opt/ompi/mca/pml/ucx/../../../../../ompi/mca/pml/ucx/pml_ucx.c:247
 9 0x0000000000007909 mca_pml_ucx_component_close() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/gcc-opt/ompi/mca/pml/ucx/../../../../../ompi/mca/pml/ucx/pml_ucx_component.c:82
10 0x00000000000582b9 mca_base_component_close() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/gcc-opt/opal/mca/base/../../../../opal/mca/base/mca_base_components_close.c:53
11 0x0000000000058345 mca_base_components_close() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/gcc-opt/opal/mca/base/../../../../opal/mca/base/mca_base_components_close.c:85
12 0x0000000000058345 mca_base_components_close() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/gcc-opt/opal/mca/base/../../../../opal/mca/base/mca_base_components_close.c:86
13 0x00000000000621da mca_base_framework_close() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/gcc-opt/opal/mca/base/../../../../opal/mca/base/mca_base_framework.c:216
14 0x000000000004f479 ompi_mpi_finalize() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/gcc-opt/ompi/../../ompi/runtime/ompi_mpi_finalize.c:363
15 0x000000000004ac29 ompi_finalize_f() /home/900/z30_apps/builds/_UYaaG8i/0/nci/gadi-apps/openmpi/4.0.2/source/openmpi-4.0.2/intel-opt/ompi/mpi/fortran/mpif-h/profile/pfinalize_f.c:71
16 0x0000000000418cb0 accessom2_mod_mp_accessom2deinit() /scratch/p93/raf599/cosima/gaditest/libaccessom2/libcouple/src/accessom2.F90:839
17 0x000000000040ec0a MAIN__.V() /scratch/p93/raf599/cosima/gaditest/libaccessom2/ocean_stub/src/ocean.F90:114
18 0x000000000040ce22 main() ???:0
19 0x0000000000023813 __libc_start_main() ???:0
20 0x000000000040cd2e _start() ???:0
```
I also found this in the thousands of messages: a warning in `rcache.c` and a failed assertion that matches the trace.
```
[1580886787.458225] [gadi-cpu-clx-2901:94690:0] rcache.c:360 UCX WARN knem rcache device: destroying inuse region 0x1c85a20 [0x1d56c00..0x1e29b00] g- rw ref 1 cookie 10351893497382213308 addr 0x1d56c00
[1580886787.458245] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x887e080 was not returned to mpool ucp_am_bufs
[1580886787.458248] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x8880100 was not returned to mpool ucp_am_bufs
[1580886787.458250] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x8882180 was not returned to mpool ucp_am_bufs
[1580886787.458263] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x8884200 was not returned to mpool ucp_am_bufs
[1580886787.458267] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x8886280 was not returned to mpool ucp_am_bufs
[1580886787.458270] [gadi-cpu-clx-2901:94688:0] mpool.c:38 UCX WARN object 0x8888300 was not returned to mpool ucp_am_bufs
[gadi-cpu-clx-2901:94690:0:94690] rcache.c:200 Assertion `region->refcount == 0' failed
```
Interesting. Do you think that's a related problem or something else?
The problem seems to be a missing remap weights file.
```
oasis_coupler_setup DEBUG ci:read mapfile
../test_data/rmp_jra55_cice_conserve.nc
oasis_coupler_setup DEBUG ci: inquire mapfile
../test_data/rmp_jra55_cice_conserve.nc F
----**** ENTER oasis_coupler_genmap
------**** ENTER oasis_io_read_field_fromroot
oasis_io_read_field_fromroot ERROR: in filename grids.nc
oasis_io_read_field_fromroot abort by model : 2 proc : 0
```
This file doesn't exist: `../test_data/rmp_jra55_cice_conserve.nc`
It should really give a more informative error message than that.
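As a stopgap, a pre-flight check along these lines would at least fail with a readable message - the grep pattern is just a guess at how the `rmp_*` names appear in `namcouple`:

```sh
# Rough pre-flight check: find every rmp_*.nc weights file mentioned in
# namcouple and report any that are missing from test_data.
for f in $(grep -oE 'rmp_[A-Za-z0-9_]+\.nc' namcouple | sort -u); do
    [ -f "../test_data/$f" ] || echo "MISSING weights file: ../test_data/$f"
done
```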
I've tried a couple of other remapping files, but they appear to be the wrong size.
Anyone know where that file is, or how the `namcouple` should be altered to be consistent with the weights files that are there?
Isn't the problem that the `copy_test_data_from_gadi.sh` script is copying `rmp_jrar_to_cict_CONSERV.nc` to the `test_data` directory, but the `namcouple` file is referencing `rmp_jra55_cice_conserve.nc`? I can't see a link or renaming being done.
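If so, a one-line workaround (assuming the two names really do refer to the same remapping, which should be verified) would be to link the copied file to the name `namcouple` expects:

```sh
# Hypothetical workaround: alias the copied weights file to the name that
# namcouple references. Only valid if the two files are the same remapping.
cd test_data
ln -s rmp_jrar_to_cict_CONSERV.nc rmp_jra55_cice_conserve.nc
```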
Here? `/g/data/ik11/inputs/access-om2/input_08022019/common_1deg_jra55`
These files map from a 640x320 grid to a 360x300 grid and have both the conserve and smooth versions as required by the `namcouple` file. They look like matches to the second-order and patch files in `/g/data/ik11/inputs/access-om2/input_rc/common_1deg_jra55`.
Would it be a good idea to use the latest set of inputs from `/g/data/ik11/inputs/access-om2/input_20200530`? Some of the weights files have been renamed though.
@russfiedler the file named `rmp_jrar_to_cict_CONSERV.nc` doesn't work either. I get this for all the remap files I have tried:
```
MCT::m_ExchangeMaps::ExGSMapGSMap_:: MCTERROR, Grid Size mismatch
LocalMap Gsize = 204800 RemoteMap Gsize = 1036800
MCT::m_ExchangeMaps::ExGSMapGSMap_: Map Grid Size mismatch error, stat =3
MCT::m_ExchangeMaps::ExGSMapGSMap_:: MCTERROR, Grid Size mismatch
LocalMap Gsize = 1036800 RemoteMap Gsize = 204800
MCT::m_ExchangeMaps::ExGSMapGSMap_: Map Grid Size mismatch error, stat =3
```
I assume this is, as it says, a grid-size mismatch, and that this remap file is incompatible with the `namcouple`.
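A quick way to vet a candidate weights file before wiring it into `namcouple` is to inspect the grid sizes in its header - note the variable/dimension names differ between SCRIP-style (`src_grid_size`/`dst_grid_size`) and ESMF-style (`n_a`/`n_b`) weights files, so the pattern below is a guess:

```sh
# Dump just the header and pull out the source/destination grid sizes.
ncdump -h rmp_jra55_cice_conserve.nc | grep -iE 'src_grid|dst_grid|n_a |n_b '
# For this test these should match the expected coupling grids,
# e.g. 640*320 = 204800 (JRA55) and 360*300 = 108000 (1 degree ocean/ice).
```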
Tried `/g/data/ik11/inputs/access-om2/input_08022019/common_1deg_jra55/common_1deg_jra55/rmp_jra55_cice_conserve.nc`. Also doesn't work. Same error as above.
Lysdexia rules! Try `/g/data/ik11/inputs/access-om2/input_08022019/common_1deg_jra55/rmp_jra55_cice_smooth.nc` - `common_1deg_jra55` is repeated in the path above.
Same problem:
```
MCT::m_ExchangeMaps::ExGSMapGSMap_:: MCTERROR, Grid Size mismatch
LocalMap Gsize = 1036800 RemoteMap Gsize = 108000
MCT::m_ExchangeMaps::ExGSMapGSMap_: Map Grid Size mismatch error, stat =3
MCT::m_ExchangeMaps::ExGSMapGSMap_:: MCTERROR, Grid Size mismatch
LocalMap Gsize = 108000 RemoteMap Gsize = 1036800
MCT::m_ExchangeMaps::ExGSMapGSMap_: Map Grid Size mismatch error, stat =3
```
Different numbers though ... progress?
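(For reference: 108000 = 360 × 300 and 204800 = 640 × 320, i.e. the two sides of the 1° weights files described above, while 1036800 matches neither - so the 1036800 presumably comes from one component's own declared grid size rather than from the weights file. That's an inference, not something I've verified.)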
Looks like it's back to front?
It does, doesn't it. I removed all the (optional?) size stuff from `namcouple` as it wasn't used in the current configs, so I thought that might be the issue.
~Oh, there aren't remapping weight files specified for some of the fields ... but there are for all the fields in the model versions.~
I was looking only at the atmosphere -> ice fields, which do require a remapping file.
I am getting the same error with the `JRA55_IAF` test.
I got `JRA55_IAF_SINGLE_FIELD` to run by copying the `i2o.nc` and `o2i.nc` files from `/g/data/ik11/inputs/access-om2/input_20200530`. So that is a plus, eh. If one works, we should be able to get the others going ...
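For anyone reproducing this: the exact subdirectory under `input_20200530` isn't given above, so a locate step like the following (hypothetical) may help:

```sh
# Find the coupling restart files somewhere under the new input set.
find /g/data/ik11/inputs/access-om2/input_20200530 -name 'i2o.nc' -o -name 'o2i.nc'
```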
If you're using `input_20200530` you might need to use Nic's new `namcouple` in the `ak-dev` config branches - these are identical across the 6 configs.
Well, I don't know if I changed something or just got it wrong, but even `JRA55_IAF_SINGLE_FIELD` isn't working - it's complaining about the grid size mismatch.
@nichannah I am working here:
`/scratch/x77/aph502/scratch/libaccessom2/tests/JRA55_IAF_SINGLE_FIELD`
Can you take a look and see if you can see the issue? I was about to dive into a debugger, but thought if you could see the problem easily then it would be a more productive approach.
I have fixed some of the tests. There were a few problems, but the main one was that they did not use the new `forcing.json` field which I introduced to support JRA55 v1p4.
The `FORCING_SCALING` and `JRA55_v1p4_IAF` tests are still not working due to missing/wrong input files. I'll fix those.
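For anyone updating the remaining tests by hand, the general shape of a `forcing.json` entry is roughly as below - treat this as an illustrative sketch only, since the exact keys (and whatever was added for v1p4) should be checked against the updated tests in the repo:

```json
{
  "description": "JRA55 IAF forcing (illustrative sketch only)",
  "inputs": [
    {
      "filename": "../test_data/rsds.nc",
      "fieldname": "rsds",
      "cname": "swfld_ai"
    }
  ]
}
```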
I have merged a branch into master that fixes all the tests. However I have not set things up to run on Jenkins yet so keeping this issue open until we do that.
Awesome! I can do the Jenkins stuff if you don't have time, but am busy right now. Let me know if you do start working on it so we don't duplicate.
Looks like the non-ERA5 tests need to be updated for compatibility with the changes to the `forcing.json` format, in particular https://github.com/COSIMA/libaccessom2/commit/a451a7f8d430077263575a0ea229f883b8d4a259 and https://github.com/COSIMA/libaccessom2/commit/467e3e2848ae50a770941325de0c2b2faa63d20d
... or alternatively, libaccessom2 could be made back-compatible with the older `forcing.json` format - see https://github.com/COSIMA/libaccessom2/issues/75
`master` and `242-era5-support` are now separate, divergent branches - see https://github.com/COSIMA/libaccessom2/issues/75. Non-ERA5 tests should now work with `master`, but ERA5 tests will fail.
To do: https://github.com/COSIMA/access-om2/issues/182 for libaccessom2 tests