Closed: jpolton closed this issue 3 years ago
I think we have a good clean version of the DOMAINcfg-MEs code to start with. This code is able to replicate the domain_cfg_MEs_L51_r24-07_opt_v2.nc we used in the JMMP_AMM7 paper.
The code is here: DOMAINcfg-MEs@15134.
Next step is to clean the python code to create the envelope. Almost there, will alert when ready.
Does the python code insert the hbatt_X variables into bathy_meter.nc?
Yes, "hbatt_x" are the envelopes.
@oceandie Can you send me a link to an AMM7 version of namelist_cfg, ideally with 2 envelopes and 51 levels (I think)?
In particular, I am interested in seeing how we settled on the split of levels between env1 and env2. (As I understand it so far, nn_slev is the only interesting/tuneable variable in &namzgr_mes with the MEs.)
(PS the python script seems to work fine :-) , but I've not 'reviewed' it yet as I wanted to see it fitting into the domain_cfg.nc building process)
@jpolton the following is the setup I used for JMMP-AMM7 work:
!-----------------------------------------------------------------------
&namrun ! parameters of the run
!-----------------------------------------------------------------------
nn_no = 0 ! job number (no more used...)
cn_exp = "domaincfg" ! experience name
nn_it000 = 1 ! first time step
nn_itend = 75 ! last time step (std 5475)
/
!-----------------------------------------------------------------------
&namcfg ! parameters of the configuration
!-----------------------------------------------------------------------
!
ln_e3_dep = .true. ! =T : e3=dk[depth] in discret sens.
! ! ===>>> will become the only possibility in v4.0
! ! =F : e3 analytical derivative of depth function
! ! only there for backward compatibility test with v3.6
! !
cp_cfg = 'amm7'
jp_cfg = 011
jperio = 0
jpidta = 297
jpiglo = 297
jpjdta = 375
jpjglo = 375
jpkdta = 51
jpizoom = 1
jpjzoom = 1
/
!-----------------------------------------------------------------------
&namzgr ! vertical coordinate
!-----------------------------------------------------------------------
ln_mes = .true. ! Multi-Envelope s-coordinate
ln_linssh = .true. ! linear free surface
/
!-----------------------------------------------------------------------
&namzgr_mes ! MEs-coordinate
!-----------------------------------------------------------------------
ln_envl = .TRUE. , .TRUE. , .FALSE. , .FALSE., .FALSE. ! (T/F) If the envelope is used
nn_strt = 2 , 1 , 1 , 1 , 1 ! Stretch. funct.: Madec 1996 (0) or
! Song & Haidvogel 1994 (1) or
! Siddorn & Furner 2012 (2)
nn_slev = 25 , 26 , 0 , 0 , 0 ! number of s-lev between env(n-1)
! and env(n)
rn_e_hc = 20.0 , 0.0 , 0.0 , 0.0 , 0.0 ! critical depth for transition to
! stretch. coord.
rn_e_th = 0.9 , 1.0 , 0.0 , 0.0 , 0.0 ! surf. control param.:
! SH94 or MD96: 0<=th<=20
! SF12: thickness surf. cell
rn_e_bb = -0.3 , 0.8 , 0.0 , 0.0 , 0.0 ! bot. control param.:
! SH94 or MD96: 0<=bb<=1
! SF12: offset for calculating Zb
rn_e_al = 4.4 , 0.0 , 0.0 , 0.0 , 0.0 ! alpha stretching param with SF12
rn_e_ba = 0.064, 0.0 , 0.0 , 0.0 , 0.0 ! SF12 bathymetry scaling factor for
! calculating Zb
rn_bot_min = 10.0 ! minimum depth of the ocean bottom (>0) (m)
rn_bot_max = 5600.0 ! maximum depth of the ocean bottom (= ocean depth) (>0) (m)
ln_loc_mes = .FALSE.
/
!-----------------------------------------------------------------------
&namdom ! space and time domain (bathymetry, mesh, timestep)
!-----------------------------------------------------------------------
nn_bathy = 1
nn_msh = 1
jphgr_msh = 0
ldbletanh = .TRUE.
ppglam0 = 999999.d0
ppgphi0 = 999999.d0
ppe1_deg = 999999.d0
ppe2_deg = 999999.d0
ppe1_m = 999999.d0
ppe2_m = 999999.d0
ppa0 = 103.9530096000000
ppa1 = 2.415951269000000
ppa2 = 100.7609285000000
ppacr = 7.0
ppacr2 = 13.0
ppdzmin = 999999.0
pphmax = 999999.0
ppkth = 15.35101370000000
ppkth2 = 48.02989372000000
ppsur = -3958.951371276829
rn_atfp = 0.1
rn_e3zps_min= 25.0
rn_e3zps_rat= 0.2
rn_hmin = -8.0
rn_rdt = 1350.0
/
I would say that nn_slev is surely one of the most important, as is the case for every discretised domain. However, as in standard sigma/s-levels, in MEs it is also possible to choose and tune the stretching for every subdomain, which will importantly affect how we reproduce particular processes (e.g. BBL dynamics).
Particular attention then has to be paid to odd subdomains, where levels are computed automatically using cubic splines that enforce monotonicity and continuity of the level distribution and of its first derivative - we can have a chat if you have any questions or doubts on this point ... I hope this helps.
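For intuition on what nn_slev, rn_e_th and rn_e_bb control in one sub-domain, here is a rough stand-alone sketch of the Song & Haidvogel (1994) stretching used for the second sub-zone in the namelist above (26 levels, th=1.0, bb=0.8). The envelope depths are made-up numbers and this is the textbook formula only, not the mes.F90 implementation:

```python
import numpy as np

def sh94_stretch(s, theta, bb):
    # Song & Haidvogel (1994) stretching: C(0) = 0 at the upper envelope,
    # C(-1) = -1 at the lower envelope
    return ((1.0 - bb) * np.sinh(theta * s) / np.sinh(theta)
            + bb * (np.tanh(theta * (s + 0.5)) - np.tanh(0.5 * theta))
            / (2.0 * np.tanh(0.5 * theta)))

nlev = 26                             # nn_slev for the second sub-domain
env1, env2 = 10.0, 200.0              # hypothetical local envelope depths (m)
s = np.linspace(0.0, -1.0, nlev + 1)  # w-level positions between the envelopes
gdepw = env1 - sh94_stretch(s, theta=1.0, bb=0.8) * (env2 - env1)
print(np.round(gdepw, 2))             # increases monotonically from env1 to env2
```

Increasing th clusters the levels near the upper envelope, while bb closer to 1 keeps more resolution near the lower envelope.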
For the python code to build the envelopes, do you have the velocity files I used to tune the AMM7 domain? ... if not I can put them on jasmin if you want to reproduce the exact copy of the JMMP-AMM7 MEs domain_cfg.nc.
No need. I am jumping straight into AMM15, so I'm doing a first pass with an empty list of velocity files.
@oceandie line 565 of http://forge.ipsl.jussieu.fr/nemo/browser/NEMO/branches/UKMO/tools_r4.0-HEAD_dev_MEs/DOMAINcfg/src/mes.F90?order=date&desc=1#L565 introduces the call
IF( lk_mpp ) CALL mppbcast_a_real(rn_ebot_max, max_nn_env, irnk )
which breaks my ARCHER2 run with opaque MPI errors. I cannot see this call anywhere else in the NEMO source code, so I suspect I have not compiled it correctly for the picky Cray compiler. I will poke around to see if I can find an alternative, but perhaps from your experience of writing this you have some suggestions?
@jpolton I introduced the mppbcast_a_real subroutine in the DOMAINcfg code: I needed it in order to pass the same global maximum depth of each envelope to all the processors. It is used only for diagnostics, so it is not strictly needed (we could actually comment out that diagnostic part if we cannot solve the problem).
If you could send me the error output I can try to see if I recognise any similarity with the errors I got during development on our HPC ... but I can't guarantee I can solve it, I am still learning how to properly deal with MPI ... maybe James could help??
Here is the error:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffd53623900, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000002) failed
PMPI_Bcast(414): Invalid root (value given was -1)
aborting job:
Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffd53623900, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000002) failed
PMPI_Bcast(414): Invalid root (value given was -1)
MPICH ERROR [Rank 1] [job id 426470.0] [Mon Aug 2 22:06:40 2021] [nid001015] - Abort(604599047) (rank 1 in comm 0): Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffdb2d34760, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)
aborting job:
Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffdb2d34760, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)
[1627938400.142322] [nid001015:56637:0] mm_xpmem.c:82 UCX WARN remote segment id 20000dd3d apid 20000dd3d is not released, refcount 1
[1627938400.142338] [nid001015:56637:0] mm_xpmem.c:82 UCX WARN remote segment id 20000dd3c apid 10000dd3d is not released, refcount 1
[1627938400.147733] [nid001015:56636:0] mm_xpmem.c:82 UCX WARN remote segment id 20000dd3d apid 20000dd3c is not released, refcount 1
[1627938400.147744] [nid001015:56636:0] mm_xpmem.c:82 UCX WARN remote segment id 20000dd3c apid 10000dd3c is not released, refcount 1
MPICH ERROR [Rank 7] [job id 426470.0] [Mon Aug 2 22:06:40 2021] [nid001028] - Abort(738816775) (rank 7 in comm 0): Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7fffa45f5e20, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)
aborting job:
Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7fffa45f5e20, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)
[1627938400.142338] [nid001028:58828:0] mm_xpmem.c:82 UCX WARN remote segment id 20000e5cd apid 20000e5cc is not released, refcount 1
[1627938400.142354] [nid001028:58828:0] mm_xpmem.c:82 UCX WARN remote segment id 20000e5cc apid 10000e5cc is not released, refcount 1
MPICH ERROR [Rank 5] [job id 426470.0] [Mon Aug 2 22:06:40 2021] [nid001027] - Abort(67728135) (rank 5 in comm 0): Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffead771960, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)
aborting job:
Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffead771960, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)
[1627938400.142346] [nid001027:216092:0] mm_xpmem.c:82 UCX WARN remote segment id 200034c1d apid 200034c1c is not released, refcount 1
[1627938400.142362] [nid001027:216092:0] mm_xpmem.c:82 UCX WARN remote segment id 200034c1c apid 100034c1c is not released, refcount 1
[1627938400.142701] [nid001028:58829:0] mm_xpmem.c:82 UCX WARN remote segment id 20000e5cd apid 20000e5cd is not released, refcount 1
[1627938400.142716] [nid001028:58829:0] mm_xpmem.c:82 UCX WARN remote segment id 20000e5cc apid 10000e5cd is not released, refcount 1
[1627938400.142752] [nid001027:216093:0] mm_xpmem.c:82 UCX WARN remote segment id 200034c1d apid 200034c1d is not released, refcount 1
[1627938400.142767] [nid001027:216093:0] mm_xpmem.c:82 UCX WARN remote segment id 200034c1c apid 100034c1d is not released, refcount 1
MPICH ERROR [Rank 3] [job id 426470.0] [Mon Aug 2 22:06:40 2021] [nid001026] - Abort(873034503) (rank 3 in comm 0): Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffc24306200, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)
aborting job:
Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffc24306200, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)
[1627938400.142345] [nid001026:71640:0] mm_xpmem.c:82 UCX WARN remote segment id 2000117d8 apid 2000117d8 is not released, refcount 1
[1627938400.142361] [nid001026:71640:0] mm_xpmem.c:82 UCX WARN remote segment id 2000117d7 apid 1000117d8 is not released, refcount 1
[1627938400.142778] [nid001026:71639:0] mm_xpmem.c:82 UCX WARN remote segment id 2000117d8 apid 2000117d7 is not released, refcount 1
[1627938400.142794] [nid001026:71639:0] mm_xpmem.c:82 UCX WARN remote segment id 2000117d7 apid 1000117d7 is not released, refcount 1
srun: error: nid001015: tasks 0-1: Exited with exit code 255
srun: Terminating job step 426470.0
srun: error: nid001027: tasks 4-5: Exited with exit code 255
srun: error: nid001026: tasks 2-3: Exited with exit code 255
srun: error: nid001028: tasks 6-7: Exited with exit code 255
I can't find any NEMO examples where this sort of function is performed (there must be some). I will try commenting out the diagnostics for now and see how it goes.
Hi @jpolton , I think I know what is the problem:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffd53623900, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000002) failed
PMPI_Bcast(414): Invalid root (value given was -1)
I am passing a rank which is negative. I do this on purpose in order to identify the processor where the global maximum occurs. This worked on our HPC (could running on Monsoon be an option for you? ... I compiled it there and it works perfectly). I am sure there is a compilation flag to deal with this, or I can try to rewrite the code to take this into account ... let me work on it a bit! ;)
... in the meanwhile, I think that, as you said, if you comment the diagnostic part it should work ...
I'll let you know as soon as I'll have some news ...
@jpolton , actually it shouldn't happen that a negative rank is passed to mppbcast_a_real: the subroutine mpp_max() should take care of it ... apparently it is not working on ARCHER2 ... could you please add
WRITE(numout,*) 'irnk = ', irnk
after L564 of mes.F90 and run the program?
mpp_max() is a standard NEMO subroutine, it should get the maximum of a scalar or an array (integer or real) ... I'll investigate a bit more in the meanwhile
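For reference, the pattern described here (each rank proposes its own rank if it holds the global maximum, mpp_max picks the winner, and that rank then broadcasts) can be sketched with mpi4py. This is only an illustration of the logic, not the mes.F90 code, and the broadcast array is a dummy:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local_max = float(rank)                    # stand-in for this rank's local envelope maximum
global_max = comm.allreduce(local_max, op=MPI.MAX)

# only the rank that holds the global maximum proposes its own rank, everyone else -1
irnk = rank if local_max == global_max else -1
irnk = comm.allreduce(irnk, op=MPI.MAX)    # analogue of CALL mpp_max(irnk)

# if no rank had recognised itself as the owner, irnk would still be -1 here and the
# broadcast root would be invalid -- which matches the PMPI_Bcast error seen on ARCHER2
dummy = np.full(5, global_max) if rank == irnk else None
dummy = comm.bcast(dummy, root=irnk)       # analogue of mppbcast_a_real(...)
```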
Also, another thing to try: could you please move
IF( lk_mpp ) CALL mppsync
after
IF( lk_mpp ) CALL mpp_max(irnk)
and see if this solves it? ...
I will try the above. In the meanwhile I ran make_domain_cfg.exe having commented out the diagnostics in mes.F90. I get mesh_mask_????.nc output files. Looking at one of them I was surprised to not see any e3t etc. This is odd, right?
ncdump -h mesh_mask_0004.nc
netcdf mesh_mask_0004 {
dimensions:
x = 365 ;
y = 672 ;
z = 51 ;
t = UNLIMITED ; // (1 currently)
variables:
float nav_lon(y, x) ;
float nav_lat(y, x) ;
float nav_lev(z) ;
double time_counter(t) ;
byte tmask(t, z, y, x) ;
byte umask(t, z, y, x) ;
byte vmask(t, z, y, x) ;
byte fmask(t, z, y, x) ;
byte tmaskutil(t, y, x) ;
// global attributes:
:DOMAIN_number_total = 8 ;
:DOMAIN_number = 4 ;
:DOMAIN_dimensions_ids = 1, 2 ;
:DOMAIN_size_global = 1458, 1345 ;
:DOMAIN_size_local = 365, 672 ;
:DOMAIN_position_first = 1, 674 ;
:DOMAIN_position_last = 365, 1345 ;
:DOMAIN_halo_size_start = 0, 0 ;
:DOMAIN_halo_size_end = 0, 0 ;
:DOMAIN_type = "BOX" ;
BTW the ARCHER2 queue is slow now that everyone is awake...
@jpolton yes this is odd indeed. Two explanations I can think of:
Would it be possible for you to share the bathy_meter.nc and coordinates.nc files you are using to build the domain_cfg.nc file with me on jasmin?
Regarding the previous mesh_mask query: the run did not finish properly, so that probably explains it.
Ran with the new print statement and with the mppsync and mpp_max calls switched. It fails.
irnk = -1
Crash log:
less slurm-427418.out
MPICH ERROR [Rank 0] [job id 427418.0] [Tue Aug 3 12:05:04 2021] [nid001131] - Abort(738816775) (rank 0 in comm 0): Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffff0af83c0, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000002) failed
PMPI_Bcast(414): Invalid root (value given was -1)
aborting job:
Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffff0af83c0, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000002) failed
PMPI_Bcast(414): Invalid root (value given was -1)
irnk = -1
irnk = -1
irnk = -1
MPICH ERROR [Rank 6] [job id 427418.0] [Tue Aug 3 12:05:04 2021] [nid001206] - Abort(269054727) (rank 6 in comm 0): Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffd1d3c6bc0, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000002) failed
PMPI_Bcast(414): Invalid root (value given was -1)
aborting job:
Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffd1d3c6bc0, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000002) failed
PMPI_Bcast(414): Invalid root (value given was -1)
irnk = -1
MPICH ERROR [Rank 7] [job id 427418.0] [Tue Aug 3 12:05:04 2021] [nid001206] - Abort(134836999) (rank 7 in comm 0): Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7fff8dd2e9c0, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)
aborting job:
Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7fff8dd2e9c0, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)
[1627988704.878449] [nid001206:244032:0] mm_xpmem.c:82 UCX WARN remote segment id 20003b940 apid 20003b940 is not released, refcount 1
[1627988704.878457] [nid001206:244032:0] mm_xpmem.c:82 UCX WARN remote segment id 20003b93f apid 10003b940 is not released, refcount 1
MPICH ERROR [Rank 2] [job id 427418.0] [Tue Aug 3 12:05:04 2021] [nid001197] - Abort(537490183) (rank 2 in comm 0): Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7fff3cba9f00, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000002) failed
PMPI_Bcast(414): Invalid root (value given was -1)
aborting job:
Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7fff3cba9f00, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000002) failed
PMPI_Bcast(414): Invalid root (value given was -1)
irnk = -1
MPICH ERROR [Rank 3] [job id 427418.0] [Tue Aug 3 12:05:04 2021] [nid001197] - Abort(1007252231) (rank 3 in comm 0): Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7fff65f433e0, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)
aborting job:
Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7fff65f433e0, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)
[1627988704.878509] [nid001197:22671:0] mm_xpmem.c:82 UCX WARN remote segment id 200005890 apid 20000588f is not released, refcount 1
[1627988704.878515] [nid001197:22671:0] mm_xpmem.c:82 UCX WARN remote segment id 20000588f apid 10000588f is not released, refcount 1
MPICH ERROR [Rank 4] [job id 427418.0] [Tue Aug 3 12:05:04 2021] [nid001198] - Abort(67728135) (rank 4 in comm 0): Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffd48e652e0, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000002) failed
PMPI_Bcast(414): Invalid root (value given was -1)
aborting job:
Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffd48e652e0, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000002) failed
PMPI_Bcast(414): Invalid root (value given was -1)
irnk = -1
MPICH ERROR [Rank 5] [job id 427418.0] [Tue Aug 3 12:05:04 2021] [nid001198] - Abort(403272455) (rank 5 in comm 0): Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffc9884d8a0, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)
aborting job:
Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffc9884d8a0, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)
[1627988704.878990] [nid001198:180810:0] mm_xpmem.c:82 UCX WARN remote segment id 20002c24b apid 20002c24a is not released, refcount 1
[1627988704.878996] [nid001198:180810:0] mm_xpmem.c:82 UCX WARN remote segment id 20002c24a apid 10002c24a is not released, refcount 1
[1627988704.879006] [nid001198:180811:0] mm_xpmem.c:82 UCX WARN remote segment id 20002c24b apid 20002c24b is not released, refcount 1
[1627988704.879011] [nid001198:180811:0] mm_xpmem.c:82 UCX WARN remote segment id 20002c24a apid 10002c24b is not released, refcount 1
[1627988704.878786] [nid001206:244031:0] mm_xpmem.c:82 UCX WARN remote segment id 20003b940 apid 20003b93f is not released, refcount 1
[1627988704.878791] [nid001206:244031:0] mm_xpmem.c:82 UCX WARN remote segment id 20003b93f apid 10003b93f is not released, refcount 1
[1627988704.878837] [nid001197:22672:0] mm_xpmem.c:82 UCX WARN remote segment id 200005890 apid 200005890 is not released, refcount 1
[1627988704.878841] [nid001197:22672:0] mm_xpmem.c:82 UCX WARN remote segment id 20000588f apid 100005890 is not released, refcount 1
irnk = -1
MPICH ERROR [Rank 1] [job id 427418.0] [Tue Aug 3 12:05:04 2021] [nid001131] - Abort(470381319) (rank 1 in comm 0): Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffdf4180340, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)
aborting job:
Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffdf4180340, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)
[1627988704.878846] [nid001131:172482:0] mm_xpmem.c:82 UCX WARN remote segment id 20002a1c1 apid 10002a1c2 is not released, refcount 1
[1627988704.878853] [nid001131:172482:0] mm_xpmem.c:82 UCX WARN remote segment id 20002a1c2 apid 20002a1c2 is not released, refcount 1
srun: error: nid001197: tasks 2-3: Exited with exit code 255
srun: Terminating job step 427418.0
srun: error: nid001131: task 1: Exited with exit code 255
srun: error: nid001198: tasks 4-5: Exited with exit code 255
srun: error: nid001206: tasks 6-7: Exited with exit code 255
slurmstepd: error: *** STEP 427418.0 ON nid001131 CANCELLED AT 2021-08-03T12:05:05 ***
srun: error: nid001131: task 0: Terminated
srun: Force Terminated job step 427418.0
I will copy the bathymetry and coordinates files to JASMIN for you to have a go to see if it is a Jeff-error or an ARCHER-issue
@oceandie Boo hoo it is not working well for me and ARCHER2 is now painfully slow :-( I've copied the key files to JASMIN, perhaps you can see if it works on MONSooN:
/gws/nopw/j04/jmmp_collab/tmp_jelt_diego (from the README):
amm15.coordinates.nc -- raw coordinates file
amm15.bathydepth.hook.nc -- raw bathy file
bathymetry.MEs_2env_0.24_0.07_opt.nc - output from running generate_envelopes.py, and some preprocessing steps in https://github.com/JMMP-Group/CO9_AMM15/wiki/Make-ME-domain-configuration-file
If you want to see what I've been doing it is on the branch ME_domaincfg
@jpolton OK, I will have a go now, I'll let you know how it goes ;)
Hi @jpolton,
I think I solved the problem:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffd53623900, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000002) failed
PMPI_Bcast(414): Invalid root (value given was -1)
a. ncks -C -v Bathymetry amm15.bathydepth.hook.nc amm15.bathydepth.nc in order to avoid xarray complaining about dimensions.
b. python generate_envelopes.py amm15_MEs_2env_0.24_0.07_opt.inp, where amm15_MEs_2env_0.24_0.07_opt.inp is the input file adapted for AMM15. The output of this is bathymetry.amm15_MEs_2env_0.24_0.07_opt.nc.
Then I used this new bathymetry.amm15_MEs_2env_0.24_0.07_opt.nc with the DOMAINcfg tool and I was able to create correct mesh_mask.nc and domain_cfg.nc files.
Maybe the pre-processing that you applied introduced something that disturbed the code ...
Anyway, I will put my files on jasmin @ /home/users/dbruciaf/tmp_diego_amm15-MEs . Could you please tell me if you are able to create the domain_cfg and mesh_mask files with the new input files, and if they are exactly the same as mine?
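A quick way to check whether the two domain_cfg.nc files match (just a convenience sketch with xarray; the second file name is hypothetical):

```python
import xarray as xr

ds_a = xr.open_dataset("domain_cfg.nc")        # your file
ds_b = xr.open_dataset("domain_cfg_diego.nc")  # hypothetical name for my copy

# compare every variable present in both files
common = sorted(set(ds_a.data_vars) & set(ds_b.data_vars))
for var in common:
    status = "identical" if ds_a[var].equals(ds_b[var]) else "DIFFERS"
    print(f"{var:15s} {status}")
```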
Didn't mean to close this ticket with the python code merge.
@oceandie I have a permissions issue:
[jelt@xfer1 users]$ cd dbruciaf/
-bash: cd: dbruciaf/: Permission denied
Can you change them or put them in /gws/nopw/j04/jmmp_collab/
?
@oceandie I have a permissions issue:
[jelt@xfer1 users]$ cd dbruciaf/ -bash: cd: dbruciaf/: Permission denied
Can you change them or put them in
/gws/nopw/j04/jmmp_collab/
?
@jpolton, done, they are in /gws/nopw/j04/jmmp_collab/tmp_diego_amm15-MEs :)
Using your bathymetry.amm15_MEs_2env_0.24_0.07_opt.nc and a fresh checkout and build of make_domain_cfg.exe, I get the mesh_mask files (which I previously only got after commenting out the diagnostics in mes.F90). I also get further through the process. It looks like it stops arbitrarily in the lower envelope:
less ocean.output
...
-----------------------------------------------------------------
MEs-levels depths and scale factors
k gdepw1 e3w1 gdept1 e3t1
1, 0., 0.41666666666666663, 0.20833333333333331, 0.41666666666666663
2, 0.41666666666666663, 0.41666666666666669, 0.625, 0.41666666666666663
3, 0.83333333333333326, 0.41666666666666674, 1.0416666666666667, 0.41666666666666674
4, 1.25, 0.41666666666666674, 1.4583333333333335, 0.41666666666666652
5, 1.6666666666666665, 0.41666666666666652, 1.875, 0.41666666666666696
6, 2.0833333333333335, 0.41666666666666652, 2.2916666666666665, 0.41666666666666652
7, 2.5, 0.41666666666666652, 2.708333333333333, 0.41666666666666696
8, 2.916666666666667, 0.41666666666666696, 3.125, 0.41666666666666607
9, 3.333333333333333, 0.41666666666666696, 3.541666666666667, 0.41666666666666696
10, 3.75, 0.41666666666666607, 3.958333333333333, 0.41666666666666696
11, 4.166666666666667, 0.41666666666666696, 4.375, 0.41666666666666607
12, 4.583333333333333, 0.41666666666666696, 4.791666666666667, 0.41666666666666696
13, 5., 0.41666666666666696, 5.2083333333333339, 0.41666666666666607
14, 5.4166666666666661, 0.41666666666666607, 5.625, 0.41666666666666785
15, 5.8333333333333339, 0.41666666666666607, 6.0416666666666661, 0.41666666666666607
16, 6.25, 0.41666666666666785, 6.4583333333333339, 0.41666666666666607
17, 6.6666666666666661, 0.41666666666666607, 6.875, 0.41666666666666785
18, 7.0833333333333339, 0.41666666666666607, 7.2916666666666661, 0.41666666666666607
19, 7.5, 0.41666666666666785, 7.7083333333333339, 0.41666666666666607
20, 7.9166666666666661, 0.41666666666666607, 8.125, 0.41666666666666785
21, 8.3333333333333339, 0.41666666666666607, 8.5416666666666661, 0.41666666666666607
22, 8.75, 0.41666666666666785, 8.9583333333333339, 0.41666666666666607
23, 9.1666666666666661, 0.41666666666666607, 9.375, 0.41666666666666785
24, 9.5833333333333339, 0.41666666666666607, 9.7916666666666661, 0.41666666666666607
25, 10., 0.44893752898995309, 10.240604195656619, 0.59813355512192423
26, 10.598133555121924, 0.8346313245425474, 11.075235520199167, 1.0762266956483639
27, 11.674360250770288, 1.3225044819382106, 12.397740002137377, 1.5730088158643518
28, 13.24736906663464, 1.8272444801782672, 14.224984482315644, 2.0846786475108079
29, 15.332047714145448, 2.3447430306400285, 16.569727512955673, 2.6068364401277062
30, 17.938884154273154, 2.8703277399238729, 19.440055252879546, 3.1345591852538561
31, 21.07344333952701, 3.3988501207367001, 22.838905373616246, 3.662501010366185
32, 24.735944349893195, 3.9247977648547909, 26.763703138471037, 4.1850163260389728
33, 28.920960675932168, 4.4424274627107607, 31.206130601181798, 4.6963017275146051
34, 33.617262403446773, 4.9459145205562365, 36.152045121738034, 5.1905512022275531
35, 38.807813605674326, 5.4295121955568675, 41.581557317294902, 5.6621180172234915
36, 44.469931622897818, 5.8877141762826923, 47.469271493577594, 6.105
With a seg fault:
[nid001255:35553:0:35553] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ebd399b10)
==== backtrace (tid: 35553) ====
0 /opt/cray/cray-ucx/2.6.0-3/ucx/lib/libucs.so.0(ucs_handle_error+0x104) [0x2b0554339f74]
1 /opt/cray/cray-ucx/2.6.0-3/ucx/lib/libucs.so.0(+0x2435c) [0x2b055433a35c]
2 /opt/cray/cray-ucx/2.6.0-3/ucx/lib/libucs.so.0(+0x245c4) [0x2b055433a5c4]
3 /lib64/libc.so.6(+0x395a0) [0x2b0552b095a0]
4 /work/n01/n01/jelt/CO9_AMM15/BUILD_CFG/4.0.4/tools/DOMAINcfg/BLD/bin/make_domain_cfg.exe() [0x317a54]
5 /work/n01/n01/jelt/CO9_AMM15/BUILD_CFG/4.0.4/tools/DOMAINcfg/BLD/bin/make_domain_cfg.exe() [0x30262d]
6 /work/n01/n01/jelt/CO9_AMM15/BUILD_CFG/4.0.4/tools/DOMAINcfg/BLD/bin/make_domain_cfg.exe() [0x21bb08]
7 /work/n01/n01/jelt/CO9_AMM15/BUILD_CFG/4.0.4/tools/DOMAINcfg/BLD/bin/make_domain_cfg.exe() [0x2199a9]
8 /work/n01/n01/jelt/CO9_AMM15/BUILD_CFG/4.0.4/tools/DOMAINcfg/BLD/bin/make_domain_cfg.exe() [0x219988]
9 /lib64/libc.so.6(__libc_start_main+0xea) [0x2b0552af434a]
10 /work/n01/n01/jelt/CO9_AMM15/BUILD_CFG/4.0.4/tools/DOMAINcfg/BLD/bin/make_domain_cfg.exe() [0x2198ea]
=================================
srun: error: nid001255: task 0: Segmentation fault
srun: Terminating job step 428453.0
slurmstepd: error: *** STEP 428453.0 ON nid001255 CANCELLED AT 2021-08-03T16:55:08 ***
srun: error: nid001255: task 1: Terminated
srun: error: nid001256: task 3: Terminated
srun: error: nid001258: tasks 6-7: Terminated
srun: error: nid001257: tasks 4-5: Terminated
srun: error: nid001256: task 2: Terminated
srun: Force Terminated job step 428453.0
I wonder if it is something naive I've done with configuring the namelist_cfg file? Can you send your namelist_cfg? I've been using https://github.com/JMMP-Group/CO9_AMM15/blob/ME_domaincfg/BUILD_CFG/DOMAIN/ME_DOMAINcfg_namelist_cfg
Regarding differences between your bathy+env file and mine, your amm15_MEs_2env_0.24_0.07_opt.inp file has
e_glo_rmx = [ 0.1, 0.07]
which I think should be
e_glo_rmx = [ 0.24, 0.07]
However with your *inp file I spotted that I perhaps should not specify velocity variables if not specifying velocity files. This led to ... (not sure yet as it is sitting in a queue and I don't want to lose this window)
@jpolton you can find the namelist I used in /gws/nopw/j04/jmmp_collab/tmp_diego_amm15-MEs/.
Sorry, but it is not clear to me: did you have to comment out the diagnostics this time or not? Would it be possible to send me your ocean.output please?
I think in the case of amm15 it should be
e_glo_rmx = [ 0.1, 0.07]
since the actual operational amm15 has rmax=0.1
I did the new experiments with a fresh mes.F90, without commenting out the diagnostics. The new tests stop within the diagnostics. (I find this a bit odd, but I haven't put print statements in to see what is now happening in mes.F90.)
I attached my ocean.output.txt
I have updated the steps to use your namelist_cfg, and upper env rmax=0.1 in e_glo_rmx = [0.1, 0.07]
I have updated all my wiki recipe and repo branch:ME_domaincfg to replicate the latest of everything I did on ARCHER2.
I have run out of time on this so have passed the torch to @anwiseNOCL
Hi @jpolton,
OK, happy to support @anwiseNOCL if needed :)
However, looking at your ocean.output I can see that the problem is no longer the call to mppbcast_a_real; using a correct bathy+env file seems to have solved that problem, as on Monsoon.
Also, comparing my ocean.output on Monsoon and yours from ARCHER2, they are exactly identical up to the point where the ARCHER2 run fails, which is a simple print ... the seg fault message doesn't help much, I would add some printing to debug ...
@jpolton @oceandie I've managed to get this working on ARCHER, but only with rmax=0.1 in the upper envelope. When I build with 0.24 the irnk variable is set to -1, which means that IF ((mi0(eii)>1 .AND. mi0(eii)<jpi) .AND. (mj0(eij)>1 .AND. mj0(eij)<jpj)) THEN is evaluating to false. I don't really know what that means. When the diagnostics are on, that throws the MPICH error; when the diagnostics are commented out, a seg fault is thrown. So clearly that is not good.
I think I remember that AMM15 was going with rmax=0.1 anyway, but @oceandie maybe it's worth you trying rmax=0.24 on Monsoon to see if you get the same issue?
Should I hand the dom_cfg to @davbyr now so we can get the initial velocity fields? Hourly averages, I think, is what we had last time.
Thanks @anwiseNOCL
Out of curiosity, when you try to build with 0.24, are you using the old bathy+env file of Jeff or did you create a new one?
I just realised that the files I passed to you have the wrong name:
we should replace 0.24 with 0.10 to reflect the fact that the upper envelope uses rmax 0.10.
Well done @anwiseNOCL on getting that working. Could you update the wiki and branch to reflect any bugs you squashed in the workflow?
Before handing the domain_cfg.nc file to @davbyr we probably ought to make sure that a minimum depth of 10 m is applied, or he will certainly have trouble. But yes, the next stage is to hand it on for HPG testing and velocity extraction on this P0.0 bathy version with ME coordinates.
Edit: missed the new message.
@oceandie I created a new envelope file using your amm15.bathydepth.nc file from jasmin. Then that envelope file and the coordinates are the inputs for make_domain_cfg.exe. It works for rmax=0.1; when I change it to rmax=0.24 and recreate the envelope file, it throws the error.
@jpolton Yes, I'll run through it again and update the wiki in the process. On this 10 m depth thing: I just had rn_bot_min=10, so in my head that means the bathymetry in domain_cfg.nc is >= 10 m. Is that what we are after? I will check that this is what happens.
@jpolton , @anwiseNOCL
I think I solved the problem. Also on Monsoon, when using rmax=0.24 I get the MPI error. The problem was due to:
IF ((mi0(eii)>1 .AND. mi0(eii)<jpi) .AND. (mj0(eij)>1 .AND. mj0(eij)<jpj)) THEN
which should be really
IF ((mi0(eii)>=1 .AND. mi0(eii)<=jpi) .AND. (mj0(eij)>=1 .AND. mj0(eij)<=jpj)) THEN
In the case of AMM15 with rmax=0.24 the point where the global minimum is located is just on the boundary, so it wasn't picked by the old condition, breaking the code. With the new condition I am able to create domain_cfg.nc and mesh_mask.nc files also for AMM15_0.24.
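As a toy illustration of why the strict inequalities miss a boundary point (this is not the mes.F90 code, and the indices below are made up):

```python
# with strict inequalities, a global-extremum point sitting on the domain edge
# (index 1, or jpi/jpj) is claimed by no processor, so irnk keeps its initial -1
def owns_point(mi, mj, jpi, jpj, strict=True):
    if strict:
        return (1 < mi < jpi) and (1 < mj < jpj)
    return (1 <= mi <= jpi) and (1 <= mj <= jpj)

print(owns_point(1, 200, jpi=365, jpj=672, strict=True))   # False -> invalid bcast root
print(owns_point(1, 200, jpi=365, jpj=672, strict=False))  # True  -> owner found
```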
The updated revision of the code is DOMAINcfg@15172.
I think this potentially also means that the pre-proc of Jeff wasn't introducing any error, but we should double-check.
Regarding the pre-proc to apply the min_depth to the bathymetry, I think we should avoid this step for two reasons:
@jpolton @oceandie Using rn_bot_min=10 gives bathy_meter.min() = 7.908. I don't know why, or if this is a problem (is it something to do with W&D?), but I checked with Diego's file and it is the same.
Sorry I don't know why that is. (I think this is why I manually set a min depth using netcdf tools). Though surely it is nothing to do with W&D as that is not invoked at this stage of building.
Hi @anwiseNOCL @jpolton,
if you were checking the minimum of the bathy_meter variable in the domain_cfg.nc file, please be aware that bathy_meter is not the raw input bathymetry with the minimum and maximum depths applied, but the model bathymetry computed in domain.f90 as follows:
DO jj = 1,jpj
DO ji = 1,jpi
z2d (ji,jj) = SUM ( e3t_0(ji,jj, 1:mbkt(ji,jj) ) ) * ssmask(ji,jj)
END DO
END DO
CALL iom_rstput( 0, 0, inum, 'bathy_meter' , z2d , ktype = jp_r4 )
so it is the sum of "wet" e3t and not the bathymetry used to generate the vertical grid. Also, consider that this sum will not be exactly equal to the total depth of the bathymetry when the envelope is deeper than the real bathymetry.
This bathy_meter variable in the domain_cfg file is part of the standard DOMAINcfg code. If you want I can modify it to output the real bathymetry used to generate the vertical grid to check that the minimum is applied.
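As a sanity check of the point above, the bathy_meter field can be reproduced offline by summing the wet e3t_0 down to the bottom level. A sketch assuming the domain_cfg.nc carries e3t_0 and bottom_level (variable names may differ in older outputs):

```python
import numpy as np
import xarray as xr

ds   = xr.open_dataset("domain_cfg.nc").squeeze()
e3t  = ds["e3t_0"].values            # (jpk, jpj, jpi) grid-cell thicknesses
mbkt = ds["bottom_level"].values     # deepest wet T-level, 0 over land

jpk  = e3t.shape[0]
klev = np.arange(1, jpk + 1)[:, None, None]        # 1-based level index
wet  = klev <= mbkt                                 # levels 1..mbkt are "wet"
model_depth = np.where(mbkt > 0, (e3t * wet).sum(axis=0), 0.0)

print(model_depth[model_depth > 0].min())           # compare with rn_bot_min
```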
@jpolton, @anwiseNOCL
I added the bathymetry in the domain_cfg.nc file. The revision of the code including this change is DOMAINcfg@15174.
If we check:
import xarray as xr
ds_dom = xr.open_dataset("domain_cfg.nc")
bathy = ds_dom["bathymetry"]
bathy = bathy.where(bathy>0)
bathy.min()
<xarray.DataArray 'bathymetry' ()>
array(10., dtype=float32)
@oceandie @jpolton What I was trying to ask @jpolton is why exactly we need the 10 m min depth? Is it a requirement for the W&D or something?
The reason I asked is that, if it is for W&D, then I would assume it needs to be a condition on the depth of the last wet w-level rather than on the input bathymetry file? And as Diego says, they are not the same. So how to proceed?
@anwiseNOCL the 10m minimum depth is a crude parameterization we use to deal with large tidal excursions in the absence of a W&D algorithm. I think the specific 10m value comes from AMM7 and was then passed to AMM15, but I'm not sure, maybe @jpolton knows more ...
With the W&D, we will finally get rid of this parameterization (hopefully), which has also quite an important negative impact on tidal dynamics (see e.g. Jenny's paper).
@oceandie @anwiseNOCL Yes: The 10m minimum depth stops the regular model breaking on spring tides, where the amplitude approaches 10m. This was set by trial and error in the Old Days. Wetting and Drying functionality should mean we don't need to set a minimum depth.
For the sake of closure we can generate the minimum depth using either the preprocessing python method (a la @endaodea) or the make_domain_cfg.exe namelist (a la @oceandie). Each has different effects. We have already started down the latter path, so we will probably stick with it. In the future everyone will use Wetting and Drying, so this distinction will not be so important. However, it is important now to archive the domain_cfg.nc file and state what we did.
Diego - tidying code on http://forge.ipsl.jussieu.fr/nemo/browser/NEMO/branches/UKMO/tools_r4.0-HEAD_dev_MEs?order=date&desc=1
Will alert all when done