JMMP-Group / CO_AMM15

Coastal Ocean (CO) configuration of the Atlantic Margin Model (1.5 km resolution)

2. Tidy DOMAINcfg/tools NEMO4 code #25

Closed · jpolton closed 3 years ago

jpolton commented 3 years ago

Diego - tidying code on http://forge.ipsl.jussieu.fr/nemo/browser/NEMO/branches/UKMO/tools_r4.0-HEAD_dev_MEs?order=date&desc=1

Will alert all when done

oceandie commented 3 years ago

I think we have a good clean version of the DOMAINcfg-MEs code to start with. It is able to replicate the domain_cfg_MEs_L51_r24-07_opt_v2.nc file we used in the JMMP_AMM7 paper.

The code is here: DOMAINcfg-MEs@15134.

oceandie commented 3 years ago

The next step is to clean the python code that creates the envelopes. Almost there; I will alert you when ready.

jpolton commented 3 years ago

> The next step is to clean the python code that creates the envelopes. Almost there; I will alert you when ready.

Does the python code insert the hbatt_X variables into bathy_meter.nc?

oceandie commented 3 years ago

> The next step is to clean the python code that creates the envelopes. Almost there; I will alert you when ready.
>
> Does the python code insert the hbatt_X variables into bathy_meter.nc?

Yes, "hbatt_x" are the envelopes.

jpolton commented 3 years ago

@oceandie Can you send me a link to an AMM7 version of namelist_cfg, ideally one with 2 envelopes and 51 levels (I think)?

In particular, I am interested in seeing how we settled on the split of levels between env1 and env2. (As I understand it so far, I'm assuming that nn_slev is the only interesting/tuneable variable in &namzgr_mes with the MEs.)

(PS: the python script seems to work fine :-), but I've not 'reviewed' it yet as I wanted to see how it fits into the domain_cfg.nc building process.)

oceandie commented 3 years ago

@jpolton the following is the setup I used for JMMP-AMM7 work:

!-----------------------------------------------------------------------
&namrun        !   parameters of the run
!-----------------------------------------------------------------------
   nn_no       =       0   !  job number (no more used...)
   cn_exp      =  "domaincfg"  !  experience name
   nn_it000    =       1   !  first time step
   nn_itend    =      75   !  last  time step (std 5475)
/
!-----------------------------------------------------------------------
&namcfg        !   parameters of the configuration
!-----------------------------------------------------------------------
   !
   ln_e3_dep   = .true.    ! =T : e3=dk[depth] in discret sens.
   !                       !      ===>>> will become the only possibility in v4.0
   !                       ! =F : e3 analytical derivative of depth function
   !                       !      only there for backward compatibility test with v3.6
   !                       !
   cp_cfg      = 'amm7'
   jp_cfg      = 011
   jperio      = 0
   jpidta      = 297
   jpiglo      = 297
   jpjdta      = 375
   jpjglo      = 375
   jpkdta      = 51
   jpizoom     = 1
   jpjzoom     = 1
/
!-----------------------------------------------------------------------
&namzgr        !   vertical coordinate
!-----------------------------------------------------------------------
   ln_mes      = .true.    !  Multi-Envelope s-coordinate
   ln_linssh   = .true.    !  linear free surface
/
!-----------------------------------------------------------------------
&namzgr_mes    !   MEs-coordinate
!-----------------------------------------------------------------------
   ln_envl     =   .TRUE. , .TRUE. , .FALSE. , .FALSE., .FALSE.  ! (T/F) If the envelope is used
   nn_strt     =     2    ,   1    ,    1   ,   1    ,   1     ! Stretch. funct.: Madec 1996 (0) or
                                                               ! Song & Haidvogel 1994 (1) or                 
                                                               ! Siddorn & Furner 2012 (2)
   nn_slev     =    25    ,   26   ,   0    ,   0    ,   0     ! number of s-lev between env(n-1)
                                                               ! and env(n)
   rn_e_hc     =    20.0  ,    0.0 ,   0.0  ,   0.0  ,   0.0   ! critical depth for transition to
                                                               ! stretch. coord.
   rn_e_th     =     0.9  ,    1.0 ,   0.0  ,   0.0  ,   0.0   ! surf. control param.:
                                                               ! SH94 or MD96: 0<=th<=20
                                                               ! SF12: thickness surf. cell
   rn_e_bb     =    -0.3  ,    0.8 ,   0.0  ,   0.0  ,   0.0   ! bot. control param.:
                                                               ! SH94 or MD96: 0<=bb<=1
                                                               ! SF12: offset for calculating Zb
   rn_e_al     =     4.4  ,    0.0 ,   0.0  ,   0.0  ,   0.0   ! alpha stretching param with SF12
   rn_e_ba     =     0.064,    0.0 ,   0.0  ,   0.0  ,   0.0   ! SF12 bathymetry scaling factor for
                                                               ! calculating Zb
   rn_bot_min  = 10.0       ! minimum depth of the ocean bottom (>0) (m)
   rn_bot_max  = 5600.0     ! maximum depth of the ocean bottom (= ocean depth) (>0) (m)

   ln_loc_mes  = .FALSE.
/
!-----------------------------------------------------------------------
&namdom        !   space and time domain (bathymetry, mesh, timestep)
!-----------------------------------------------------------------------
   nn_bathy    =  1
   nn_msh      =  1
   jphgr_msh   =  0
   ldbletanh   = .TRUE.
   ppglam0     =  999999.d0
   ppgphi0     =  999999.d0
   ppe1_deg    =  999999.d0
   ppe2_deg    =  999999.d0
   ppe1_m      =  999999.d0
   ppe2_m      =  999999.d0
   ppa0        =  103.9530096000000
   ppa1        =  2.415951269000000
   ppa2        =  100.7609285000000
   ppacr       =  7.0
   ppacr2      =  13.0
   ppdzmin     =  999999.0
   pphmax      =  999999.0
   ppkth       =  15.35101370000000
   ppkth2      =  48.02989372000000
   ppsur       = -3958.951371276829
   rn_atfp     =  0.1
   rn_e3zps_min=  25.0
   rn_e3zps_rat=  0.2
   rn_hmin     = -8.0
   rn_rdt      =  1350.0
/

I would say that nn_slev is surely one of the most important parameters, as is the case for every discretised domain. However, as with standard sigma/s-levels, in MEs it is also possible to choose and tune the stretching for every subdomain, which strongly affects how well we reproduce particular processes (e.g. BBL dynamics).

Particular attention also has to be paid to odd subdomains, where levels are computed automatically using cubic splines, enforcing monotonicity and continuity of the level distribution and its first derivative. We can have a chat if you have questions or doubts on this point ... I hope this helps.
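
As a trivial sanity check on the namelist above (just a sketch, using the numbers quoted in this comment), the per-envelope entries of nn_slev have to add up to the total number of vertical levels:

# values taken from &namzgr_mes and &namcfg above
nn_slev = [25, 26, 0, 0, 0]   # s-levels between env(n-1) and env(n)
jpkdta = 51                   # total number of vertical levels
assert sum(nn_slev) == jpkdta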

oceandie commented 3 years ago

For the python code to build the envelopes, do you have the velocity files I used to tune the AMM7 domain? ... if not, I can put them on jasmin if you want to reproduce an exact copy of the JMMP-AMM7 MEs domain_cfg.nc.

jpolton commented 3 years ago

> For the python code to build the envelopes, do you have the velocity files I used to tune the AMM7 domain? ... if not, I can put them on jasmin if you want to reproduce an exact copy of the JMMP-AMM7 MEs domain_cfg.nc.

No need. I am jumping straight into AMM15, so I am doing a first pass with an empty list of velocity files.

jpolton commented 3 years ago

@oceandie line 565 of http://forge.ipsl.jussieu.fr/nemo/browser/NEMO/branches/UKMO/tools_r4.0-HEAD_dev_MEs/DOMAINcfg/src/mes.F90?order=date&desc=1#L565 introduces the call IF( lk_mpp ) CALL mppbcast_a_real(rn_ebot_max, max_nn_env, irnk ), which breaks my ARCHER2 run with opaque MPI errors. I cannot see this call anywhere else in the NEMO source code, so I suspect I have not compiled it correctly for the picky Cray compiler. I will poke around to see if I can find an alternative, but perhaps from your experience of writing this you have some suggestions?

oceandie commented 3 years ago

@jpolton I introduced the mppbcast_a_real subroutine in the DOMAINcfg code: I needed it to pass the same global maximum depth of each envelope to all processors. It is used only for diagnostics, so it is not strictly needed (we could comment out that diagnostic part if we cannot solve the problem).

If you could send me the error output, I can check whether it resembles the errors I got during development on our HPC ... but I don't guarantee I can solve it, I am still learning how to deal properly with MPI ... maybe James could help??

jpolton commented 3 years ago

> @jpolton I introduced the mppbcast_a_real subroutine in the DOMAINcfg code: I needed it to pass the same global maximum depth of each envelope to all processors. It is used only for diagnostics, so it is not strictly needed (we could comment out that diagnostic part if we cannot solve the problem).
>
> If you could send me the error output, I can check whether it resembles the errors I got during development on our HPC ... but I don't guarantee I can solve it, I am still learning how to deal properly with MPI ... maybe James could help??

Here is the error:

PMPI_Bcast(454): MPI_Bcast(buf=0x7ffd53623900, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000002) failed
PMPI_Bcast(414): Invalid root (value given was -1)

aborting job:
Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffd53623900, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000002) failed
PMPI_Bcast(414): Invalid root (value given was -1)
MPICH ERROR [Rank 1] [job id 426470.0] [Mon Aug  2 22:06:40 2021] [nid001015] - Abort(604599047) (rank 1 in comm 0): Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffdb2d34760, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)

aborting job:
Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffdb2d34760, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)
[1627938400.142322] [nid001015:56637:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 20000dd3d apid 20000dd3d is not released, refcount 1
[1627938400.142338] [nid001015:56637:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 20000dd3c apid 10000dd3d is not released, refcount 1
[1627938400.147733] [nid001015:56636:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 20000dd3d apid 20000dd3c is not released, refcount 1
[1627938400.147744] [nid001015:56636:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 20000dd3c apid 10000dd3c is not released, refcount 1
MPICH ERROR [Rank 7] [job id 426470.0] [Mon Aug  2 22:06:40 2021] [nid001028] - Abort(738816775) (rank 7 in comm 0): Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7fffa45f5e20, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)

aborting job:
Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7fffa45f5e20, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)
[1627938400.142338] [nid001028:58828:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 20000e5cd apid 20000e5cc is not released, refcount 1
[1627938400.142354] [nid001028:58828:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 20000e5cc apid 10000e5cc is not released, refcount 1
MPICH ERROR [Rank 5] [job id 426470.0] [Mon Aug  2 22:06:40 2021] [nid001027] - Abort(67728135) (rank 5 in comm 0): Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffead771960, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)

aborting job:
Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffead771960, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)
[1627938400.142346] [nid001027:216092:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 200034c1d apid 200034c1c is not released, refcount 1
[1627938400.142362] [nid001027:216092:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 200034c1c apid 100034c1c is not released, refcount 1
[1627938400.142701] [nid001028:58829:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 20000e5cd apid 20000e5cd is not released, refcount 1
[1627938400.142716] [nid001028:58829:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 20000e5cc apid 10000e5cd is not released, refcount 1
[1627938400.142752] [nid001027:216093:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 200034c1d apid 200034c1d is not released, refcount 1
[1627938400.142767] [nid001027:216093:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 200034c1c apid 100034c1d is not released, refcount 1
MPICH ERROR [Rank 3] [job id 426470.0] [Mon Aug  2 22:06:40 2021] [nid001026] - Abort(873034503) (rank 3 in comm 0): Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffc24306200, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)

aborting job:
Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffc24306200, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)
[1627938400.142345] [nid001026:71640:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 2000117d8 apid 2000117d8 is not released, refcount 1
[1627938400.142361] [nid001026:71640:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 2000117d7 apid 1000117d8 is not released, refcount 1
[1627938400.142778] [nid001026:71639:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 2000117d8 apid 2000117d7 is not released, refcount 1
[1627938400.142794] [nid001026:71639:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 2000117d7 apid 1000117d7 is not released, refcount 1
srun: error: nid001015: tasks 0-1: Exited with exit code 255
srun: Terminating job step 426470.0
srun: error: nid001027: tasks 4-5: Exited with exit code 255
srun: error: nid001026: tasks 2-3: Exited with exit code 255
srun: error: nid001028: tasks 6-7: Exited with exit code 255

I can't find any NEMO examples where this sort of function is performed (there must be some). I will try commenting out the diagnostics for now and see how it goes.

oceandie commented 3 years ago

Hi @jpolton, I think I know what the problem is:

PMPI_Bcast(454): MPI_Bcast(buf=0x7ffd53623900, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000002) failed
PMPI_Bcast(414): Invalid root (value given was -1)

I am passing a rank which is negative. I do this on purpose in order to identify the processor where the global maximum occurs. This worked on our HPC (running on Monsoon could be an option for you? ... I compiled it there and it works perfectly). I am sure there is a compilation flag to deal with this, or I can try to rewrite the code to take this into account ... let me work on it a bit! ;)

... in the meanwhile, I think that, as you said, if you comment out the diagnostic part it should work ...

I'll let you know as soon as I have some news ...

oceandie commented 3 years ago

@jpolton, actually a negative rank should never be passed to mppbcast_a_real: the subroutine mpp_max() should take care of it ... apparently it is not working on ARCHER2 ... could you please add

WRITE(numout,*) 'irnk = ', irnk

after L564 of mes.F90 and run the program?

mpp_max() is a standard NEMO subroutine; it should return the maximum of a scalar or an array (integer or real) ... I'll investigate a bit more in the meanwhile.
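
For what it's worth, the intended pattern is roughly the following (a Python/mpi4py sketch for illustration only, not the actual mes.F90 code): each rank sets irnk to its own rank if it holds the global maximum and to -1 otherwise, the mpp_max reduction then leaves the owning rank on every processor, and that value is used as the broadcast root. If the reduction is skipped or fails, the root stays at -1 and MPI_Bcast aborts with exactly the "Invalid root" error shown above.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = np.random.rand(1000)                      # this rank's piece of the field
local_max = float(local.max())
glob_max = comm.allreduce(local_max, op=MPI.MAX)  # global maximum over all ranks

# rank that owns the global maximum; -1 on every other rank
irnk = rank if local_max == glob_max else -1
irnk = comm.allreduce(irnk, op=MPI.MAX)           # plays the role of mpp_max(irnk)

# broadcast five diagnostic values from the owning rank (count=5, as in the error log);
# if irnk were still -1 here, the broadcast would fail with "Invalid root"
diag = comm.bcast(np.full(5, glob_max) if rank == irnk else None, root=irnk)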

oceandie commented 3 years ago

Also, another thing to try:

Could you please move

IF( lk_mpp ) CALL mppsync

after

IF( lk_mpp ) CALL mpp_max(irnk)

and see if this solves it? ...

jpolton commented 3 years ago

I will try the above. In the meanwhile I ran make_domain_cfg.exe with the diagnostics in mes.F90 commented out. I get mesh_mask_????.nc output files. Looking at one of them, I was surprised not to see any e3t etc. This is odd, right?

ncdump -h mesh_mask_0004.nc 
netcdf mesh_mask_0004 {
dimensions:
    x = 365 ;
    y = 672 ;
    z = 51 ;
    t = UNLIMITED ; // (1 currently)
variables:
    float nav_lon(y, x) ;
    float nav_lat(y, x) ;
    float nav_lev(z) ;
    double time_counter(t) ;
    byte tmask(t, z, y, x) ;
    byte umask(t, z, y, x) ;
    byte vmask(t, z, y, x) ;
    byte fmask(t, z, y, x) ;
    byte tmaskutil(t, y, x) ;

// global attributes:
        :DOMAIN_number_total = 8 ;
        :DOMAIN_number = 4 ;
        :DOMAIN_dimensions_ids = 1, 2 ;
        :DOMAIN_size_global = 1458, 1345 ;
        :DOMAIN_size_local = 365, 672 ;
        :DOMAIN_position_first = 1, 674 ;
        :DOMAIN_position_last = 365, 1345 ;
        :DOMAIN_halo_size_start = 0, 0 ;
        :DOMAIN_halo_size_end = 0, 0 ;
        :DOMAIN_type = "BOX" ;

BTW the ARCHER2 queue is slow now that everyone is awake...

oceandie commented 3 years ago

@jpolton yes, this is odd indeed. There are two explanations I can think of:

  1. The job has not finished yet: are the domain_cfg_xxxx.nc files produced?
  2. The program didn't work properly.

Would it be possible for you to share the bathy_meter.nc and coordinates.nc files you are using to build the domain_cfg.nc file with me on jasmin?

jpolton commented 3 years ago

Regarding the previous mesh_mask query: the run did not finish properly, so that probably explains it.

I ran with the new print statement and with the mppsync and mpp_max calls swapped. It still fails: irnk = -1.

Crash log:

less slurm-427418.out

MPICH ERROR [Rank 0] [job id 427418.0] [Tue Aug  3 12:05:04 2021] [nid001131] - Abort(738816775) (rank 0 in comm 0): Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffff0af83c0, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000002) failed
PMPI_Bcast(414): Invalid root (value given was -1)

aborting job:
Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffff0af83c0, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000002) failed
PMPI_Bcast(414): Invalid root (value given was -1)
 irnk =  -1
 irnk =  -1
 irnk =  -1
MPICH ERROR [Rank 6] [job id 427418.0] [Tue Aug  3 12:05:04 2021] [nid001206] - Abort(269054727) (rank 6 in comm 0): Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffd1d3c6bc0, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000002) failed
PMPI_Bcast(414): Invalid root (value given was -1)

aborting job:
Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffd1d3c6bc0, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000002) failed
PMPI_Bcast(414): Invalid root (value given was -1)
 irnk =  -1
MPICH ERROR [Rank 7] [job id 427418.0] [Tue Aug  3 12:05:04 2021] [nid001206] - Abort(134836999) (rank 7 in comm 0): Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7fff8dd2e9c0, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)

aborting job:
Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7fff8dd2e9c0, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)
[1627988704.878449] [nid001206:244032:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 20003b940 apid 20003b940 is not released, refcount 1
[1627988704.878457] [nid001206:244032:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 20003b93f apid 10003b940 is not released, refcount 1
MPICH ERROR [Rank 2] [job id 427418.0] [Tue Aug  3 12:05:04 2021] [nid001197] - Abort(537490183) (rank 2 in comm 0): Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7fff3cba9f00, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000002) failed
PMPI_Bcast(414): Invalid root (value given was -1)

aborting job:
Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7fff3cba9f00, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000002) failed
PMPI_Bcast(414): Invalid root (value given was -1)
 irnk =  -1
MPICH ERROR [Rank 3] [job id 427418.0] [Tue Aug  3 12:05:04 2021] [nid001197] - Abort(1007252231) (rank 3 in comm 0): Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7fff65f433e0, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)

aborting job:
Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7fff65f433e0, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)
[1627988704.878509] [nid001197:22671:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 200005890 apid 20000588f is not released, refcount 1
[1627988704.878515] [nid001197:22671:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 20000588f apid 10000588f is not released, refcount 1
MPICH ERROR [Rank 4] [job id 427418.0] [Tue Aug  3 12:05:04 2021] [nid001198] - Abort(67728135) (rank 4 in comm 0): Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffd48e652e0, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000002) failed
PMPI_Bcast(414): Invalid root (value given was -1)

aborting job:
Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffd48e652e0, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000002) failed
PMPI_Bcast(414): Invalid root (value given was -1)
 irnk =  -1
MPICH ERROR [Rank 5] [job id 427418.0] [Tue Aug  3 12:05:04 2021] [nid001198] - Abort(403272455) (rank 5 in comm 0): Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffc9884d8a0, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)

aborting job:
Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffc9884d8a0, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)
[1627988704.878990] [nid001198:180810:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 20002c24b apid 20002c24a is not released, refcount 1
[1627988704.878996] [nid001198:180810:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 20002c24a apid 10002c24a is not released, refcount 1
[1627988704.879006] [nid001198:180811:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 20002c24b apid 20002c24b is not released, refcount 1
[1627988704.879011] [nid001198:180811:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 20002c24a apid 10002c24b is not released, refcount 1
[1627988704.878786] [nid001206:244031:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 20003b940 apid 20003b93f is not released, refcount 1
[1627988704.878791] [nid001206:244031:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 20003b93f apid 10003b93f is not released, refcount 1
[1627988704.878837] [nid001197:22672:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 200005890 apid 200005890 is not released, refcount 1
[1627988704.878841] [nid001197:22672:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 20000588f apid 100005890 is not released, refcount 1
 irnk =  -1
MPICH ERROR [Rank 1] [job id 427418.0] [Tue Aug  3 12:05:04 2021] [nid001131] - Abort(470381319) (rank 1 in comm 0): Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffdf4180340, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)

aborting job:
Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7ffdf4180340, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000001) failed
PMPI_Bcast(414): Invalid root (value given was -1)
[1627988704.878846] [nid001131:172482:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 20002a1c1 apid 10002a1c2 is not released, refcount 1
[1627988704.878853] [nid001131:172482:0]       mm_xpmem.c:82   UCX  WARN  remote segment id 20002a1c2 apid 20002a1c2 is not released, refcount 1
srun: error: nid001197: tasks 2-3: Exited with exit code 255
srun: Terminating job step 427418.0
srun: error: nid001131: task 1: Exited with exit code 255
srun: error: nid001198: tasks 4-5: Exited with exit code 255
srun: error: nid001206: tasks 6-7: Exited with exit code 255
slurmstepd: error: *** STEP 427418.0 ON nid001131 CANCELLED AT 2021-08-03T12:05:05 ***
srun: error: nid001131: task 0: Terminated
srun: Force Terminated job step 427418.0

I will copy the bathymetry and coordinates files to JASMIN so you can have a go and see whether it is a Jeff error or an ARCHER issue.

jpolton commented 3 years ago

@oceandie Boo hoo, it is not working for me and ARCHER2 is now painfully slow :-( I've copied the key files to JASMIN; perhaps you can see if it works on MONSooN:

/gws/nopw/j04/jmmp_collab/tmp_jelt_diego less README

amm15.coordinates.nc -- raw coordinates file
amm15.bathydepth.hook.nc -- raw bathy file
bathymetry.MEs_2env_0.24_0.07_opt.nc - output from running generate_envelopes.py, and some preprocessing steps in https://github.com/JMMP-Group/CO9_AMM15/wiki/Make-ME-domain-configuration-file

If you want to see what I've been doing it is on the branch ME_domaincfg

oceandie commented 3 years ago

@jpolton OK, I will have a go now, I'll let you know how it goes ;)

oceandie commented 3 years ago

Hi @jpolton,

I think I solved the problem:

  1. If I use the bathymetry+envelopes file created by you (i.e. bathymetry.MEs_2env_0.24_0.07_opt.nc) I obtain the same error that you got on ARCHER2, i.e.
    PMPI_Bcast(454): MPI_Bcast(buf=0x7ffd53623900, count=5, MPI_DOUBLE_PRECISION, root=-1, comm=comm=0x84000002) failed
    PMPI_Bcast(414): Invalid root (value given was -1)
  2. Then, I decided to recreate the bathymetry+envelopes file to see if the problem is there. I used the following steps:
     a. ncks -C -v Bathymetry amm15.bathydepth.hook.nc amm15.bathydepth.nc, in order to avoid xarray complaining about dimensions.
     b. python generate_envelopes.py amm15_MEs_2env_0.24_0.07_opt.inp, where amm15_MEs_2env_0.24_0.07_opt.inp is the input file adapted for AMM15. The output of this is bathymetry.amm15_MEs_2env_0.24_0.07_opt.nc.

Then I used this new bathymetry.amm15_MEs_2env_0.24_0.07_opt.nc with the DOMAINcfg tool and I was able to create correct mesh_mask.nc and domain_cfg.nc files.

Maybe the pre-processing that you applied introduced something that disturbed the code ...

Anyway, I will put my files

  1. amm15.bathydepth.nc
  2. amm15_MEs_2env_0.24_0.07_opt.inp
  3. bathymetry.amm15_MEs_2env_0.24_0.07_opt.nc
  4. mesh_mask.amm15_MEs_2env_0.24_0.07_opt.nc
  5. domain_cfg.amm15_MEs_2env_0.24_0.07_opt.nc

on jasmin at /home/users/dbruciaf/tmp_diego_amm15-MEs. Could you please tell me whether you are able to create domain_cfg and mesh_mask files with the new input files, and whether they are exactly the same as mine?
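
A quick way to compare them variable by variable could be something like this (just a sketch, assuming both domain_cfg.nc files open with xarray; adjust the paths to your own copies):

import xarray as xr

mine  = xr.open_dataset("domain_cfg.nc")                               # your own output
diego = xr.open_dataset("domain_cfg.amm15_MEs_2env_0.24_0.07_opt.nc")  # copy from jasmin

# largest absolute difference for every variable present in both files
for var in sorted(set(mine.data_vars) & set(diego.data_vars)):
    print(var, float(abs(mine[var] - diego[var]).max()))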

jpolton commented 3 years ago

Didn't mean to close this ticket with the python code merge.

jpolton commented 3 years ago

@oceandie I have a permissions issue:

[jelt@xfer1 users]$ cd dbruciaf/
-bash: cd: dbruciaf/: Permission denied

Can you change them or put them in /gws/nopw/j04/jmmp_collab/?

oceandie commented 3 years ago

> @oceandie I have a permissions issue:
>
> [jelt@xfer1 users]$ cd dbruciaf/
> -bash: cd: dbruciaf/: Permission denied
>
> Can you change them or put them in /gws/nopw/j04/jmmp_collab/?

@jpolton, done, they are in /gws/nopw/j04/jmmp_collab/tmp_diego_amm15-MEs :)

jpolton commented 3 years ago

Using your bathymetry.amm15_MEs_2env_0.24_0.07_opt.nc and a fresh checkout and build of make_domain_cfg.exe, I get the mesh_mask files (which I previously only got after commenting out the diagnostics in mes.F90). I also get further through the process. It looks like it stops arbitrarily in the lower envelope:

less ocean.output
...
 -----------------------------------------------------------------

       MEs-levels depths and scale factors

       k     gdepw1      e3w1       gdept1      e3t1

     1,  0.,  0.41666666666666663,  0.20833333333333331,  0.41666666666666663
     2,  0.41666666666666663,  0.41666666666666669,  0.625,  0.41666666666666663
     3,  0.83333333333333326,  0.41666666666666674,  1.0416666666666667,  0.41666666666666674
     4,  1.25,  0.41666666666666674,  1.4583333333333335,  0.41666666666666652
     5,  1.6666666666666665,  0.41666666666666652,  1.875,  0.41666666666666696
     6,  2.0833333333333335,  0.41666666666666652,  2.2916666666666665,  0.41666666666666652
     7,  2.5,  0.41666666666666652,  2.708333333333333,  0.41666666666666696
     8,  2.916666666666667,  0.41666666666666696,  3.125,  0.41666666666666607
     9,  3.333333333333333,  0.41666666666666696,  3.541666666666667,  0.41666666666666696
     10,  3.75,  0.41666666666666607,  3.958333333333333,  0.41666666666666696
     11,  4.166666666666667,  0.41666666666666696,  4.375,  0.41666666666666607
     12,  4.583333333333333,  0.41666666666666696,  4.791666666666667,  0.41666666666666696
     13,  5.,  0.41666666666666696,  5.2083333333333339,  0.41666666666666607
     14,  5.4166666666666661,  0.41666666666666607,  5.625,  0.41666666666666785
     15,  5.8333333333333339,  0.41666666666666607,  6.0416666666666661,  0.41666666666666607
     16,  6.25,  0.41666666666666785,  6.4583333333333339,  0.41666666666666607
     17,  6.6666666666666661,  0.41666666666666607,  6.875,  0.41666666666666785
     18,  7.0833333333333339,  0.41666666666666607,  7.2916666666666661,  0.41666666666666607
     19,  7.5,  0.41666666666666785,  7.7083333333333339,  0.41666666666666607
     20,  7.9166666666666661,  0.41666666666666607,  8.125,  0.41666666666666785
     21,  8.3333333333333339,  0.41666666666666607,  8.5416666666666661,  0.41666666666666607
     22,  8.75,  0.41666666666666785,  8.9583333333333339,  0.41666666666666607
     23,  9.1666666666666661,  0.41666666666666607,  9.375,  0.41666666666666785
     24,  9.5833333333333339,  0.41666666666666607,  9.7916666666666661,  0.41666666666666607
     25,  10.,  0.44893752898995309,  10.240604195656619,  0.59813355512192423
     26,  10.598133555121924,  0.8346313245425474,  11.075235520199167,  1.0762266956483639
     27,  11.674360250770288,  1.3225044819382106,  12.397740002137377,  1.5730088158643518
     28,  13.24736906663464,  1.8272444801782672,  14.224984482315644,  2.0846786475108079
     29,  15.332047714145448,  2.3447430306400285,  16.569727512955673,  2.6068364401277062
     30,  17.938884154273154,  2.8703277399238729,  19.440055252879546,  3.1345591852538561
     31,  21.07344333952701,  3.3988501207367001,  22.838905373616246,  3.662501010366185
     32,  24.735944349893195,  3.9247977648547909,  26.763703138471037,  4.1850163260389728
     33,  28.920960675932168,  4.4424274627107607,  31.206130601181798,  4.6963017275146051
     34,  33.617262403446773,  4.9459145205562365,  36.152045121738034,  5.1905512022275531
     35,  38.807813605674326,  5.4295121955568675,  41.581557317294902,  5.6621180172234915
     36,  44.469931622897818,  5.8877141762826923,  47.469271493577594,  6.105

With a seg fault:

[nid001255:35553:0:35553] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ebd399b10)
==== backtrace (tid:  35553) ====
 0  /opt/cray/cray-ucx/2.6.0-3/ucx/lib/libucs.so.0(ucs_handle_error+0x104) [0x2b0554339f74]
 1  /opt/cray/cray-ucx/2.6.0-3/ucx/lib/libucs.so.0(+0x2435c) [0x2b055433a35c]
 2  /opt/cray/cray-ucx/2.6.0-3/ucx/lib/libucs.so.0(+0x245c4) [0x2b055433a5c4]
 3  /lib64/libc.so.6(+0x395a0) [0x2b0552b095a0]
 4  /work/n01/n01/jelt/CO9_AMM15/BUILD_CFG/4.0.4/tools/DOMAINcfg/BLD/bin/make_domain_cfg.exe() [0x317a54]
 5  /work/n01/n01/jelt/CO9_AMM15/BUILD_CFG/4.0.4/tools/DOMAINcfg/BLD/bin/make_domain_cfg.exe() [0x30262d]
 6  /work/n01/n01/jelt/CO9_AMM15/BUILD_CFG/4.0.4/tools/DOMAINcfg/BLD/bin/make_domain_cfg.exe() [0x21bb08]
 7  /work/n01/n01/jelt/CO9_AMM15/BUILD_CFG/4.0.4/tools/DOMAINcfg/BLD/bin/make_domain_cfg.exe() [0x2199a9]
 8  /work/n01/n01/jelt/CO9_AMM15/BUILD_CFG/4.0.4/tools/DOMAINcfg/BLD/bin/make_domain_cfg.exe() [0x219988]
 9  /lib64/libc.so.6(__libc_start_main+0xea) [0x2b0552af434a]
10  /work/n01/n01/jelt/CO9_AMM15/BUILD_CFG/4.0.4/tools/DOMAINcfg/BLD/bin/make_domain_cfg.exe() [0x2198ea]
=================================
srun: error: nid001255: task 0: Segmentation fault
srun: Terminating job step 428453.0
slurmstepd: error: *** STEP 428453.0 ON nid001255 CANCELLED AT 2021-08-03T16:55:08 ***
srun: error: nid001255: task 1: Terminated
srun: error: nid001256: task 3: Terminated
srun: error: nid001258: tasks 6-7: Terminated
srun: error: nid001257: tasks 4-5: Terminated
srun: error: nid001256: task 2: Terminated
srun: Force Terminated job step 428453.0

I wonder if it is something naive I've done when configuring the namelist_cfg file. Can you send your namelist_cfg? I've been using https://github.com/JMMP-Group/CO9_AMM15/blob/ME_domaincfg/BUILD_CFG/DOMAIN/ME_DOMAINcfg_namelist_cfg

Regarding differences between your bathy+env file and mine, your amm15_MEs_2env_0.24_0.07_opt.inp file has e_glo_rmx = [ 0.1, 0.07],
which I think should be e_glo_rmx = [ 0.24, 0.07].

However, from your *.inp file I spotted that I perhaps should not specify velocity variables if I'm not specifying velocity files. This led to ... (not sure yet, as it is sitting in a queue and I don't want to lose this window).

oceandie commented 3 years ago

@jpolton you can find the namelist I used in /gws/nopw/j04/jmmp_collab/tmp_diego_amm15-MEs/.

Sorry, but it is not clear to me: did you have to comment out the diagnostics this time or not? Would it be possible to send me your ocean.output, please?

I think in the case of amm15 it should be

e_glo_rmx = [ 0.1, 0.07]

since the actual operational amm15 has rmax=0.1

jpolton commented 3 years ago

> @jpolton you can find the namelist I used in /gws/nopw/j04/jmmp_collab/tmp_diego_amm15-MEs/.
>
> Sorry, but it is not clear to me: did you have to comment out the diagnostics this time or not? Would it be possible to send me your ocean.output, please?
>
> I think in the case of amm15 it should be
>
> e_glo_rmx = [ 0.1, 0.07]
>
> since the actual operational amm15 has rmax=0.1

I did the new experiments with a fresh mes.F90, without commenting out the diagnostics. The new tests stop within the diagnostics. (I find this a bit odd but haven't put print statements in to see what is now happening in mes.F90.)

I attached my ocean.output.txt

I have updated the steps to use your namelist_cfg and an upper envelope rmax=0.1 in e_glo_rmx = [0.1, 0.07]. I have updated my wiki recipe and the repo branch ME_domaincfg to replicate the latest of everything I did on ARCHER2. I have run out of time on this, so have passed the torch to @anwiseNOCL.

oceandie commented 3 years ago

Hi @jpolton,

OK, happy to support @anwiseNOCL if needed :)

However, looking at your ocean.output I can see that the problem is no longer the call to mppbcast_a_real; using a correct bathy+env file seems to have solved that problem, as on Monsoon.

Also, comparing my ocean.output on Monsoon and yours from ARCHER2, they are exactly identical up to the point where the ARCHER2 run fails, which is a simple print ... the seg fault message doesn't help much, I would add some print statements to debug ...

anwiseNOCL commented 3 years ago

@jpolton @oceandie I've managed to get this working on ARCHER, but only with rmax=0.1 in the upper envelope. When I build with 0.24, the irnk variable is set to -1, which means that IF ((mi0(eii)>1 .AND. mi0(eii)<jpi) .AND. (mj0(eij)>1 .AND. mj0(eij)<jpj)) THEN is evaluating to false. I don't really know what that means. When the diagnostics are on, that throws the MPICH error; when the diagnostics are commented out, a seg fault is thrown instead. So clearly that is not good.

I think I remember that AMM15 was going with rmax=0.1 anyway, but @oceandie maybe it is worth you trying rmax=0.24 on Monsoon to see if you get the same issue?

Should I hand the dom_cfg to @davbyr now so we can get the initial velocity fields? Hourly averages, I think, is what we had last time.

oceandie commented 3 years ago

Thanks @anwiseNOCL

Out of curiosity, when you try to build with 0.24, are you using Jeff's old bathy+env file or did you create a new one?

oceandie commented 3 years ago

I just realised that the files I passed to you have the wrong names:

  1. amm15_MEs2env0.24_0.07_opt.inp
  2. bathymetry.amm15_MEs2env0.24_0.07_opt.nc
  3. mesh_mask.amm15_MEs2env0.24_0.07_opt.nc
  4. domain_cfg.amm15_MEs2env0.24_0.07_opt.nc

We should replace 0.24 with 0.10 to reflect the fact that the upper envelope uses rmax 0.10.

jpolton commented 3 years ago

Well done @anwiseNOCL on getting that working. Could you update the wiki and branch to reflect any bugs you squashed in the workflow?

Before handing the domain_cfg.nc file to @davbyr we probably ought to make sure that a minimum depth of 10 m is applied, or he will certainly have trouble. But yes, the next stage is to hand it on for HPG testing and velocity extraction on this P0.0 bathy version with ME coordinates.

anwiseNOCL commented 3 years ago

Edit: I missed the new message.

@oceandie I created a new envelope file using your amm15.bathydepth.nc file from jasmin. That envelope file and the coordinates are then the inputs for make_domain_cfg.exe. It works for rmax=0.1; when I change it to rmax=0.24 and recreate the envelope file, it throws the error.

@jpolton Yes, I'll run through it again and update the wiki in the process. On this 10 m depth thing: I just had rn_bot_min=10, so in my head this means that in the domain_cfg.nc the bathymetry is >= 10. Is that what we are after? I will check that this is what happens.

oceandie commented 3 years ago

@jpolton , @anwiseNOCL

I think I solved the problem. Also on Monsoon, when using rmax=0.24 I get the MPI error. The problem was due to:

IF ((mi0(eii)>1 .AND. mi0(eii)<jpi) .AND. (mj0(eij)>1 .AND. mj0(eij)<jpj)) THEN

which should be really

IF ((mi0(eii)>=1 .AND. mi0(eii)<=jpi) .AND. (mj0(eij)>=1 .AND. mj0(eij)<=jpj)) THEN

In the case of AMM15 with rmax=0.24, the point where the global minimum is located sits right on the boundary, so it wasn't picked up by the old condition, breaking the code. With the new condition I am able to create domain_cfg.nc and mesh_mask.nc files for AMM15_0.24 as well.
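
To make the failure mode concrete (a toy illustration only, with made-up index values): with the strict inequalities, a global-minimum point whose indices fall on the edge of the domain is never claimed by any processor, so irnk keeps its initial value of -1.

jpi, jpj = 10, 8   # illustrative domain extents
ii, jj = 1, 4      # illustrative indices of the global-minimum point (on the boundary)

old_cond = (ii > 1 and ii < jpi) and (jj > 1 and jj < jpj)      # strict: False on the boundary
new_cond = (ii >= 1 and ii <= jpi) and (jj >= 1 and jj <= jpj)  # inclusive: True
print(old_cond, new_cond)  # -> False True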

The updated revision of the code is DOMAINcfg@15172.

I think this potentially also means that Jeff's pre-processing wasn't introducing any error, but we should double-check.

oceandie commented 3 years ago

Regarding the pre-proc to apply the min_depth to the bathymetry, I think we should avoid this step for two reasons:

  1. Envelopes should be based on the real bathymetry.
  2. As in standard NEMO coordinates, the rn_bot_min takes care of the minimum depth to apply.

anwiseNOCL commented 3 years ago

@jpolton @oceandie Using rn_bot_min=10 gives bathy_meter.min() = 7.908. I don't know why, or if this is a problem (is this something to do with W&D?), but I checked with Diego's file and it is the same.

jpolton commented 3 years ago

> @jpolton @oceandie Using rn_bot_min=10 gives bathy_meter.min() = 7.908. I don't know why, or if this is a problem (is this something to do with W&D?), but I checked with Diego's file and it is the same.

Sorry, I don't know why that is. (I think this is why I manually set a minimum depth using netCDF tools.) Though surely it is nothing to do with W&D, as that is not invoked at this stage of building.

oceandie commented 3 years ago

Hi @anwiseNOCL @jpolton,

if you were checking the minimum of the bathy_meter variable in the domain_cfg.nc file, please be aware that bathy_meter is not simply the raw input bathymetry with the minimum and maximum depths applied, but the model bathymetry computed in domain.f90 as follows:

DO jj = 1,jpj
      DO ji = 1,jpi
            z2d (ji,jj) = SUM ( e3t_0(ji,jj, 1:mbkt(ji,jj) ) ) * ssmask(ji,jj)
      END DO
END DO
CALL iom_rstput( 0, 0, inum, 'bathy_meter'   , z2d , ktype = jp_r4 )

so it is the sum of "wet" e3t and not the bathymetry used to generate the vertical grid. Also, consider that this sum will not be exactly equal to the total depth of the bathymetry when the envelope is deeper than the real bathymetry.

This bathy_meter variable in the domain_cfg file is part of the standard DOMAINcfg code. If you want, I can modify it to output the real bathymetry used to generate the vertical grid, so we can check that the minimum is applied.
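
For reference, the same "wet" sum can be reproduced offline from the fields already in domain_cfg.nc (a rough sketch, assuming e3t_0 and bottom_level are in the file, as in standard DOMAINcfg output; the vertical dimension name is inferred rather than hard-coded):

import numpy as np
import xarray as xr

ds   = xr.open_dataset("domain_cfg.nc")
e3t  = ds["e3t_0"].squeeze()          # cell thicknesses, roughly (z, y, x)
mbkt = ds["bottom_level"].squeeze()   # index of the deepest wet T-level

# vertical dimension = whatever e3t has that the 2D bottom_level field lacks
zdim = [d for d in e3t.dims if d not in mbkt.dims][0]
k = xr.DataArray(np.arange(1, e3t.sizes[zdim] + 1), dims=zdim)

# sum of "wet" e3t over levels 1..bottom_level, masked to ocean points
wet_depth = e3t.where(k <= mbkt).sum(zdim).where(mbkt > 0)
print(float(wet_depth.min()))   # compare with bathy_meter.min()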

oceandie commented 3 years ago

@jpolton, @anwiseNOCL

I added the bathymetry to the domain_cfg.nc file. The revision of the code including this change is DOMAINcfg@15174.

If we check:

import xarray as xr
ds_dom = xr.open_dataset("domain_cfg.nc")
bathy = ds_dom["bathymetry"]
bathy = bathy.where(bathy>0)
bathy.min()
<xarray.DataArray 'bathymetry' ()>
array(10., dtype=float32)

anwiseNOCL commented 3 years ago

@oceandie @jpolton What I was trying to ask @jpolton is why exactly we need the 10 m minimum depth? Is it a requirement for W&D or something?

The reason I asked is that, if it is for W&D, then I would assume it needs to be a condition on the depth of the last wet w-level rather than on the input bathymetry file? And as Diego says, they are not the same. So how to proceed?

oceandie commented 3 years ago

@anwiseNOCL the 10 m minimum depth is a crude parameterization we use to deal with large tidal excursions in the absence of a W&D algorithm. I think the specific 10 m value comes from AMM7 and was then passed on to AMM15, but I'm not sure, maybe @jpolton knows more ...

With W&D, we will (hopefully) finally get rid of this parameterization, which also has quite an important negative impact on tidal dynamics (see e.g. Jenny's paper).

jpolton commented 3 years ago

@oceandie @anwiseNOCL Yes: the 10 m minimum depth stops the regular model breaking on spring tides, where the amplitude approaches 10 m. This was set by trial and error in the Old Days. Wetting and Drying functionality should mean we don't need to set a minimum depth.

jpolton commented 3 years ago

For the sake of closure: we can apply the minimum depth using either the preprocessing python method (a la @endaodea) or the make_domain_cfg.exe namelist (a la @oceandie). Each has different effects. We have already started down the latter path, so we will probably stick with it. In the future everyone will use Wetting and Drying, so this distinction will not matter so much, but for now it is important to archive the domain_cfg.nc file and state what we did.