I think it would be better to allow writing of trajectories while the run is ongoing, rather than relying on holding them all in memory, which might be smaller on a different machine.
In order to write trajectory data during the run, rather than at the end of the run, we need:
1) trajectory i/o to not clobber files;
2) the trajectories to be reset to zero length after writing;
3) optionally, to walk through the list of bergs before calling write_trajectory() and call move_traj() to detach trajectories from active bergs, similar to what is done in icebergs_end(). This is because write_trajectory() only writes from the trajectories list, which is populated only when a berg is destroyed (melts/leaves the PE/end of run). A rough sketch of this walk is below.
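For (3), the idea would be something like the following. This is only a sketch: the container/berg type names, the %first/%next/%trajectories fields, and the argument lists of move_traj() and write_trajectory() are assumptions patterned on how icebergs_end() reads, not the actual API.

    ! Sketch only: detach the trajectory from every live berg so that a mid-run
    ! write_trajectory() call sees the complete data, then write and reset.
    subroutine flush_trajectories(bergs)
      type(icebergs), pointer :: bergs      ! assumed container type
      type(iceberg), pointer :: this        ! assumed berg type with a %next pointer
      this => bergs%first
      do while (associated(this))
        call move_traj(this, bergs)         ! move this berg's trajectory onto the global list
        this => this%next
      enddo
      call write_trajectory(bergs%trajectories)
      bergs%trajectories => null()          ! point (2): reset, assuming write_trajectory()
                                            ! has deallocated the nodes it wrote
    end subroutine flush_trajectories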
Re (1), we could stop clobbering and always append. On the other hand, I think the diag_manager policy is to always clobber, so probably we should too?
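If we did go the append route instead, the distinction is easy to illustrate outside of FMS. A minimal sketch in plain netCDF-Fortran (the real code goes through mpp_io/fms_io, so this is not the actual call path):

    ! Sketch only: create the trajectory file on the first write,
    ! reopen it for appending on subsequent writes.
    program append_demo
      use netcdf
      implicit none
      integer :: ierr, ncid
      logical :: exists
      inquire(file='iceberg_trajectories.nc', exist=exists)
      if (exists) then
        ierr = nf90_open('iceberg_trajectories.nc', NF90_WRITE, ncid)       ! append to the existing file
      else
        ierr = nf90_create('iceberg_trajectories.nc', NF90_NOCLOBBER, ncid) ! first write creates it
      endif
      ierr = nf90_close(ncid)
    end program append_demo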
I did a little more experimenting. For 128 pes things work for a 4-month run, which has 5310 bergs (56M trajectory file), but the run hangs for a 5-month run, which has 6961 bergs (without even opening a trajectory file). The same happens for 256 pes, so it is not dependent on the # of pes. Moreover, the same happens if the ice io_layout is not 0,0.
How could writing a file on the order of 100M cause a hang? There must be some buffer-size environment variable that fixes this!
Totalview shows that the root pe (0) is stuck in the call chain icebergs_end -> write_trajectory -> ice_bergs_framework: unpack_traj_from_buffer2(), and all other pes are waiting for it at the end of coupler_restart().
I adjusted a namelist to increase the # of bergs artificially (initial_mass(1) = 8.8e4 and mass_scaling(1) = 1000). The model produced 48560 bergs in just 10 days, but it also finished successfully and wrote a 119M iceberg_trajectories.nc! Whatever is causing the hang does not like cheating!
Running in debug mode results in a crash in icebergs_save_restart before the model gets to writing trajectories. Here's the stack:
forrtl: severe (408): fort: (2): Subscript #1 of the array SBUF has value 1 which is greater than the upper bound of 0
fms_MOM6_SIS2_com 0000000005CC8FB7 mpp_mod_mp_mpp_ga 77 mpp_gather.h
fms_MOM6_SIS2_com 00000000059CCAA4 mpp_io_mod_mp_mpp 30 mpp_write_unlimited_axis.h
fms_MOM6_SIS2_com 0000000004A71046 fms_io_mod_mp_sav 2516 fms_io.F90
fms_MOM6_SIS2_com 0000000004A5A4CF fms_io_mod_mp_sav 2114 fms_io.F90
fms_MOM6_SIS2_com 0000000003A723EE ice_bergs_io_mp_w 211 icebergs_io.F90
fms_MOM6_SIS2_com 00000000039994C4 ice_bergs_mp_iceb 2152 icebergs.F90
fms_MOM6_SIS2_com 00000000033A275E ice_type_mod_mp_i 967 ice_type.F90
fms_MOM6_SIS2_com 00000000032021C4 ice_model_mod_mp_ 4071 ice_model.F90
fms_MOM6_SIS2_com 000000000040B603 coupler_main_IP_c 1662 coupler_main.F90
fms_MOM6_SIS2_com 0000000000404CEF MAIN__ 887 coupler_main.F90
Back to the prod-mode testing: I don't think it's a hang, nor is it in the I/O. The model is just soooo slow to "gather" onto the root pe (via mpp_send and mpp_recv) the buffer that it would then write to file.
I just ran a test on 128 pes that produced 40491 bergs in 5 days, and at the end it looked like it had hung. But after 15 minutes of waiting the model finished successfully and wrote a 52M trajectory file!
I can see (via totalview) that the poor root pe has received a massive number of blobs (single point-in-time iceberg data) from the pes in the tile and is sifting through them. It receives the trajectory of every iceberg that has ever come into existence (even ones that are not present anymore), which is something like N blobs (N = total # of timesteps) per iceberg. E.g., for a 5 day run the root pe has received 16000 blobs from just one pe, so in total something like one million blobs that have to be linked together to make the final trajectory file.
I think it is the traversal of the final linked list to find its tail (in order to append a new leaf) that is the issue here, if only because every time I halt the run I find it inside that traversal loop. There has to be a better way to do this:
subroutine append_posn(trajectory, posn_vals)
  ! This routine appends a new position leaf to the end of the given trajectory
  ! Arguments
  type(xyt), pointer :: trajectory
  type(xyt) :: posn_vals
  ! Local variables
  type(xyt), pointer :: new_posn, next, last

  allocate(new_posn)
  new_posn = posn_vals
  new_posn%next => null()
  if (.NOT. associated(trajectory)) then
    trajectory => new_posn
  else
    ! Find the end of the trajectory and point it to the new leaf
    next => trajectory
    do while (associated(next))
      last => next
      next => next%next
    enddo
    last%next => new_posn
  endif
end subroutine append_posn
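That walk is O(current trajectory length), so building up an n-point trajectory costs on the order of n^2/2 pointer dereferences in total, and the gathered lists here run to millions of nodes. Two standard fixes would be to keep a separate tail pointer per trajectory, or to insert new points at the head in O(1) and reverse the list once before writing. A minimal sketch of the head-insertion variant, reusing the xyt type above (the routine name is mine, not one from the code):

    ! Sketch only: O(1) insertion at the head of the trajectory list.
    ! The list ends up in reverse time order, so it must be reversed once before writing.
    subroutine prepend_posn(trajectory, posn_vals)
      type(xyt), pointer :: trajectory
      type(xyt) :: posn_vals
      type(xyt), pointer :: new_posn
      allocate(new_posn)
      new_posn = posn_vals
      new_posn%next => trajectory   ! old head (or null if the list was empty)
      trajectory => new_posn        ! the new node becomes the head
    end subroutine prepend_posn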
Solutions I can think of are:
The 2nd solution above seems to have worked. There is no more "hang", just a 2-minute delay to reverse a list of 3,241,686 berg nodes and write a 557MB trajectory file, for a 1-year run of Alon's model that was timing out before.
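For reference, reversing such a list is a single O(n) pass over the nodes. A minimal sketch, again on the xyt type (the routine name is mine):

    ! Sketch only: in-place reversal of a singly linked trajectory list.
    subroutine reverse_traj(trajectory)
      type(xyt), pointer :: trajectory
      type(xyt), pointer :: prev, curr, next
      prev => null()
      curr => trajectory
      do while (associated(curr))
        next => curr%next
        curr%next => prev   ! flip this node's link to point backwards
        prev => curr
        curr => next
      enddo
      trajectory => prev    ! the old tail is the new head
    end subroutine reverse_traj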
I'll make a pull request for review. The git branch is user/nnz/fix_trajectory_append_hang
I tried to keep running the model to see what happens. It's still bad. Look at what happens to the ratio of Termination time to Main loop time and to the size of the trajectory file:
year 1, bergs# 14996, size of traj file: 526M
    Main loop      6004.337127   6004.338530   6004.337372  0.000195  0.735  0  0  127
    Termination    2159.423813   2159.425217   2159.424867  0.000225  0.264  0  0  127
year 2, bergs# 24650, size of traj file: 1.3G
    Main loop      6102.388496   6102.389873   6102.388686  0.000448  0.315  0  0  127
    Termination   13233.006461  13233.007851  13233.007665  0.000452  0.684  0  0  127
This is for io_layout=0,0. I'll try to set an io_layout for the bergs to see what happens.
One thing that confuses me is how come I was getting only a few minutes for the 1st year above!
I did a more careful analysis of the timings by putting clocks around the trajectory preparation (the gather onto the io root pe) and the actual writing to file. Here are the results for a 2-year run of MOM6_GOLD_SIS2 (Alon's experiment) with 24650 bergs:
All times below are the max over pes, in seconds.

                                              stdout file                         Main loop  Termination  traj write  traj prepare
  1 file per processor, correct order (old)   MOM6_GOLD_SIS2_bergs__5.o5041140       4283.5         60.7        54.6           0.6
  1 file per processor, reverse order         MOM6_GOLD_SIS2_bergs__2.o5041136       4279.9         60.6        54.7           0.6
  1 file per io tile (16 tiles), reverse      MOM6_GOLD_SIS2_bergs__3.o5041138       4279.2       8390.0       109.4        8278.4
  1 file per io tile (16 tiles), correct      MOM6_GOLD_SIS2_bergs__4.o5041139       4278.2       7972.5       109.3        7857.9

(correct order = first location first; reverse order = last location first. "traj write" and "traj prepare" are the Icebergs-traj write and Icebergs-traj prepare clocks.)
The above stdouts are at /lustre/f1/Niki.Zadeh/ulm_mom6_2015.06.03/MOM6_GOLD_SIS2_bergs/ncrc2.intel-prod/stdout/run/
So as you can see, the old way of writing a trajectory file per processor is the best in terms of performance, and the new way of preparing the trajectories (meaning sending and gathering the blobs on the io-tile root) is simply not acceptable for long model runs (Termination time > Main loop time). We have to redesign the way we write these trajectories if we want to use the io layout.
Until then, I updated my branch code to do it the old way, controlled by a new namelist variable in icebergs_nml:
force_all_pes_traj = .true.
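In the run directory that presumably ends up in the usual FMS input.nml, something like:

    &icebergs_nml
        force_all_pes_traj = .true.   ! write one trajectory file per pe, bypassing the io_layout gather
    /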
This issue has not been seen again. Closing.
A 1-year-long run of an ocean-ice model hangs at the end of the run. stdout indicates that the last thing the model was doing was writing the iceberg trajectories:
The above was on 128 cores interactive on gaea.
Note that this happens for the longer runs that have a lot of icebergs at the end (this one has 15170). Since the trajectories stay in memory and get written to file only at the end of the run, this may indicate an issue with the I/O buffer. Is there a way to increase that buffer?