I think it would be better to allow writing of trajectories while the run is ongoing, rather than relying on holding them all in memory, which might be smaller on a different machine.
In order to write trajectory data during the run, rather than at the end of the run, we need:
1) trajectory i/o to not clobber files;
2) the trajectories to be reset to zero length after writing;
3) optionally, to walk through the list of bergs before calling write_trajectory() and call move_traj() to detach trajectories from active bergs, similar to what is done in icebergs_end(). This is because write_trajectory() only writes from the trajectories list, which is populated only when a berg is destroyed (melts/leaves the PE/end of run). A rough sketch of this walk is below.
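For (3), the idea would be something like the following. This is only a sketch: the container/berg type names, the %first/%next/%trajectories fields, and the argument lists of move_traj() and write_trajectory() are assumptions patterned on how icebergs_end() reads, not the actual API.

    ! Sketch only: detach the trajectory from every live berg so that a mid-run
    ! write_trajectory() call sees the complete data, then write and reset.
    subroutine flush_trajectories(bergs)
      type(icebergs), pointer :: bergs      ! assumed container type
      type(iceberg), pointer :: this        ! assumed berg type with a %next pointer
      this => bergs%first
      do while (associated(this))
        call move_traj(this, bergs)         ! move this berg's trajectory onto the global list
        this => this%next
      enddo
      call write_trajectory(bergs%trajectories)
      bergs%trajectories => null()          ! point (2): reset, assuming write_trajectory()
                                            ! has deallocated the nodes it wrote
    end subroutine flush_trajectories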
Re (1), we could stop clobbering and always append. On the other hand, I think the diag_manager policy is to always clobber, so probably we should too?
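If we did go the append route instead, the distinction is easy to illustrate outside of FMS. A minimal sketch in plain netCDF-Fortran (the real code goes through mpp_io/fms_io, so this is not the actual call path):

    ! Sketch only: create the trajectory file on the first write,
    ! reopen it for appending on subsequent writes.
    program append_demo
      use netcdf
      implicit none
      integer :: ierr, ncid
      logical :: exists
      inquire(file='iceberg_trajectories.nc', exist=exists)
      if (exists) then
        ierr = nf90_open('iceberg_trajectories.nc', NF90_WRITE, ncid)       ! append to the existing file
      else
        ierr = nf90_create('iceberg_trajectories.nc', NF90_NOCLOBBER, ncid) ! first write creates it
      endif
      ierr = nf90_close(ncid)
    end program append_demo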
I did a little more experimenting. For 128 pes things work for a 4-month run, which has 5310 bergs (56M trajectory file), but the run hangs for a 5-month run, which has 6961 bergs (without even opening a trajectory file). The same happens for 256 pes, so it is not dependent on the # of pes. Moreover, the same happens if the ice io_layout is not 0,0.
How could writing a file on the order of 100M cause a hang? There must be some buffer-size environment variable that fixes this!
Totalview shows that the root pe (0) is stuck in the call chain icebergs_end -> write_trajectory -> ice_bergs_framework: unpack_traj_from_buffer2(), and all other pes are waiting for it at the end of coupler_restart().
I adjusted a namelist to increase the # of bergs artificially (initial_mass(1) = 8.8e4 and mass_scaling(1) = 1000). The model produced 48560 bergs in just 10 days, but it also finished successfully and wrote a 119M iceberg_trajectories.nc! Whatever is causing the hang does not like cheating!
Running in debug mode results in a crash in icebergs_save_restart before the model gets to writing trajectories. Here's the stack:
forrtl: severe (408): fort: (2): Subscript #1 of the array SBUF has value 1 which is greater than the upper bound of 0
fms_MOM6_SIS2_com 0000000005CC8FB7 mpp_mod_mp_mpp_ga 77 mpp_gather.h
fms_MOM6_SIS2_com 00000000059CCAA4 mpp_io_mod_mp_mpp 30 mpp_write_unlimited_axis.h
fms_MOM6_SIS2_com 0000000004A71046 fms_io_mod_mp_sav 2516 fms_io.F90
fms_MOM6_SIS2_com 0000000004A5A4CF fms_io_mod_mp_sav 2114 fms_io.F90
fms_MOM6_SIS2_com 0000000003A723EE ice_bergs_io_mp_w 211 icebergs_io.F90
fms_MOM6_SIS2_com 00000000039994C4 ice_bergs_mp_iceb 2152 icebergs.F90
fms_MOM6_SIS2_com 00000000033A275E ice_type_mod_mp_i 967 ice_type.F90
fms_MOM6_SIS2_com 00000000032021C4 ice_model_mod_mp_ 4071 ice_model.F90
fms_MOM6_SIS2_com 000000000040B603 coupler_main_IP_c 1662 coupler_main.F90
fms_MOM6_SIS2_com 0000000000404CEF MAIN__ 887 coupler_main.F90
Back to the prod-mode testing: I don't think it's a hang, nor is it in the I/O. The model is just soooo slow to "gather" onto the root pe (via mpp_send and mpp_recv) the buffer that it would then write to file.
I just ran a test on 128 pes that produced 40491 bergs in 5 days, and at the end it looked like it had hung. But after 15 minutes of waiting the model finished successfully and wrote a 52M trajectory file!
I can see (via totalview) that the poor root pe has received a massive number of blobs (single point-in-time iceberg data) from the pes in the tile and is sifting through them. It receives the trajectory of every iceberg that has ever come into existence (even ones that are not present anymore), which is something like N blobs (N = total # of timesteps) per iceberg. E.g., for a 5 day run the root pe has received 16000 blobs from just one pe, so in total something like one million blobs that have to be linked together to make the final trajectory file.
I think it is the traversal of the final linked list to find its tail (in order to append a new leaf) that is the issue here, if only because every time I halt the run I find it inside that traversal loop. There has to be a better way to do this:
subroutine append_posn(trajectory, posn_vals)
  ! This routine appends a new position leaf to the end of the given trajectory
  ! Arguments
  type(xyt), pointer :: trajectory
  type(xyt) :: posn_vals
  ! Local variables
  type(xyt), pointer :: new_posn, next, last

  allocate(new_posn)
  new_posn = posn_vals
  new_posn%next => null()
  if (.NOT. associated(trajectory)) then
    trajectory => new_posn
  else
    ! Find the end of the trajectory and point it to the new leaf
    next => trajectory
    do while (associated(next))
      last => next
      next => next%next
    enddo
    last%next => new_posn
  endif
end subroutine append_posn
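That walk is O(current trajectory length), so building up an n-point trajectory costs on the order of n^2/2 pointer dereferences in total, and the gathered lists here run to millions of nodes. Two standard fixes would be to keep a separate tail pointer per trajectory, or to insert new points at the head in O(1) and reverse the list once before writing. A minimal sketch of the head-insertion variant, reusing the xyt type above (the routine name is mine, not one from the code):

    ! Sketch only: O(1) insertion at the head of the trajectory list.
    ! The list ends up in reverse time order, so it must be reversed once before writing.
    subroutine prepend_posn(trajectory, posn_vals)
      type(xyt), pointer :: trajectory
      type(xyt) :: posn_vals
      type(xyt), pointer :: new_posn
      allocate(new_posn)
      new_posn = posn_vals
      new_posn%next => trajectory   ! old head (or null if the list was empty)
      trajectory => new_posn        ! the new node becomes the head
    end subroutine prepend_posn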
Solutions I can think of are:
The 2nd solution above seems to have worked. There is no more "hang", just a 2-minute delay to reverse a list of 3,241,686 berg nodes and write a 557MB trajectory file, for a 1-year run of Alon's model that was timing out before.
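For reference, reversing such a list is a single O(n) pass over the nodes. A minimal sketch, again on the xyt type (the routine name is mine):

    ! Sketch only: in-place reversal of a singly linked trajectory list.
    subroutine reverse_traj(trajectory)
      type(xyt), pointer :: trajectory
      type(xyt), pointer :: prev, curr, next
      prev => null()
      curr => trajectory
      do while (associated(curr))
        next => curr%next
        curr%next => prev   ! flip this node's link to point backwards
        prev => curr
        curr => next
      enddo
      trajectory => prev    ! the old tail is the new head
    end subroutine reverse_traj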
I'll make a pull request for review. The git branch is user/nnz/fix_trajectory_append_hang
I tried to keep running the model to see what happens. It's still bad. Look at what happens to the ratio of Termination time to Main loop time and to the size of the trajectory file:
year 1, bergs# 14996, size of traj file: 526M
    Main loop      6004.337127   6004.338530   6004.337372  0.000195  0.735  0  0  127
    Termination    2159.423813   2159.425217   2159.424867  0.000225  0.264  0  0  127
year 2, bergs# 24650, size of traj file: 1.3G
    Main loop      6102.388496   6102.389873   6102.388686  0.000448  0.315  0  0  127
    Termination   13233.006461  13233.007851  13233.007665  0.000452  0.684  0  0  127
This is for io_layout=0,0. I'll try to set an io_layout for the bergs to see what happens.
One thing that confuses me is how come I was getting only a few minutes for the 1st year above!
I did a more careful analysis of the timings by putting clocks around the trajectory preparation (the gather onto the io root pe) and the actual writing to file. Here are the results for a 2-year run of MOM6_GOLD_SIS2 (Alon's experiment) with 24650 bergs:
All times below are the max over pes, in seconds.

                                              stdout file                         Main loop  Termination  traj write  traj prepare
  1 file per processor, correct order (old)   MOM6_GOLD_SIS2_bergs__5.o5041140       4283.5         60.7        54.6           0.6
  1 file per processor, reverse order         MOM6_GOLD_SIS2_bergs__2.o5041136       4279.9         60.6        54.7           0.6
  1 file per io tile (16 tiles), reverse      MOM6_GOLD_SIS2_bergs__3.o5041138       4279.2       8390.0       109.4        8278.4
  1 file per io tile (16 tiles), correct      MOM6_GOLD_SIS2_bergs__4.o5041139       4278.2       7972.5       109.3        7857.9

(correct order = first location first; reverse order = last location first. "traj write" and "traj prepare" are the Icebergs-traj write and Icebergs-traj prepare clocks.)
The above stdouts are at /lustre/f1/Niki.Zadeh/ulm_mom6_2015.06.03/MOM6_GOLD_SIS2_bergs/ncrc2.intel-prod/stdout/run/
So as you can see, the old way of writing a trajectory file per processor is the best in terms of performance, and the new way of preparing the trajectories (meaning sending and gathering the blobs on the io-tile root) is simply not acceptable for long model runs (Termination time > Main loop time). We have to redesign the way we write these trajectories if we want to use the io layout.
Until then, I updated my branch code to do it the old way, controlled by a new namelist variable in icebergs_nml:
force_all_pes_traj = .true.
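In the run directory that presumably ends up in the usual FMS input.nml, something like:

    &icebergs_nml
        force_all_pes_traj = .true.   ! write one trajectory file per pe, bypassing the io_layout gather
    /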
This issue has not been seen again. Closing.
A 1-year-long run of an ocean-ice model hangs at the end of the run. stdout indicates that the last thing the model was doing was writing the iceberg trajectories:
The above was on 128 cores interactive on gaea.
Note that this happens for the longer runs that have a lot of icebergs at the end (this one has 15170). Since the trajectories stay in memory and get written to file only at the end of the run, this may indicate an issue with the I/O buffer. Is there a way to increase that buffer?