Yes, I can try, but I admit, I'm not sure how?
In your run directory you should have a streams.ocean file. At the top should be something like:
<immutable_stream name="mesh"
type="none"
io_type="pnetcdf,cdf5"
filename_template="/lustre/atlas1/cli900/world-shared/cesm/inputdata/ocn/mpas-o/oRRS18to6v3/oRRS18to6v3.170111.nc"
/>
<immutable_stream name="input"
type="input"
io_type="pnetcdf,cdf5"
input_interval="initial_only"
filename_template="/lustre/atlas1/cli900/world-shared/cesm/inputdata/ocn/mpas-o/oRRS18to6v3/oRRS18to6v3.170111.nc"
/>
To test this, change the path in both filename templates to point to the ocean restart I mentioned above. It should be identical for both of these streams. Once you make these changes, copy the streams.ocean file to the SourceMods/src.mpaso folder in your case directory and submit the test.
OK, I copied run/streams.ocean to my.streams.ocean, made the change:
cori05% diff run/streams.ocean my.streams.ocean
6c6
< filename_template="/project/projectdirs/acme/inputdata/ocn/mpas-o/oRRS18to6v3/oRRS18to6v3.170111.nc"
---
> filename_template="/global/cscratch1/sd/lvroekel/mpaso.rst.0039-01-01_00000.nc"
12c12
< filename_template="/project/projectdirs/acme/inputdata/ocn/mpas-o/oRRS18to6v3/oRRS18to6v3.170111.nc"
---
> filename_template="/global/cscratch1/sd/lvroekel/mpascice.rst.0039-01-01_00000.nc"
And then just cp my.streams.ocean SourceMods/src.mpaso/streams.ocean
?
Hi @vanroekel , I am trying this on Titan. log.ocean.###.err files are being generated containing error messages like:
ERROR: Warning: abs(sum(h)-bottomDepth)>2m. Most likely, initial layerThickness does not match bottomDepth.
Ignoring that, model initialization dropped from 15 minutes to 3 minutes with this change in the initial condition file. The file read is now using the pnetcdf io type (instead of the netcdf4p io type), and is "fast". Looks like this solves the problem on Titan. Hopefully @ndkeen will see the same type of improvement on Cori-KNL.
@worleyph that error message means the initial sea surface height is unphysical. It may indicate that variables are read in incorrectly, or initialized to zero. Are there any statistics written to check that the model ran normally? Perhaps there is an output file written at the end that can be compared with a previous run.
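For reference, the check behind that message compares the vertical sum of layerThickness against bottomDepth with a 2 m tolerance. A minimal sketch of that kind of check (names follow the usual MPAS-Ocean conventions, but this is illustrative, not the actual source):
do iCell = 1, nCells
   ! accumulate the column thickness from the initial condition
   thicknessSum = 0.0_RKIND
   do k = 1, maxLevelCell(iCell)
      thicknessSum = thicknessSum + layerThickness(k, iCell)
   end do
   ! the summed layer thicknesses should reproduce the column depth to within 2 m
   if (abs(thicknessSum - bottomDepth(iCell)) > 2.0_RKIND) then
      write(0,*) 'initial layerThickness does not match bottomDepth at cell ', iCell
   end if
end do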
@ndkeen you are right on the process, assuming you are copying to your case directory's SourceMods/src.mpaso.
@mark-petersen, I won't be able to look at this for a few days. Perhaps someone else can look at this.
@worleyph Can you give me the path of your run directory? I can check if results are valid.
Sorry, I'm still having trouble verifying that I've done this right and that it's working -- should be soon. Ah, OK, the run I submitted last night seems to have failed with a segfault. I have been trying to clean up my additional timing/writes, but I don't think that would cause a fault here. I will try again.
Would I also need to edit streams.cice?
0000: forrtl: severe (174): SIGSEGV, segmentation fault occurred
0000: Image PC Routine Line Source
0000: acme.exe 0000000001DF7561 Unknown Unknown Unknown
0000: acme.exe (deleted 0000000001DF569B Unknown Unknown Unknown
0000: acme.exe 0000000001D9EEE4 Unknown Unknown Unknown
0000: acme.exe 0000000001D9ECF6 Unknown Unknown Unknown
0000: acme.exe (deleted 0000000001D21D06 Unknown Unknown Unknown
0000: acme.exe 0000000001D2D716 Unknown Unknown Unknown
0000: acme.exe (deleted 0000000001A0A490 Unknown Unknown Unknown
0000: acme.exe 0000000001E3F001 Unknown Unknown Unknown
0000: acme.exe 0000000001D6DBC7 Unknown Unknown Unknown
0000: acme.exe 0000000000E1839C ocn_vmix_mp_ocn_t 458 mpas_ocn_vmix.f90
0000: acme.exe 0000000000E1372D ocn_vmix_mp_ocn_v 572 mpas_ocn_vmix.f90
0000: acme.exe 0000000000B3D5F4 ocn_time_integrat 1721 mpas_ocn_time_integration_split.f90
0000: acme.exe 0000000000B33AA7 ocn_time_integrat 123 mpas_ocn_time_integration.f90
0000: acme.exe 0000000000B250B7 ocn_comp_mct_mp_o 916 ocn_comp_mct.f90
0000: acme.exe 000000000042AF5B component_mod_mp_ 705 component_mod.F90
0000: acme.exe 000000000040FBCB cesm_comp_mod_mp_ 3406 cesm_comp_mod.F90
0000: acme.exe 000000000042AC62 MAIN__ 68 cesm_driver.F90
0000: acme.exe (deleted 000000000040BB0E Unknown Unknown Unknown
0000: acme.exe 0000000001E17A20 Unknown Unknown Unknown
0000: acme.exe 000000000040B9F7 Unknown Unknown Unknown
@mark-petersen , the run directory is /lustre/atlas/scratch/worley/cli115/g18*pgi3/run . Note the wildcard in the case name - can't remember the full name. I think that it is open, but the log files typically are not readable by others, so that probably won't help. I won't be able to get on and change permissions until this evening. Sorry.
Running again, I can verify that it is beyond OCN init much quicker than it was before. Still need to get a complete timing value.
OK, it looks like the OCN init time has dropped drastically. However, I still see the same seg fault as posted above, which happens AFTER the model init is complete. The line in question is:
deallocate(A, B, C, tracersTemp)
/global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_intel.m35wit-may18.n066t01printsd
@ndkeen that is very strange. The deallocate line is very standard:
vi src/core_ocean/shared/mpas_ocn_vmix.F
real (kind=RKIND), dimension(:), allocatable :: A,B,C
real (kind=RKIND), dimension(:,:), allocatable :: tracersTemp
...
allocate(A(nVertLevels),B(nVertLevels),C(nVertLevels),tracersTemp(num_tracers,nVertLevels))
...
deallocate(A, B, C, tracersTemp)
It's hard to imagine what the problem is. I looked for this error message in your directory but didn't find it. Did it give any clue about the cause of the error? Is this run with debug on, or OpenMP on?
@ndkeen and @mark-petersen - keep in mind this is also the last executable statement in the routine, and it will sometimes get tagged for a seg fault for anything that happens on the return from the subroutine. So it could be the deallocate, but it could also be something else...
OK. It should be the case that if I go back to using the older (slower) netcdf files in streams.ocean, it will simulate one day. Are there any other ways in which changing those two netcdf files might "conflict" with something else? Does streams.cice also need to be edited?
Another attempt (where I adjusted PE layouts) also failed in what looks like same location -- stack traces are a little more noisy. case dir here:
/global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_intel.m35wit-may18.n050t01printse
I guess Pat's run with that change did not cause a problem.
I also have a clean checkout of current master and am trying the same launch script WITH the edited streams.ocean file.
@mark-petersen : you said you couldn't find the error message? It should be here:
/global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_intel.m35wit-may18.n066t01printsd/run/acme.log.170601-095013
Using master as of today, I made another attempt using the edits to streams.ocean, and I get the same error (same location).
/global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_intel.m38-jun1.n050t01
This one uses 50 nodes (32 OCN and 15 ICE/CPL) to help it get through the queue faster.
Try returning the error from allocate/deallocate. Like this:
integer:: ier
deallocate(A, B, C, tracersTemp, stat=ier)
Although as Phil said the deallocate itself is probably not the problem.
OK, I'm trying what Rob suggested (making the change in the two deallocates that are similar). I'm also trying a rebuild using debug flags, as well as a run with the "slow" files to verify that those still work (with the June 1st master).
Using deallocate(A, B, C, tracersTemp, stat=ier)
did not have any different outcome.
And fwiw, I also just tried G60to30.T62_oEC60to30v3.GMPAS-IAF to make sure restarts work. They do.
@ndkeen I think Rob's suggestion of adding the error flag to deallocate was to also check the value of ier, to see whether the issue was actually an error deallocating or whether the error was just flagging the last statement of the routine due to a seg fault on exit.
OK, I am trying that.
I also see a failure with GNU. I built with some debug flags, and using ATP I get a different-looking stack. I'm not certain whether this is the same issue or not.
ATP Stack walkback for Rank 1622 starting:
_start@start.S:122
__libc_start_main@libc-start.c:285
main@0x40bb0d
MAIN__@cesm_driver.F90:63
cesm_comp_mod_mp_cesm_init_@cesm_comp_mod.F90:1184
component_mod_mp_component_init_cc_@component_mod.F90:239
ocn_comp_mct_mp_ocn_init_mct_@ocn_comp_mct.f90:529
ocn_core_mp_ocn_core_init_@mpas_ocn_core.f90:89
ocn_forward_mode_mp_ocn_forward_mode_init_@mpas_ocn_forward_mode.f90:280
ocn_time_integration_split_mp_ocn_time_integration_split_init_@mpas_ocn_time_integration_split.f90:2011
ATP Stack walkback for Rank 1622 done
The last line here is 2011 -- I assume it's a divide by zero? Copy/pasting the code and comments, as it looks like someone was wary of layerThicknessSum being 0.
! thicknessSum is initialized outside the loop because on land boundaries
! maxLevelEdgeTop=0, but I want to initialize thicknessSum with a
! nonzero value to avoid a NaN.
layerThicknessEdge1 = 0.5_RKIND*( layerThickness(1,cell1) + layerThickness(1,cell2) )
normalThicknessFluxSum = layerThicknessEdge1 * normalVelocity(1,iEdge)
layerThicknessSum = layerThicknessEdge1
do k=2, maxLevelEdgeTop(iEdge)
! ocn_diagnostic_solve has not yet been called, so compute hEdge
! just for this edge.
layerThicknessEdge1 = 0.5_RKIND*( layerThickness(k,cell1) + layerThickness(k,cell2) )
normalThicknessFluxSum = normalThicknessFluxSum &
+ layerThicknessEdge1 * normalVelocity(k,iEdge)
layerThicknessSum = layerThicknessSum + layerThicknessEdge1
enddo
normalBarotropicVelocity(iEdge) = normalThicknessFluxSum / layerThicknessSum ! ndk line 2011
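Just to illustrate the suspicion, a guard around that division would look something like the following (a hypothetical sketch to show what the divide-by-zero would mean, not a proposed fix):
if (layerThicknessSum > 0.0_RKIND) then
   normalBarotropicVelocity(iEdge) = normalThicknessFluxSum / layerThicknessSum
else
   ! layerThicknessSum should only end up zero here if layerThickness itself is
   ! zero at both cells, e.g. if the initial condition was read incorrectly
   normalBarotropicVelocity(iEdge) = 0.0_RKIND
   write(0,*) 'zero layerThicknessSum at edge ', iEdge
end if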
Compiling with GNU, I still see the error in the deallocate statement. To me, this points to a "memory error". With all of these runs there is a lot of stack trace noise, and I try to find one that shows the finest detail. Note the write of ier=0, so this error must be from the second time it goes through the function? And, yes, I see that it is failing at a different location than the error using Intel above -- it could be two different things. Ah -- but I also built/ran with DEBUG GNU, and that shows the same FP error as above.
0159: ocn_vel_vmix_tend_implicit deallocate ier= 0
0159: *** Error in `/global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_gnu.m38-jun1.n050t01.agnu/bld/acme.exe': corrupted double-linked list: 0x0000000020779490 ***
....
0159: at /home/abuild/rpmbuild/BUILD/glibc-2.19/malloc/malloc.c:4029
0159: #6 0xdc462e in __ocn_vmix_MOD_ocn_tracer_vmix_tend_implicit
0159: at /global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_gnu.m38-jun1.n050t01.agnu/bld/ocn/source/core_ocean/shared/mpas_ocn_vmix.f90:459
0159: #7 0xdc58a5 in __ocn_vmix_MOD_ocn_vmix_implicit
0159: at /global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_gnu.m38-jun1.n050t01.agnu/bld/ocn/source/core_ocean/shared/mpas_ocn_vmix.f90:574
0159: #8 0xa2cdc0 in __ocn_time_integration_split_MOD_ocn_time_integrator_split
0159: at /global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_gnu.m38-jun1.n050t01.agnu/bld/ocn/source/core_ocean/mode_forward/mpas_ocn_time_integration_split.f90:1721
...
Here is the stack for the DEBUG GNU build:
0159: Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
0159:
0159: Backtrace for this error:
0159: #0 0x1aff48f in ???
0159: at /home/abuild/rpmbuild/BUILD/glibc-2.19/nptl/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
0159: #1 0xd7f8e5 in __ocn_time_integration_split_MOD_ocn_time_integration_split_init
0159: at /global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_gnu.m38-jun1.n050t01.bgnudebug/bld/ocn/source/core_ocean/mode_forward/mpas_ocn_time_integration_split.f90:2011
0159: #2 0x143c6e7 in __ocn_forward_mode_MOD_ocn_forward_mode_init
0159: at /global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_gnu.m38-jun1.n050t01.bgnudebug/bld/ocn/source/core_ocean/mode_forward/mpas_ocn_forward_mode.f90:280
I think I see a problem with what I've been trying. Editing the streams.ocean file and using the SourceMods directory is new to me. I did copy/paste what I did and asked about cice. I now suspect that what you wanted me to try was changing only streams.ocean and using ONLY the new mpaso restart file -- replacing that string in two places, one for mesh and one for input. NOT what I did, which was to use the mpascice rst file in one of those. I'm trying again -- should run in debug.
To follow up on the deallocate error flag, yes after each line where you return the error, add a line like:
if(ier /= 0) write(0,*) "deallocate error", ier
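Putting the two pieces together, the instrumented block would look roughly like this (a diagnostic sketch only; ier is a local integer added for the test):
integer :: ier
...
! return the status instead of aborting inside deallocate, then report it, to
! separate a real deallocate failure from a seg fault that merely gets
! attributed to the routine's last executable statement
deallocate(A, B, C, tracersTemp, stat=ier)
if (ier /= 0) write(0,*) 'deallocate error in ocn_vmix, stat = ', ier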
Trying the correct edit to streams.ocean, I can now get the job to complete one day. The total time in Init is now 459 secs. This is for a 50-node run, where OCN has 32 nodes. Previous runs with this layout were measuring 6500 seconds for Init. So this is a 13x speedup. Also this is with June 1st master.
Sorry about the mistake, I acted too quickly. I saw 2 new files and 2 locations that needed to be changed, so that's what I did.
Should we copy these files into /project and make this change?
So you've reduced spin-up time from nearly 2 hrs to 8 min? NICE!
Well, the "Init Time", yes. And that was for 50-nodes. Previously, the Init time was somewhat scaling with nodes, so I'd like to also see a small Init time when I use more nodes.
I had several other jobs stop after running out of time -- and then realized I still had DEBUG=TRUE on from a previous experiment. Resubmitting. But, it also looks like we no longer see the same FP issues I described above -- surely related to my error in streams.ocean.
Regarding my earlier suggestion to "copy the files to /project" -- I really meant to the servers -- I was asking: is this an acceptable solution, or is this just a test to verify that this is the problem?
@mark-petersen , I changed permissions so that the directory /lustre/atlas1/cli115/scratch/worley/g18to6.T62_oRRS18to6v3.GMPAS-IAF.titan_pgi3/run is readable. I don't know whether this is permanent or not. If you try and can't get in, I'll copy this elsewhere next. I am going to run some more experiments in this directory, and I moved the output of the previous run, with the error messages, to
/lustre/atlas1/cli115/scratch/worley/g18to6.T62_oRRS18to6v3.GMPAS-IAF.titan_pgi3/run/170601-014551
@ndkeen , did your run with the new ocean initialization file generate log.ocean.####.err files?
Yes, there are always many log.ocean files. All with the same message:
ERROR: Warning: abs(sum(h)-bottomDepth)>2m. Most likely, initial layerThickness does not match bottomDepth.
I think that this needs to be resolved before starting to use these files for ocean (and sea ice) initialization. Perhaps there is version skew between master and the version of the code that @vanroekel and @mark-petersen are using?
OK, sounds good.
For a hires coupled case that uses 319 nodes total (a smaller overall PE layout than reported above, though still uses 150 nodes each for ICE and OCN), the "Init" time dropped from 8015 seconds to 1103 seconds.
Can someone explain again what the issue was? The original files were in a different netcdf format that caused ACME to read them using a different algorithm which was very slow? Could there be other files where this might happen?
I recall at one point that there were some MPAS folks requesting pnetcdf 1.5 be installed on Cori. I was unable to convince NERSC to do that, but I can try to help get around whatever the issue was. We are currently using pnetcdf 1.7 and have 1.8 installed (which I tried, but nothing different happens).
I can now use the debug Q to run hires G-case tests (at least for one day with no restarts). I might start a different github issue regarding the failing restarts with hires G case.
@ndkeen and @worleyph for your restart failures, are you running G-cases? If so, can you check that PIO_TYPENAME = netcdf for all data components and the coupler? I had G-case failures unless PIO_TYPENAME was changed for these components (see #1451). Note that netcdf was the default for G cases in CIME5.1.
I moved the restart conversation to https://github.com/ACME-Climate/ACME/issues/1574
@ndkeen regarding pnetcdf, our restart file violates the cdf2 file size constraints (> 4 GB for one variable). In pnetcdf/1.5.0 there was a bug that allowed us to write such files in spite of this (there was no variable size check). If master is used, there is no need to have version 1.5.0. The catch is that you need to use files produced by pnetcdf > 1.5.0 (the restarts I link to above). I can send the necessary streams changes for sea ice if you'd like to add tests for that.
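To make the size constraint concrete, here is a back-of-the-envelope check of the CDF-2 per-variable limit. The mesh and tracer counts below are hypothetical placeholders, not the actual oRRS18to6v3 dimensions:
program cdf2_limit_check
   implicit none
   integer, parameter :: i8 = selected_int_kind(18)
   integer(i8), parameter :: nCells      = 3700000_i8         ! placeholder cell count
   integer(i8), parameter :: nVertLevels = 80_i8              ! placeholder level count
   integer(i8), parameter :: nTracers    = 8_i8               ! placeholder tracer count
   integer(i8), parameter :: cdf2Limit   = 4_i8 * 1024_i8**3  ! ~4 GiB per variable
   integer(i8) :: varBytes

   ! one bundled double-precision tracers variable: nTracers x nVertLevels x nCells
   varBytes = nTracers * nVertLevels * nCells * 8_i8
   print *, 'variable size (bytes):', varBytes
   if (varBytes > cdf2Limit) then
      print *, 'exceeds the CDF-2 per-variable limit; a CDF-5 file (io_type "pnetcdf,cdf5") is needed'
   else
      print *, 'fits within the CDF-2 limit'
   end if
end program cdf2_limit_check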
@vanroekel , this failure is when creating and writing to a new restart file, not writing an old one?
yes, this was in writing a new cpl restart file.
Still confused. I am using both your ocean and sea ice restart files, and the problem is showing up in the cpl restart write. Not sure what you are telling me. Or does this have nothing to do with the cpl restart write error?
For me, in a G-case, when I set PIO_TYPENAME = netcdf for cpl in env_run.xml, I can successfully write coupler restarts.
Right -- I also see the restarts working in a G case when I do ./xmlchange CPL_PIO_TYPENAME=netcdf, which I note in github #1574. A coupled test has yet to run.
FYI, I also updated the streams.cice file to point to the file Luke provided and tried another hires G case. It runs OK, but I don't see any noticeable performance diff AND I still get the log.ocean* files. Luke mentioned there might be another mod to make in one of these files to make it more consistent.
@ndkeen - is this issue resolved?
Yes. Now there may still be an issue that the PIO code is not waving its arms more frantically when it drops down to a lesser (serial) method of reading large files. However, using MPAS restart files that were written with a newer version of pnetcdf seems to have fixed things for us and it's better going forward.
Generic issue, but I wanted a place to store notes, and we will surely have some code changes to help diagnose/address this.
I'm running on cori-knl, but I understand others have seen similar slow init times. I'm still trying to figure out whether this is highly dependent on the total number of MPI tasks, the number of MPI tasks for a given component, other settings, etc. The following is for a G-case where I'm giving 150 nodes to OCN and 150 to ICE+CPL. On each node, I'm using 64 MPI tasks (pure MPI mode) for a total of 9600 tasks on each of the MPAS components. I also see even slower init times for the coupled hi-res problem, which uses a similar MPAS setup (afaik) and the same number of nodes for each component, but more total nodes in the job -- however, as it fails in restart, I don't have complete timing files.
The timer comp_init_cc_ice reports 1467, which would indicate most of the time in OCN init. And o_i:PIO:pio_read_nfdarray_double has 898 seconds from the following file:
/global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_intel.m34-may17.n302t01/run/timing.170519-000252/model_timing.00000
Assuming the value of pnetcdf is being honored for the PIO_TYPENAME that is set in env_run.xml, then it looks like this is the only call that could be causing the time. I have some MCT files that @worleyph has modified to add more timers (though I need to backport that with the incoming MCT 2.0 files). I have jobs in the Q currently.
/global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_intel.m34-may17.n302t01
fyi, when I tried a 32-node hi-res problem, giving OCN 15 nodes and ICE+CPL 15 nodes, I did a top on each compute node while it was in init. The first 15 nodes are using about 53 GB, while the remaining nodes are using 5 GB.