E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

Slow init time in OCN for hires problems (g18to6.T62_oRRS18to6v3.GMPAS-IAF) #1547

Closed ndkeen closed 7 years ago

ndkeen commented 7 years ago

This is a generic issue, but I wanted a place to store notes, and we will surely have some code changes to help diagnose/address it.

I'm running on cori-knl, but I understand others have seen similar slow init times. I'm still trying to figure out how strongly this depends on the total number of MPI tasks, the number of tasks for a given component, other settings, etc. The following is for a G-case where I'm giving 150 nodes to OCN and 150 to ICE+CPL, using 64 MPI tasks per node (pure MPI mode) for a total of 9600 tasks on each of the MPAS components. I also see even slower init times for a coupled hi-res problem that uses a similar MPAS setup (afaik) and the same number of nodes for each component, but more total nodes in the job -- however, as it fails in restart, I don't have complete timing files.

  component       comp_pes    root_pe   tasks  x threads instances (stride) 
  ---------        ------     -------   ------   ------  ---------  ------  
  cpl = cpl        9600        9600     9600   x 1       1      (1     ) 
  atm = datm       128         19200    128    x 1       1      (1     ) 
  lnd = slnd       128         19200    128    x 1       1      (1     ) 
  ice = mpascice   9600        9600     9600   x 1       1      (1     ) 
  ocn = mpaso      9600        0        9600   x 1       1      (1     ) 
  rof = drof       128         19200    128    x 1       1      (1     ) 
  glc = sglc       128         19200    128    x 1       1      (1     ) 
  wav = swav       128         19200    128    x 1       1      (1     ) 
  esp = sesp       1           0        1      x 1       1      (1     ) 
...

    Init Time   :    1969.944 seconds 
    Run Time    :    1397.276 seconds      279.455 seconds/day 

The timer comp_init_cc_ice reports 1467 seconds, which would indicate that most of the time is in OCN init. And o_i:PIO:pio_read_nfdarray_double shows 898 seconds in the following file:

/global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_intel.m34-may17.n302t01/run/timing.170519-000252/model_timing.00000

Assuming the value of pnetcdf set for PIO_TYPENAME in env_run.xml is being honored, it looks like this is the only call that could be responsible for that time:

          ierr=nfmpi_get_vara_all( File%fh,varDesc%varid, &
               start, &
               count, &
               IOBUF,iodesc%Read%n_ElemTYPE, &
               iodesc%Read%ElemTYPE)

I have some MCT files that @worleyph has modified to add more timers (though I need to backport that with the incoming MCT 2.0 files). I have jobs in the Q currently.

/global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_intel.m34-may17.n302t01

fyi, when I tried a 32-node hi-res problem, giving OCN 15 nodes and ICE+CPL 15 nodes, I ran top on each compute node while it was in init. The first 15 nodes were using about 53 GB each, while the remaining nodes were using 5 GB.

ndkeen commented 7 years ago

I ran with additional timers suggested by Pat. I have these 3 runs: 66 nodes, 130 nodes, and 302 nodes. The 66-node run (32 nodes OCN, 32 nodes ICE+CPL) uses a small number of nodes for the hires problem, but it still managed to simulate one day (1318 secs) and finish. The init time was ~2 hours! Here is the relevant portion of the profile for the OCN init:

        "comp_init_cc_ocn"                                    -        1    -    5367.673340  5367.673340  5367.673340         0.000001 
          "o_i:mpas-o_init1"                                  -        1    -     155.868896   155.868896   155.868896         0.000001 
*           "o_i:PIO:PIO_openfile"                            -        5    -      88.725006    23.301487     0.117354         0.000006 
            "o_i:PIO:PIO_closefile"                           -        4    -       1.705017     1.132979     0.009023         0.000005 
          "o_i:mpas-o_init2"                                  -        1    -    5195.070801  5195.070801  5195.070801         0.000001 
            "o_i:io_read"                                     -        2    -    5192.303711  5192.303711     0.000112         0.000002 
*             "o_i:PIO:PIO_initdecomp_dof"                    -       20    -       4.456758     1.344152     0.026214         0.000023 
                "o_i:PIO:pio_rearrange_create_box"            -       20    -       4.043386     1.298635     0.025924         0.000023 
*             "o_i:PIO:pio_read_darray"                       -       55    -    5163.808594  3063.560303     0.003162         0.000064 
                "o_i:PIO:pio_read_nf"                         -       55    -    4346.721191  2594.627197     0.002382         0.000064 
                  "o_i:PIO:pio_read_nfdarray_int"             -       23    -       0.589408     0.161079     0.002902         0.000027 
                  "o_i:PIO:pio_read_nfdarray_double"          -       32    -    4346.088379  2594.627197     0.002371         0.000037 
                "o_i:PIO:pio_rearrange_read"                  -       55    -     816.529358   468.899292     0.000117         0.000064 
                  "o_i:PIO:pio_rearrange_io2comp_int"         -       23    -       1.381380     0.229733     0.000109         0.000027 
                    "o_i:PIO:swapm_box_rear_io2comp_int"      -       23    -       1.351935     0.229563     0.000045         0.000027 
                  "o_i:PIO:pio_rearrange_io2comp_double"      -       32    -     815.136353   468.899261     0.051147         0.000037 
                    "o_i:PIO:swapm_box_rear_io2comp_double"   -       32    -     815.106873   468.897614     0.051082         0.000037 
              "o_i:PIO:pio_get_vara_1d_double"                -        1    -       0.003191     0.003191     0.003191         0.000001 
              "o_i:analysis_bootstrap"                        -        1    -       0.000184     0.000184     0.000184         0.000001 
            "o_i:reset_io_alarms"                             -        2    -       0.005811     0.004992     0.000819         0.000002 
            "o_i:diagnostic solve"                            -        1    -       0.246344     0.246344     0.246344         0.000001 
              "o_i:equation of state"                         -        3    -       0.019708     0.010630     0.004524         0.000004 
            "o_i:analysis_init"                               -        1    -       0.000260     0.000260     0.000260         0.000001 
          "o_i:mpas-o_mct_init"                               -        1    -       1.539426     1.539426     1.539426         0.000001 
          "o_i:land_ice_build_arrays"                         -        1    -       0.000007     0.000007     0.000007         0.000001 
          "o_i:frazil"                                        -        1    -       0.004920     0.004920     0.004920         0.000001 

Again this points to o_i:PIO:pio_read_nfdarray_double, with an enormous difference between the min and max times -- but I'm not sure if that includes ranks not doing OCN init?

Looking at comp_init_cc_ocn in all 3 model_timing.0000 files, it's clear that the init time is affected by the number of tasks:

15 nodes in OCN: 5368 seconds
30 nodes in OCN: 2174 seconds
150 nodes in OCN: 1107 seconds

Case dirs:
66 nodes (32 OCN, 32 ICE+CPL) /global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_intel.m35wit-may18.n066t01
130 nodes (64 OCN, 64 ICE+CPL) /global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_intel.m35wit-may18.n130t01
302 nodes (150 OCN, 150 ICE+CPL) /global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_intel.m35wit-may18.n302t01

worleyph commented 7 years ago

@ndk , these data are from a single process, so the difference between max and min refers to different calls of this timer (called 32 times). If you look at the stats file, the timer is called only from ocean processes, and all processes in the ocean call this timer.

ndkeen commented 7 years ago

Oh right, because I'm looking at the profile of rank 0 only.

jonbob commented 7 years ago

@ndkeen - what version of the code are you using? We made changes to the pio handling of ocn high-res files that went to master on 5/16. I just want to make sure you have those changes.

ndkeen commented 7 years ago

Using master of May 18th. But I also tried next as of May 19th (with new MCT files) and I see similar performance. I have some new data and will report soon.

ndkeen commented 7 years ago

OK I think I have a pretty good clue about what's going on.

In cime/src/externals/pio1/pio/pionfread_mod.F90.in, the function read_nfdarray_{TYPE} seems to be where a lot of time is going (based on our own profiles). When using PNETCDF (which we are), this function really does only one thing: call nfmpi_get_vara_all, which is a PNETCDF library call. So I put timer start/stops around just that call and wrote the values to standard out for all ranks/calls (a sketch of the wrapper is at the end of this comment). I don't see any major time there! So looking again at read_nfdarray_, there is one other potential problem at the end of the function:

call check_netcdf(File, ierr,__PIO_FILE__,__LINE__);

I commented out this line and re-ran. The overall init time is still very slow; however, the time measured in o_i:PIO:pio_read_nfdarray_double (reported above) is now near 0, and a different timer is now very slow. I suspect this means that check_netcdf is doing something bad and whoever calls it (or is first to call it?) will suffer. This is a guess! I would prefer to make measurements to verify, but someone else may know more about this function?

With that line commented, there is now 4200 seconds in timer o_i:PIO:swapm_box_rear_io2comp_double
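
For reference, the timer wrapping I put around the pnetcdf call itself looked roughly like this (a sketch only; t0, t1, and my_rank are local variables I added for the experiment, not existing PIO variables, and MPI_Wtime stands in for the exact timers I used):

      ! added instrumentation in read_nfdarray_{TYPE} (pionfread_mod.F90.in);
      ! only the t0/t1 lines and the write are new -- the nfmpi_get_vara_all
      ! call is the existing one
      t0 = MPI_Wtime()
      ierr = nfmpi_get_vara_all( File%fh, varDesc%varid, &
               start, &
               count, &
               IOBUF, iodesc%Read%n_ElemTYPE, &
               iodesc%Read%ElemTYPE)
      t1 = MPI_Wtime()
      write(*,*) 'read_nfdarray get_vara_all: rank, secs = ', my_rank, t1 - t0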

worleyph commented 7 years ago

@ndk, in an earlier investigation into init slowness in hires runs on cori-knl, it appeared that the first time a nonlocal MPI operation (even one implemented with point-to-point) is called can be very slow, as the internal MPI space is expanded (?) to handle new sources and destinations. I had an alltoall benchmark that showed something like this, though not as bad as what we saw in ACME. I wonder if there is some "first time" MPI happening within one of these calls. Note that I gave up after this diagnosis, as there is nothing we can do about it except ask the MPI developers to look at the issue. Perhaps time for a help ticket?
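
For reference, a minimal stand-alone sketch of that kind of benchmark (this is not the original benchmark; it just times each MPI_Alltoall call separately, so a large "first call" connection-setup cost inside MPI would stand out):

   ! hypothetical stand-alone benchmark, not the original one: time each of a
   ! few MPI_Alltoall calls separately; a large MPI connection-setup cost
   ! would show up as call 1 being much slower than the rest
   program first_call_alltoall
      use mpi
      implicit none
      integer, parameter :: ncalls = 5, nwords = 1024
      integer :: rank, nprocs, ierr, icall
      double precision :: t0, t1
      double precision, allocatable :: sendbuf(:), recvbuf(:)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      allocate(sendbuf(nwords*nprocs), recvbuf(nwords*nprocs))
      sendbuf = dble(rank)

      do icall = 1, ncalls
         call MPI_Barrier(MPI_COMM_WORLD, ierr)   ! line everyone up before timing
         t0 = MPI_Wtime()
         call MPI_Alltoall(sendbuf, nwords, MPI_DOUBLE_PRECISION, &
                           recvbuf, nwords, MPI_DOUBLE_PRECISION, &
                           MPI_COMM_WORLD, ierr)
         t1 = MPI_Wtime()
         if (rank == 0) write(*,'(a,i2,a,f10.4,a)') 'alltoall call ', icall, ': ', t1-t0, ' s'
      end do

      deallocate(sendbuf, recvbuf)
      call MPI_Finalize(ierr)
   end program first_call_alltoall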

ndkeen commented 7 years ago

If we can create a small benchmark, we can certainly create a ticket.

worleyph commented 7 years ago

I'll dig it out again when I get the chance. You might also rerun with an mpi_barrier surrounded by timers just before the expensive timer, to test whether some process is slow to arrive there. Make sure to use the same communicator as is relevant to the rest of the subroutine call.

worleyph commented 7 years ago

Might also add a barrier after the slow timer, to stop this from "contaminating" the other timers.
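
Something along these lines would do it (a sketch with a hypothetical helper name and plain MPI_Wtime; the real instrumentation would use the GPTL timers already in the code):

   ! hypothetical helper, not existing model code: report how long this rank
   ! waits at a barrier, so a late arriver shows up as a small wait on the
   ! slow rank and a large wait on everyone else
   subroutine timed_barrier(label, comm)
      use mpi
      implicit none
      character(len=*), intent(in) :: label
      integer, intent(in) :: comm
      integer :: rank, ierror
      double precision :: t0, t1

      call MPI_Comm_rank(comm, rank, ierror)
      t0 = MPI_Wtime()
      call MPI_Barrier(comm, ierror)
      t1 = MPI_Wtime()
      write(*,'(a,a,i7,a,f10.3,a)') trim(label), ' rank ', rank, ' waited ', t1 - t0, ' s'
   end subroutine timed_barrier

For example, call timed_barrier('pre_swapm', IOsystem%union_comm) immediately before the suspect swapm and again (with a different label) immediately after it, using the same communicator as the swapm itself.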

ndkeen commented 7 years ago

I have been adding timers and a barrier, and you are right -- the bad actor is not where I thought. I added a barrier before a swapm call that is known to be slow, and now the time shows up in the barrier.

worleyph commented 7 years ago

The barrier is also an MPI command. Check the stats file to see whether all processes are slow-ish (the min is also significant and/or the average is not too different from the min), or whether the min is small and one process is late to arrive.

ndkeen commented 7 years ago

Yes, I know mpi_barrier() is an MPI command. I'm adding my own timers and barriers to the code: call MPI_BARRIER(IOsystem%union_comm, ierror)

One thing is odd: in env_run.xml we have <entry id="PIO_REARR_COMM_TYPE" value="p2p">, yet execution clearly goes through the collective branch, i.e. if (pio_option == COLLECTIVE), and uses ALLTOALLW. There is some logic there that could be overriding the p2p setting. Should it be using collectives or point-to-point?

worleyph commented 7 years ago

I didn't mean to imply that you didn't know this was an MPI command. The issue is that if the problem is an MPI start-up cost, then it could show up in the barrier call itself.

worleyph commented 7 years ago

I don't think it should be calling alltoallw. How do you know that is what is happening?

ndkeen commented 7 years ago

Because I thought my print statements said it was there, but then I realized I had missed a critical else, and now I see it is not calling that alltoallw. I am also trying a test with coll just to see if that behaves differently.

I am trying to analyze my raw results. I do see that there are about 30 calls to swapm that are quick and then calls that are slow/fast/slow, then really slow.

OK, during init, on the 51st and 52nd calls to the rearrange (which calls swapm), all OCN ranks have a major issue at a barrier I placed directly before swapm -- some more than others. Now I need to find out what is happening just before this call. However, I wanted to point out something that might mean more to Pat. On every line I write out, I also include a few integers. Looking at just the 52nd call: all 2048 OCN ranks make the call. There are 32 of these 2048 ranks (one per node?) that have a large value for niodof: ndof = 179400, niodof = 11541300, nrecvs = 2048, while all the other ranks have niodof = 1 and nrecvs = 0. In fact, it looks like these fall on the rank boundary plus 1 (i.e. rank 1, then rank 65, etc.) -- I recall someone saying this is where the PIO tasks live. For the ranks with a large value of niodof, the barrier takes 30 seconds (still large), except for rank 1, which spends near-zero time, while all the other ranks spend 10 minutes in this single barrier. All the other times this barrier before swapm is called (the first 50 times), the barrier times are small -- a max of about 0.2 seconds. However, all the other calls have much smaller values of niodof as well.

ndkeen commented 7 years ago

My last comment was a little in the weeds, but I have been doing more trial-and-error testing as well as adding more timing. I do see that the check_netcdf call is certainly one issue, at 10 minutes for all non-IO procs and only 30 seconds for the IO procs (I put a barrier before/after it). However, that's only 10 minutes. This is as much as I can do in the debug Q; I have to let the job run longer to get more timing, and I have runs in the Q.

worleyph commented 7 years ago

I'm going to try to get on tonight. Are the runs in the performance archive? If so, what case name and $lid should I look at? If not, can you put the raw timing data some place that I can access? Thanks.

ndkeen commented 7 years ago

In /global/cscratch1/sd/ndk/acme_scratch/cori-knl I have several cases all very similar. They have "prints" in the name.

g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_intel.m35wit-may18.n066t01printsd

worleyph commented 7 years ago

@ndk, fyi, I ran the latest master on Titan with your compset, resolution, PE layout, user_nl_mpascice, user_nl_mpaso, user_nl_cpl, and PIO stride and numpes, and the initialization cost is around 800 seconds (comp_init_cc_ocn is 675 seconds).

Like your results from Cori-KNL, comp_init_cc_ocn cost is also primarily in o_i:PIO:pio_read_nfdarray_double (594 seconds), but it does not appear to be in check_netcdf (and so not in the MPI_BCAST in check_netcdf). I'm still doing some experiments to clarify exactly where the cost is coming from, but it does not appear that Titan will be directly useful in diagnosing the performance issues on Cori-KNL. I'll move this same code version and test to Cori-KNL when I get the chance, but it is probably almost identical to the experiments that you are running.

ndkeen commented 7 years ago

Well, that's good news. I would think that means we should be able to get rid of this extra time on this machine. So far, the only runs I've had finish were submitted to the debug Q, so they only ran 30 minutes, wrote some diagnostic data to acme.log, and stopped. I did add another barrier or two and am waiting for more timer data.

Pat and I have also been exchanging emails.

I'm going to be offline for a while...

ndkeen commented 7 years ago

One thing I just noticed: as noted, I give OCN 32 nodes, ICE/CPL 32 nodes, and then all other components get 2 nodes. I had been neglecting to look at what is happening on those 2 nodes. Here are the model_timing files -- 3 are written: one for OCN, one for ICE/CPL, and the last for the rest (that's my understanding). The OCN init is large for two of them, but not the other. And then why is the GLC init so large for model_timing.4096?

cori11% grep comp_init_cc_ocn model_timing.*
model_timing.0000:        "comp_init_cc_ocn"                                              -        1    -    5538.289062  5538.289062  5538.289062         0.000001 
model_timing.2048:        "comp_init_cc_ocn"                                            -        1    -    5525.876953  5525.876953  5525.876953         0.000001 
model_timing.4096:        "comp_init_cc_ocn"                                                -        1    -       0.000008     0.000008     0.000008         0.000001 

cori11% grep comp_init_cc_glc model_timing.*
model_timing.0000:        "comp_init_cc_glc"                                              -        1    -       0.000004     0.000004     0.000004         0.000001 
model_timing.2048:        "comp_init_cc_glc"                                            -        1    -       0.007821     0.007821     0.007821         0.000001 
model_timing.4096:        "comp_init_cc_glc"                                                -        1    -    5633.349121  5633.349121  5633.349121         0.000001 

Am I doing anything terrible here with the PE layout?

  component       comp_pes    root_pe   tasks  x threads instances (stride) 
  ---------        ------     -------   ------   ------  ---------  ------  
  cpl = cpl        2048        2048     2048   x 1       1      (1     ) 
  atm = datm       128         4096     128    x 1       1      (1     ) 
  lnd = slnd       128         4096     128    x 1       1      (1     ) 
  ice = mpascice   2048        2048     2048   x 1       1      (1     ) 
  ocn = mpaso      2048        0        2048   x 1       1      (1     ) 
  rof = drof       128         4096     128    x 1       1      (1     ) 
  glc = sglc       128         4096     128    x 1       1      (1     ) 
  wav = swav       128         4096     128    x 1       1      (1     ) 
  esp = sesp       1           0        1      x 1       1      (1     ) 
worleyph commented 7 years ago

@ndkeen , I'll start looking at your recent data, but I also had a successful run in which I used 2048x1 for all components and stacked them, to eliminate any interpretation issues from a concurrent run. The initialization time for this setup was 4118 seconds, so "only" 68 minutes. Not sure why it was faster, but I am going to look into this and see if I can tease out anything. FYI.

worleyph commented 7 years ago

@ndk, the large comp_init_cc_glc for process 4096 is because this process is not running the ocean or ice, so it passes through comp_init_cc_ocn and comp_init_cc_ice immediately and is sitting in comp_init_cc_glc waiting for the others to finish (since it requires the CPL component in order to complete its initialization, and CPL is busy doing OCN and ICE before it gets to GLC). Initialization is not really concurrent even though the components are assigned to the different processes - concurrency is only useful for the timestepping loop.

worleyph commented 7 years ago

@ndk, first glance at my recent run indicates that almost all of the cost is (still) in o_i:PIO:pio_read_nfdarray_double on the OCN I/O processes. The other processes are stuck at the added timer/barrier I named pre_pio_read_nfdarray_double_chk_bar, waiting for the I/O processes to finish reading the data (so not inside check_netcdf). It doesn't appear to be for all PIO reads however (some are fast). I'd like to figure out which reads are slow. I'll try adding some more timers and resubmitting.

ndkeen commented 7 years ago

ah, so I should not try tracking down why comp_init_cc_glc is slow?

When you say 4118 seconds, was that still on titan?

Also, I just realized that I was NOT using the following 3 files -- I copied them to externals/mct/mct, where they sat uncompiled.

cime/src/drivers/mct/main/cesm_comp_mod.F90
cime/src/drivers/mct/main/component_mod.F90
cime/src/drivers/mct/main/seq_map_mod.F90

I'm now trying them.

worleyph commented 7 years ago

ah, so I should not try tracking down why comp_init_cc_glc is slow?

No. It is not a mystery.

When you say 4118 seconds, was that still on titan?

No. This is Cori-KNL, using the new PE layout. Titan (with the original PE layout) was taking only 15 minutes for the initialization.

worleyph commented 7 years ago

Here's a question ... ( @jonbob ? @ndkeen ?),

ocean initialization (based on some added timers) is using:

   case(pio_iotype_netcdf4p)
      ierr= nf90_get_var(File%fh, vardesc%varid, iobuf, start=int(start),count=int(count))

while sea ice initialization is using:

   case(pio_iotype_pnetcdf)
      ierr=nfmpi_get_vara_all( File%fh,varDesc%varid, &
           start, &
           count, &
           IOBUF,iodesc%Read%n_ElemTYPE, &
           iodesc%Read%ElemTYPE)

so NETCDF4 for OCN and PNETCDF for ICE.

Is this intended? The same is true on Titan (where o_i:PIO:pio_read_nfdarray_double costs 10 minutes instead of 60 minutes, so this by itself doesn't really explain the high cost), but I am curious as to why the different PIO layers are being selected.

ndkeen commented 7 years ago

I also noticed that there are several calls that use iotype=8 instead of iotype=5. But I read somewhere that if the file is netcdf4, it will always use type 8.

Note: I'm logged into a compute node of a running job. Most of the processes are stuck at a barrier in the usual suspect (read_nfdarray); however, at least one of the processes is in a Bcast:

#0  0x0000000001cb7b23 in GNI_CqGetEvent ()
#1  0x000000000188160f in MPID_nem_gni_check_localCQ ()
#2  0x000000000188206a in MPID_nem_gni_poll ()
#3  0x000000000186a208 in MPIDI_CH3I_Progress ()
#4  0x000000000183461e in MPIC_Wait ()
#5  0x00000000018348a7 in MPIC_Recv ()
#6  0x0000000001915cec in MPIR_CRAY_Bcast_Tree ()
#7  0x00000000019166fc in MPIR_CRAY_Bcast ()
#8  0x0000000001822bcb in MPIR_Bcast_impl ()
#9  0x000000000181dfe8 in MPIR_Allreduce_intra ()
#10 0x0000000001913152 in MPIR_CRAY_Allreduce ()
#11 0x000000000181ecf1 in MPIR_Allreduce_impl ()
#12 0x000000000184110a in MPIR_Get_contextid_sparse_group ()
#13 0x000000000183f6e3 in MPIR_Comm_copy ()
#14 0x0000000001827583 in MPIR_Comm_dup_impl ()
#15 0x0000000001907073 in MPIDI_CRAY_Setup_Shared_Mem_Coll ()
#16 0x00000000019167c5 in MPIR_CRAY_Bcast ()
#17 0x0000000001822bcb in MPIR_Bcast_impl ()
#18 0x00000000018232fb in PMPI_Bcast ()
#19 0x000000000183294a in pmpi_bcast__ ()
#20 0x000000000127860a in shr_mpi_mod_mp_shr_mpi_bcastl0_ ()
#21 0x000000000114ca9e in seq_infodata_mod_mp_seq_infodata_exchange_ ()
#22 0x000000000042dc15 in component_mod_mp_component_init_cc_ ()
#23 0x000000000041d4b7 in cesm_comp_mod_mp_cesm_init_ ()
#24 0x000000000042a842 in MAIN__ ()
#25 0x000000000040bb0e in main ()

Is it just the first bcast?

worleyph commented 7 years ago

About the MPI_BCAST, no idea, but something to keep in mind.

@jonbob , would it be possible to generate netcdf3 versions of the ocean data files to see if performance improves? (Sounds like an expensive or impossible request, but wanted to at least ask if this was easy to do or not.)

worleyph commented 7 years ago

I guess we could also run an old compset that uses pre-v3 data sets and check whether initialization performance is the same or not? Does this make sense?

ndkeen commented 7 years ago
cori11% grep type run/ocn.log.170530-163120
 Using io_type Parallel-NetCDF (CDF-5, large variable support) for mesh stream
         I/O type:           Parallel-NetCDF (CDF-5, large variable support)
         I/O type:           Parallel-NetCDF (CDF-5, large variable support)
         I/O type:           Parallel-NetCDF (CDF-5, large variable support)
         I/O type:           Parallel-NetCDF (CDF-5, large variable support)
         I/O type:           Parallel-NetCDF
         I/O type:           Serial NetCDF
         I/O type:           Serial NetCDF
         I/O type:           Parallel-NetCDF
         I/O type:           Parallel-NetCDF
         I/O type:           Parallel-NetCDF
         I/O type:           Parallel-NetCDF
         I/O type:           Parallel-NetCDF
         I/O type:           Parallel-NetCDF
         I/O type:           Parallel-NetCDF
         I/O type:           Parallel-NetCDF
         I/O type:           Parallel-NetCDF
         I/O type:           Parallel-NetCDF (CDF-5, large variable support)
         I/O type:           Parallel-NetCDF
         I/O type:           Parallel-NetCDF
         I/O type:           Parallel-NetCDF
         I/O type:           Parallel-NetCDF (CDF-5, large variable support)
         I/O type:           Parallel-NetCDF
         I/O type:           Parallel-NetCDF
         I/O type:           Parallel-NetCDF
         I/O type:           Parallel-NetCDF (CDF-5, large variable support)
         I/O type:           Parallel-NetCDF
         I/O type:           Parallel-NetCDF
         I/O type:           Parallel-NetCDF

And the "serial" mentions:

-----  found stream "globalStatsOutput" in streams.ocean  -----
         filename template:  mpaso.hist.am.globalStats.$Y-$M-$D.nc
         filename interval:  00-01-00_00:00:00
         clobber mode:       truncate
         I/O type:           Serial NetCDF
         direction:          output
         reference time:     initial_time
         record interval:    -
         output alarm:       00-00-01_00:00:00
         package:            globalStatsAMPKG

 -----  found stream "surfaceAreaWeightedAveragesOutput" in streams.ocean  -----
         filename template:  mpaso.hist.am.surfaceAreaWeightedAverages.$Y-$M-$D.nc
         filename interval:  00-01-00_00:00:00
         clobber mode:       truncate
         I/O type:           Serial NetCDF
         direction:          output
         reference time:     initial_time
         record interval:    -
         output alarm:       00-01-00_00:00:00
         package:            surfaceAreaWeightedAveragesAMPKG
worleyph commented 7 years ago

If we can identify which reads are causing the problem, it may align with the I/O types. I'll continue trying to look at that.

vanroekel commented 7 years ago

@worleyph and @ndkeen, the IO choices in the ocean are due to a few variables violating cdf-2 (netcdf3) size constraints at high-res. We cannot run a netcdf3 version of the high-res ocean for this reason: the model will not read or write the model state correctly (see #1456 for more information). The only exception to the netcdf4 requirement is if you compile the model with pnetcdf 1.5.0, which has a bug in the variable size check that allows us to write and read at high res.

worleyph commented 7 years ago

@vanroekel , thanks for the explanation. So, there is no pnetcdf support for netcdf4 file formats? Will there be such support some day?

vanroekel commented 7 years ago

Sorry for being unclear, pnetcdf does support netcdf4 (I am using it without issue in the high-res g-case), but we cannot use netcdf3 in pnetcdf for the high res ocean.

worleyph commented 7 years ago

Sorry for not knowing what I am talking about ... all ocean initialization I/O calls appear to go through the branch

 case(pio_iotype_netcdf4p)
       ierr= nf90_get_var(File%fh, vardesc%varid, iobuf, start=int(start),count=int(count))

and this is within the #ifdef _NETCDF code branch. The sea ice initialization goes through

 case(pio_iotype_pnetcdf)
       ierr=nfmpi_get_vara_all( File%fh,varDesc%varid, & 
      ...

and this is within the #ifdef _PNETCDF code branch. Perhaps both (_NETCDF, _PNETCDF) are defined. But pio_iotype_pnetcdf is only netcdf3? And do parallel reads through pio_iotype_netcdf4p (nf90_get_var) have performance similar to nfmpi_get_vara_all? Thanks.

vanroekel commented 7 years ago

I'm certainly no expert here either! I think some of the confusion is that we are talking about cdf2 versus cdf5, not netcdf3 versus netcdf4. Netcdf4 supports cdf5 (netcdf3 does not), but they are not the same thing.

For example, taking a quick look at mpas_io.F in the MPAS framework, for our large-file support we go down the following bit of code for the PIO/pnetcdf initialization:

         else if (ioformat == MPAS_IO_PNETCDF5) then 
            pio_iotype = PIO_iotype_pnetcdf
            pio_mode = PIO_64BIT_DATA

where ioformat is set by io_type in streams.ocean; for our large files this is pnetcdf,cdf5. So I think the ocean will also follow the same path as the sea-ice initialization (your second block of code above). The key is that we are not using netcdf4 here, but parallel-netcdf with cdf5 support.

worleyph commented 7 years ago

Looking at the timing output, ocean and sea ice are not doing the same thing (at least during initialization). In fact, ocean reads all streams with one PIO option (pio_iotype_netcdf4p), and sea ice reads all streams with a different PIO setting (pio_iotype_pnetcdf). This does not seem consistent with the logic that you are describing. Any suggestions on how to determine what is happening?

vanroekel commented 7 years ago

This does not seem right to me. The way I read mpas_io.F, the only way to get to pio_iotype_netcdf4p is to have selected netcdf4 for io_type in streams.ocean, which is never done. So I have no idea how mpas-o is using this part of PIO. Perhaps @jonbob or @mark-petersen may have a thought on how to pursue this further.

ndkeen commented 7 years ago

Oh, now wait a minute. What Pat is saying is sinking in -- I've been trying to find where all of the time is going in the function read_nfdarray_{TYPE}, and I've been ignoring the code under #ifdef _NETCDF because I thought we were using PNETCDF, not both. I will now add timers there -- this could explain a lot!

worleyph commented 7 years ago

@ndkeen, I don't know if you have seen this, but one of my previously successful runs (1 day, regular queue, 2-hour limit) has also now failed twice, running out of time during ocean initialization -- it appears to have hung or, at least, to have taken twice as long as the successful run. So not only is performance slow, it is also highly variable. This is a "small" (32-node) case. I did modify the code, but only to add timers in this latest run -- no new barriers.

worleyph commented 7 years ago

@ndkeen, with Cori-KNL down today, I moved my experiments back to Titan. Even though Titan is 6-10X faster here, it may still help identify the location of the performance issue.

Almost all of the ocean initialization time is spent reading oRRS18to6v3.170111.nc. I added a timer for each stream read that shows up in the routine read_stream, and number 1 (00001) was for the above file. The timer o_i:PIO:pio_read_darray is called 55 times by process 1, but most of the time is in reading the above file:

            "o_i:read_stream_readStream00001"  -        1    -     635.074707   635.074707   635.074707         0.000000
              "o_i:PIO:pio_read_darray"      -       55    -     636.396301   369.389069     0.024624         0.000013

where 46 of the calls to o_i:PIO:pio_read_darray are from o_i:read_stream_readStream00001

   9 "o_i:mpas-o_init1"
  46 "o_i:read_stream_readStream00001"
  55   "o_i:PIO:pio_read_darray"

I still want to figure out which of these calls to o_i:PIO:pio_read_darray is so expensive, since one call represents over half of the total cost. Most of this cost is in

                  "o_i:PIO:pio_read_nfdarray_double"   -       32    -     631.317505   369.085114     0.068762         0.000007

and in

                    "o_i:PIO:pio_read_nfdarray_double_B"   -       32    -     614.154175   360.545288     0.001225         0.000007

which wraps the call

      ierr= nf90_get_var(File%fh, vardesc%varid, iobuf, start=int(start),count=int(count))

Little time is spent in the check_netcdf routine.

I'll keep poking.
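
For reference, the per-stream timers are roughly of this form (a sketch, not the exact code; stream_count and timer_name are variables I introduced, and I'm assuming the GPTL t_startf/t_stopf wrappers that produce the other timers shown here):

   ! hypothetical sketch of a per-stream timer in read_stream: build a
   ! numbered label so each stream read gets its own entry in the timing output
   use perf_mod, only : t_startf, t_stopf
   character(len=64) :: timer_name
   integer, save :: stream_count = 0

   stream_count = stream_count + 1
   write(timer_name, '(a,i5.5)') 'read_stream_readStream', stream_count
   call t_startf(trim(timer_name))
   ! ... existing PIO read calls for this stream ...
   call t_stopf(trim(timer_name))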

worleyph commented 7 years ago

Okay ... in the PIO function open_nf (in file ionf_mod.F90) there is logic that tries to open a file using a given PIO type; if this fails, the PIO type is changed. For oRRS18to6v3.170111.nc, the PIO file type starts as '5' (== PIO_iotype_pnetcdf). However, the error return from

 ierr  = nfmpi_open(File%iosystem%IO_comm,fname,amode,File%iosystem%info,File%fh)

is -51, and the PIO type is then changed to '8' (== pio_iotype_netcdf4p).

This is a question for @jayeshkrishna (when he returns), but does anyone else know whether there are performance implications for this change in PIO type?

Note:

 integer(i4), public, parameter :: PIO_INTERNAL_ERROR = -51

so this is not a very informative error return code :-).
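
In outline, the fallback looks like this (a paraphrased sketch of the open_nf logic, not the verbatim PIO source):

   ! paraphrased sketch of the retry logic in open_nf (ionf_mod.F90)
   case (pio_iotype_pnetcdf)
      ierr = nfmpi_open(File%iosystem%IO_comm, fname, amode, &
                        File%iosystem%info, File%fh)
      if (ierr /= PIO_noerr) then
         ! pnetcdf did not recognize the file (here: -51), so switch this
         ! file's iotype and retry through the parallel netCDF-4 path
         ! (nf90_open), which is why the reads later show up as iotype 8
         File%iotype = pio_iotype_netcdf4p
         ...
      end if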

ndkeen commented 7 years ago

Thanks Pat. It's great that you can track it back to a filename. I would like to be able to see which file is being opened/read and how much time is associated with each -- I would think we would ideally want that as part of the diagnostics anyway?

I've been printing out the error code as well (-51). I had not pieced together that this was happening because PIO tried to open the file one way, failed, and then tried a different way. Nice work.

I note that most of those 55 calls use iotype=5, while the rest use 8. Are you saying it's all because of one file?

worleyph commented 7 years ago

I'm away from my laptop at the moment. I'm surprised that most of these 55 calls are with type 5. As far as I can tell, 46 are for reading this one file, and the issue shows up at the file open. I didn't check how many times the file was opened -- I assumed just once (per IO process). I'll look at the data again later today.

worleyph commented 7 years ago

Looking at the output from my latest Titan run, the only file that is failing the file open (-51) is oRRS18to6v3.170111.nc . Each PIO process opens this file 4 times (and it fails each time for each process).

worleyph commented 7 years ago

The pnetcdf error (-51) is

 PARAMETER (NF_ENOTNC       = -51)   ! Not a netcdf file

Not sure what this means for this file.
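
One way to see what pnetcdf is actually being handed is to check the file's magic bytes: classic netCDF files begin with 'CDF' plus a version byte (1, 2, or 5 for CDF-1/CDF-2/CDF-5), while netCDF-4 files are HDF5 files with the HDF5 signature, and pnetcdf can only open the classic variants. A small stand-alone checker (hypothetical, not part of the model; ncdump -k reports the same information):

   ! hypothetical stand-alone checker: print the on-disk format of a netCDF
   ! file from its first four bytes, to see whether a file that pnetcdf
   ! rejects with NF_ENOTNC is a classic CDF variant or an HDF5-based
   ! netCDF-4 file
   program nc_magic
      implicit none
      character(len=256) :: fname
      character(len=1)   :: b(4)
      integer :: ios

      call get_command_argument(1, fname)
      open(unit=10, file=trim(fname), access='stream', form='unformatted', &
           status='old', action='read', iostat=ios)
      if (ios /= 0) stop 'cannot open file'
      read(10) b
      close(10)

      if (b(1)//b(2)//b(3) == 'CDF') then
         select case (ichar(b(4)))
         case (1)
            print *, 'classic netCDF (CDF-1)'
         case (2)
            print *, '64-bit offset netCDF (CDF-2)'
         case (5)
            print *, '64-bit data netCDF (CDF-5)'
         case default
            print *, 'unrecognized CDF variant'
         end select
      else if (ichar(b(1)) == 137 .and. b(2)//b(3)//b(4) == 'HDF') then
         print *, 'netCDF-4 / HDF5 file (pnetcdf cannot open these)'
      else
         print *, 'not a recognized netCDF magic number'
      end if
   end program nc_magic

e.g. ./nc_magic oRRS18to6v3.170111.nc after compiling with any Fortran compiler.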

bishtgautam commented 7 years ago

oRRS18to6v3.170111.nc is a 27GB file. Is there a limit on file size for pnetcdf?

vanroekel commented 7 years ago

@bishtgautam, there is no file size limit here, but with non-cdf5 files there is a 4 GB limit on individual variables.

@ndkeen and @worleyph, I had a thought on the pnetcdf error. You are both using an extremely old initial condition here. It is technically a cdf2 file that violates the size constraints (it takes advantage of a bug in parallel-netcdf/1.5.0). I wonder if using an updated parallel-netcdf version and forcing the io_type to pnetcdf,cdf5 may cause issues when reading this file. Could you try a test pointing at this file instead? For the ocean, use

/global/cscratch1/sd/lvroekel/mpaso.rst.0039-01-01_00000.nc

and for the ice initial condition use

/global/cscratch1/sd/lvroekel/mpascice.rst.0039-01-01_00000.nc

Although for a short test such as this, you may not need to worry about changing the ice initial condition.

Both of these files are at NERSC.