
A_WCYCL2000 ne120_oRRS15: mapping error (Cori, Mira, and Titan) #864

Closed: worleyph closed this issue 7 years ago

worleyph commented 8 years ago

I've been trying to find feasible PE layouts for

 -compset A_WCYCL2000 -res ne120_oRRS15

on Cori. I started with a small (1024x1, stacked, noHT) layout, which failed. I then tried (2048x1, stacked, noHT), and most recently 3600x1 for atmosphere, coupler, and land (3616x1), with the other components on their own compute nodes using a 2048x1 decomposition. Again, this is all noHT:

 <entry id="MAX_TASKS_PER_NODE"   value="32"  />

I am getting the identical error for all three of these. From cesm.log:

 0000: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
 0000: m_GlobalSegMap::initp_: non-positive value of ngseg error, stat =0
 0000: 000.MCT(MPEU)::die.: from m_GlobalSegMap::initp_()
 0000: Rank 0 [Sat Apr 23 19:39:16 2016] [c1-0c1s13n1] application called MPI_Abort(MPI_COMM_WORLD, 2) - process 0
 0000: forrtl: error (76): Abort trap signal
 ...
 0000: cesm.exe           0000000002F2670F  m_dropdead_mp_die          87  m_dropdead.F90
 0000: cesm.exe           0000000002F258DF  m_die_mp_die2__           165  m_die.F90
 0000: cesm.exe           0000000002EA8106  m_globalsegmap_mp        2433  m_GlobalSegMap.F90
 0000: cesm.exe           0000000002EF9A08  m_router_mp_initp         364  m_Router.F90
 0000: cesm.exe           0000000002EECE6C  m_rearranger_mp_i         153  m_Rearranger.F90
 0000: cesm.exe           0000000002EE3FDF  m_sparsematrixplu         522  m_SparseMatrixPlus.F90
 0000: cesm.exe           0000000002C52CF3  shr_mct_mod_mp_sh         355  shr_mct_mod.F90
 0000: cesm.exe           00000000004B94DD  seq_map_mod_mp_se         191  seq_map_mod.F90
 0000: cesm.exe           000000000044BC91  prep_ocn_mod_mp_p         259  prep_ocn_mod.F90
 0000: cesm.exe           0000000000411C7A  cesm_comp_mod_mp_        1582  cesm_comp_mod.F90

cpl.log ends with

 (seq_mct_drv) : Initialize each component: atm, lnd, rof, ocn, ice, glc, wav
 (component_init_cc:mct) : Initialize component atm
 (component_init_cc:mct) : Initialize component lnd
 (component_init_cc:mct) : Initialize component rof
 (component_init_cc:mct) : Initialize component ocn
 (component_init_cc:mct) : Initialize component ice
 (component_init_cc:mct) : Initialize component glc
 (component_init_cc:mct) : Initialize component wav

 ...

 (prep_ocn_init) : Initializing mapper_Sa2o
 (seq_map_init_rcfile)  called for mapper_Sa2o initialization

 (shr_mct_sMatPInitnc) Initializing SparseMatrixPlus
 (shr_mct_sMatPInitnc) SmatP mapname /project/projectdirs/acme/inputdata/cpl/gridmaps/ne120np4/map_ne120np4_to_oRRS15to5_patch.160203.nc
 (shr_mct_sMatPInitnc) SmatP maptype X
 (shr_mct_sMatReaddnc) reading mapping matrix data decomposed...
 (shr_mct_sMatReaddnc) * file name                  : /project/projectdirs/acme/inputdata/cpl/gridmaps/ne120np4/map_ne120np4_to_oRRS15to5_patch.160203.nc
 (shr_mct_sMatReaddnc) * matrix dims src x dst      :     777602 x   5778136
 (shr_mct_sMatReaddnc) * number of non-zero elements:   92387502
 (shr_mct_sMatReaddnc) ... done reading file

I'll try another increase in compute nodes for the ocean, but if anyone has any other suggestions, I'd appreciate them. Note that I (personally) do not have this compset working anywhere yet. On Titan there is a failure in ice or ocean initialization, which occurs earlier in the run than this.

mt5555 commented 8 years ago

Using pure MPI does require more memory, and when trying to fit 1/4 degree onto this many nodes, memory is one of the concerns. Hence, what about an x4 or x8 threaded configuration? I think @amametjanov may have experience with this on Mira (trying to find working configurations that use low thread counts but still fit into memory).
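
To put a very rough number on the memory side (back-of-the-envelope only, using the matrix size reported in the cpl.log above; it ignores everything except the mapping weights themselves):

```python
# Rough, illustrative estimate only: global size of the atm->ocn mapping matrix
# reported by shr_mct_sMatReaddnc above, assuming one 8-byte weight plus two
# 4-byte (row, col) indices per non-zero element.
n_nonzero = 92_387_502                 # from the cpl.log above
bytes_per_element = 8 + 2 * 4          # weight + row index + col index
total_gib = n_nonzero * bytes_per_element / 2**30
print(f"mapping matrix alone: ~{total_gib:.1f} GiB before decomposition")
# -> roughly 1.4 GiB globally, spread across ranks after the decomposed read,
#    on top of the model state that pure-MPI layouts already duplicate per task
```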

worleyph commented 8 years ago

Thanks. I increased to 5400x1 (noHT), with exactly the same error. I think that @ndkeen indicated that he had run an F case with this decomposition on Cori. I'll try even larger layouts when I get the chance, but will also continue debugging at 5400x1 (since it seems to run at least once per day). @rljacob, the latest information is from a debug statement added to initd_ in m_GlobalSegMap.F90:

 0000: NGSEG = 0
 0000: m_GlobalSegMap::initp_: non-positive value of ngseg error, stat =0

so ngseg is zero (and not negative) after coming out of the loop:

      ngseg = 0
      do i=0,npes-1
         ngseg = ngseg + counts(i)
         if(i == 0) then
            displs(i) = 0
         else
            displs(i) = displs(i-1) + counts(i-1)
         endif
      end do

I assume that this means that counts(:) == 0, but I'll verify as well.

jonbob commented 8 years ago

@worleyph Just as a warning, the maps for ne120np4_oRRS15to5 are largely untested. I also ended up with a bad map for a different resolution even though the mapping tools gave no warning or error while creating it. So just a heads up...

worleyph commented 8 years ago

@jonbob, how would I determine whether this is the source of my problems? @rljacob , has anyone run this compset and grid resolution successfully?
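
For what it's worth, would an offline sanity check of the weight file along these lines make sense? This is only a minimal sketch, assuming the file follows the standard ESMF/SCRIP weight-file layout (dimensions n_a, n_b, n_s; variables col, row, S); the path is the one from the cpl.log above.

```python
# Minimal sketch: sanity-check an ESMF/SCRIP-style mapping weight file.
# Assumes the standard variable names (n_a, n_b, n_s, col, row, S).
import numpy as np
from netCDF4 import Dataset

path = ("/project/projectdirs/acme/inputdata/cpl/gridmaps/ne120np4/"
        "map_ne120np4_to_oRRS15to5_patch.160203.nc")

with Dataset(path) as nc:
    n_a = nc.dimensions["n_a"].size      # source grid size
    n_b = nc.dimensions["n_b"].size      # destination grid size
    n_s = nc.dimensions["n_s"].size      # number of non-zero weights
    col = nc.variables["col"][:]         # 1-based source indices
    row = nc.variables["row"][:]         # 1-based destination indices
    S = nc.variables["S"][:]             # weights

print(f"src x dst = {n_a} x {n_b}, non-zeros = {n_s}")
assert n_s > 0, "no weights at all"
assert col.min() >= 1 and col.max() <= n_a, "source index out of range"
assert row.min() >= 1 and row.max() <= n_b, "destination index out of range"
assert np.isfinite(S).all(), "non-finite weights"
print("destination rows with at least one weight:", np.unique(row).size)
```

If n_s were zero, indices fell out of range, or only a small fraction of destination rows were touched, that would point at the map itself rather than the PE layout.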

jonbob commented 8 years ago

@worleyph : we might have to build up to the full A_WCYCL compset. If you don't figure out the problem, maybe first make sure atm/lnd compsets work, and ocn/ice as well. And then we could try all the active components together...

amametjanov commented 8 years ago

Yes, I'd increase thread counts and also increase pio stride; it looks like a re-arranger problem.

rljacob commented 8 years ago

No one has run this compset/resolution yet. @worleyph try just the F-case first on Titan. See https://acme-climate.atlassian.net/browse/CSG-163

worleyph commented 8 years ago

@rljacob - already ran an F case on Titan (successfully), a couple of weeks ago.

amametjanov commented 8 years ago

I'm also getting the same error on Mira:

5218: m_GlobalSegMap::initp_: non-positive value of ngseg error, stat =0
5218: ***.MCT(MPEU)::die.: from m_GlobalSegMap::initp_()
5218: Abort(2) on node 5218 (rank 5218 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 2) - process 5218

Tried 3 different PE layouts and PIO settings and all show the ngseg error.

What is the ocn/ice only -compset and -res? Need to rule out inputdata problems.

rljacob commented 8 years ago

One of the GSMaps created as part of the ocean-coupler interaction is getting bad data (ngseg=0). I don't think this has anything to do with PIO.

worleyph commented 8 years ago

@amametjanov , since you are seeing this also, I assume that we can eliminate memory problems (if only because memory problems tend not to have the same signature on Mira and Cori).

@rljacob ,

One of the GSMaps created as part of the ocean-coupler interaction is getting bad data (ngseg=0). I don't think this has anything to do with PIO.

Does this imply a bad map then? Should I keep trying to debug this, or can this be addressed some other way? I've tracked it into the call to

 lsize_ = AttrVect_lsize(sMat%data)

where AttrVect_lsize has

   List_allocated(aV%iList) = .true.
   associated(aV%iAttr) = .true.
   size(aV%iAttr,2) = 0

   List_allocated(aV%rList) = .true.
   associated(aV%rAttr) = .true.
   size(aV%rAttr,2) = 0

I'm trying to work backwards from the sMat for this call, and it is taking some time (waiting in the Cori queue).

rljacob commented 8 years ago

Yes it likely implies a bad map.

worleyph commented 8 years ago

And how do we figure this out? Is there a way to do this outside of running the model? It sounds like I am wasting my time continuing with my current approach.

rljacob commented 8 years ago

Actually from the cpl.log you pasted it read the basic parameters of the map correctly:

 (shr_mct_sMatReaddnc) * matrix dims src x dst      :     777602 x   5778136
 (shr_mct_sMatReaddnc) * number of non-zero elements:   92387502

But one of the nodes has no mapping data (lsize = 0). I'm not sure how that can happen.

worleyph commented 8 years ago

My latest debug writes made the ocn/ice init error on Titan disappear. It then died in the same location as I saw on Cori and @amametjanov saw on Mira. The Titan PE layout was 2700x4. So, this is persistent across architectures.

worleyph commented 8 years ago

But one of the nodes has no mapping data (lsize = 0). I'm not sure how that can happen.

What is a node here? lsize is zero for all processes for this map.

jonbob commented 8 years ago

@worleyph let me take another look at these maps -- we had another one that I made around the same time turn out to be bad -- despite getting no errors or warnings from the tools that generated them

worleyph commented 8 years ago

@jonbob, thanks. I'll keep poking, as a background activity.

douglasjacobsen commented 8 years ago

@amametjanov:

What is the ocn/ice only -compset and -res? Need to rule out inputdata problems.

You can do:

-compset GMPAS -res T62_oQU120

To test ocn/ice only.

jonbob commented 8 years ago

@worleyph - I think at the very least we have a bad domain file for the ocean. I'll try to regenerate it and see if I can get something rational. In the meantime, I don't think there's any point to continued testing.

jonbob commented 8 years ago

@amametjanov - I know you're also trying to work on this resolution. I have not yet made any maps for the data models to oRRS15to5 -- so nothing like T62_oRRS15to5. I can do that if it would be helpful, but let me try to figure out this domain file issue first.

amametjanov commented 8 years ago

@jonbob, yeah, I'll wait for the resolution.

jonbob commented 8 years ago

@worleyph - Can you please try again, but using domain.ocn.ne120np4_oRRS15to5.160427.nc instead of the one from February? It's in the inputdata repo as of right now. Or I can experiment with these files, if you can tell me the processor configuration you were playing with.

worleyph commented 8 years ago

@jonbob, job is in the queue.

Correction - I have to get it from the repository first. What is the full path name?

jonbob commented 8 years ago

Thanks @worleyph . I'll keep looking at the other ones, but this one was definitely bad -- the ocean mask was 1 everywhere. I have no idea how the gen_domain tool could spit that out, but it did....
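
For anyone who wants to repeat the check on other candidate domain files, something like this is enough to spot a mask that is 1 everywhere (a minimal sketch; it assumes the usual gen_domain output with an integer "mask" variable, and uses the new file name from the comment above):

```python
# Minimal sketch: check that an ocean domain file has a plausible land/ocean mask.
# Assumes the usual gen_domain output with an integer "mask" variable.
import numpy as np
from netCDF4 import Dataset

path = "domain.ocn.ne120np4_oRRS15to5.160427.nc"   # new file from the inputdata repo

with Dataset(path) as nc:
    mask = np.asarray(nc.variables["mask"][:])

values, counts = np.unique(mask, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))
# A mask that is 1 at every point (as in the bad February file) is an immediate
# red flag; a valid domain should show a mix of 0 (land) and 1 (ocean) points.
```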

worleyph commented 8 years ago

@jonbob , that wasn't sufficient. The job died in the identical location with the identical error message.

jonbob commented 8 years ago

Thanks @worleyph - that domain file was definitely bad, so let me keep looking

jonbob commented 8 years ago

@worleyph - it still dies after reading the map_ne120np4_to_oRRS15to5_patch.160203.nc file? If you point me at a pe-layout, I can take on some of this testing. But my next sanity check would be to replace that patch file with the equivalent aave file and see if there's just an issue in that one map. I'll also try regenerating it, and maybe a bilin one as well...

worleyph commented 8 years ago

Look in

 /global/homes/w/worleyph/ACME/master/ACME/cime/scripts/A_WCYCL2000.ne120_oRRS15_corip1_peexpt

at

 env_mach_pes.xml_failed
 env_mach_pes.xml_failed2
 env_mach_pes.xml_failed3
 env_mach_pes.xml

They get progressively larger, and all generate the same error, so you might as well start with the smallest one. If there are memory problems with it, they will at least occur after the current failure location and generate a different error message. If you have any doubts, you can also go with the largest one (env_mach_pes.xml).

jonbob commented 8 years ago

Thanks again, @worleyph - I'll let you know if I get anywhere

jonbob commented 8 years ago

@worleyph - OK, I tested using the aave map instead of the patch map and it got past the point your runs have been dying at -- so there's a problem with that file. I'll work on generating a bilin map to test as well. In the meantime, the aave map is valid but we shouldn't use it for science -- mostly it just allows us to get to the next issue. So now the end of the cpl log looks like:

 (prep_ice_merge) x2i%Fioo_q = = o2x%Fioo_q
 (prep_ice_merge) x2i%Fioo_meltp = = o2x%Fioo_meltp
 (prep_ice_merge) x2i%Fioo_frazil = = o2x%Fioo_frazil
 (prep_ice_merge) x2i%Fixx_rofi = = (g2x%Figg_rofi + r2x%Firr_rofi)*flux_epbalfact

and the cesm log has this:

 0000: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
 0000: MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
 0000: MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
 0000: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
 srun: error: nid00620: task 734: Aborted

jonbob commented 8 years ago

I tried again and got a core file. The traceback looks like:

 #0  0x0000000003d9ac5b in raise (sig=) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
 #1  0x0000000003e994f1 in abort () at abort.c:92
 #2  0x0000000003dc380e in for__signal_handler ()
 #3
 #4  0x0000000003d9ac5b in raise (sig=) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
 #5  0x0000000003e994f1 in abort () at abort.c:92
 #6  0x00000000038fbaa2 in MPID_Abort ()
 #7  0x00000000038d3b71 in PMPI_Abort ()
 #8  0x00000000038d8005 in pmpi_abort__ ()
 #9  0x0000000002ad8319 in shr_mpi_mod_mp_shr_mpiabort ()
 #10 0x0000000002b2f818 in shr_sys_mod_mp_shr_sysabort ()
 #11 0x000000000167dd62 in prim_advection_mod_base_mp_verticalremap ()
 #12 0x00000000013d7082 in prim_driver_mod_mp_prim_runsubcycle ()
 #13 0x000000000109f14c in dyn_comp_mp_dynrun ()
 #14 0x0000000000e0e72b in stepon_mp_steponrun3 ()
 #15 0x00000000004d00a7 in cam_comp_mp_camrun3 ()
 #16 0x00000000004bb342 in atm_comp_mct_mp_atm_runmct ()
 #17 0x0000000000419fac in component_mod_mp_componentrun ()
 #18 0x0000000000404316 in cesm_comp_mod_mp_cesmrun ()
 #19 0x0000000000417838 in MAIN__ ()
 #20 0x00000000004012ce in main ()

jonbob commented 8 years ago

@worleyph I'll try building with your largest env_mach_pes file and see if it will get through the queues for a quick test overnight

worleyph commented 8 years ago

@jonbob, tell me what I need to know to duplicate your success (and new failure). I'll try it as well.

jonbob commented 8 years ago

@worleyph in the env_run file, I just replaced any instances of the patch mapping file with the corresponding aave map. I guess that one is obviously bad, so I'll make a bilinear version today and hope it's functional.
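
If it helps reproduce the change, a quick way to list the env_run.xml entries that still point at the patch file is something like the sketch below (illustrative only; the actual entry ids depend on the case setup, so this just matches on the file name):

```python
# Minimal sketch: find which env_run.xml entries reference the suspect patch map,
# so they can be switched to the corresponding aave file by hand or with xmlchange.
import xml.etree.ElementTree as ET

tree = ET.parse("env_run.xml")            # run from the case directory
for entry in tree.iter("entry"):
    value = entry.get("value", "")
    if "map_ne120np4_to_oRRS15to5_patch" in value:
        print(entry.get("id"), "=", value)
```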

jonbob commented 8 years ago

@worleyph - I've generated two new mapping files for ne120np4=>oRRS15to5. I'll upload them and test them before making them public -- at least check that they can run past initialization and get to the same point our current setup gets to... If they're OK, I'll upload them to the svn repo.

worleyph commented 8 years ago

@jonbob , it may take a while for me to jump back in - Titan is behaving poorly for me at the moment and I have some other tasks that I need to focus on. Thanks for the info, and please continue to lead the activity.

amametjanov commented 8 years ago

@jonbob, thanks for pointing to the new mapping files. Trying them out on Mira. Are you getting the same error as in the previous traceback? What's the path to your run-dir?

jonbob commented 8 years ago

@amametjanov My run directory is a bit of a mess -- I was testing different pe counts and both of the mapping files. But I'm working on cori at: /global/homes/j/jonbob/ACME/cime/scripts/SMS.ne120np4_oRRS15to5.A_WCYCL2000.corip1_intel.160427-110101 I'll try to check the permissions before I go today, so no guarantee you'll be able to see it all. I don't get a core file with every failure, so I'm also submitting a debug version of that same setup -- and I'll let you know what it points to.

amametjanov commented 8 years ago

The new mapping files appear to be working: the run went past the initialization where it was failing previously. It still failed with a seg-fault, but at call t_startf('remap_Q_ppm') in components/homme/src/share/vertremap_mod_base.F90:527 and this is probably due to out-of-memory issues. I'm looking into a working configuration.

jonbob commented 8 years ago

@amametjanov Any luck with runs over the weekend? I did get a corefile with one of my tests on cori, but it points back to the atm as having the problem:

 #10 0x0000000002cc96b8 in shr_sys_mod_mp_shr_sysabort ()
 #11 0x00000000017142e2 in prim_advection_mod_base_mp_verticalremap ()
 #12 0x000000000144aef2 in prim_driver_mod_mp_prim_runsubcycle ()
 #13 0x00000000010e482c in dyn_comp_mp_dynrun ()
 #14 0x0000000000e2dcbb in stepon_mp_steponrun3 ()
 #15 0x00000000004da6f7 in cam_comp_mp_camrun3 ()
 #16 0x00000000004c1c02 in atm_comp_mct_mp_atm_runmct ()
 #17 0x0000000000419fac in component_mod_mp_componentrun ()
 #18 0x0000000000404316 in cesm_comp_mod_mp_cesmrun ()
 #19 0x0000000000417838 in MAIN__ ()
 #20 0x00000000004012ce in main ()

Does that make any sense to you?

amametjanov commented 8 years ago

Not yet, got another error at the same location with a different PE configuration on 2K nodes, trying on 4K nodes.

The stack-trace is similar to yours:

 remap_q_ppm
   /gpfs/mira-home/azamatm/repos/ACME-integration/components/homme/src/share/vertremap_mod_base.F90:527
 remap1
   /gpfs/mira-home/azamatm/repos/ACME-integration/components/homme/src/share/vertremap_mod_base.F90:107
 vertical_remap
   /gpfs/mira-home/azamatm/repos/ACME-integration/components/homme/src/share/prim_advection_mod_base.F90:2142
 prim_run_subcycle
   /gpfs/mira-home/azamatm/repos/ACME-integration/components/homme/src/share/prim_driver_mod.F90:1507
 __dyn_comp_NMOD_dyn_run$$OL$$1
   /gpfs/mira-home/azamatm/repos/ACME-integration/components/cam/src/dynamics/se/dyn_comp.F90:406
 dyn_run
   /gpfs/mira-home/azamatm/repos/ACME-integration/components/cam/src/dynamics/se/dyn_comp.F90:392

jonbob commented 8 years ago

@amametjanov : I did get the A_WCYCL2000 ne120_oRRS15 to run last night on edison, using both the intel and gnu compilers. My tests were under debug mode and only ran a limited number of timesteps, but all components did initialize and run successfully. I'll try today in optimized mode, and work to get necessary model configuration changes into the scripts. I was using "next" from the repo, to pick up a fix to rtm...

worleyph commented 8 years ago

@jonbob, would you advise waiting until you get the scripts updated, or can you tell me how to repeat the experiment with the current master or next? Thanks.

jonbob commented 8 years ago

@worleyph : I can point you to my modifications on edison, or just list the namelist changes and pe-layout, whichever is easier -- depending on whether or not you intend to work over this holiday weekend.

singhbalwinder commented 8 years ago

I tried to run this case on Cori and found a bug, which is fixed in #903. I tried running it again on EOS with the bug fix and ran out of time. I have resubmitted it on Cori and EOS to see if it runs there (debug flags with the Intel compiler). @jonbob : Are you using the code with the #903 fix?

jonbob commented 8 years ago

@singhbalwinder : yes, I was running with next from yesterday

worleyph commented 8 years ago

@jonbob , I'll wait until next week. I'll bother you again then. Thanks.

jonbob commented 8 years ago

@worleyph sounds good -- I hope that means you're getting a real holiday weekend. I'm going to keep pushing a little, at least get it to run a 5-day smoke test successfully on a couple of different platforms.

worleyph commented 8 years ago

@jonbob: "I hope that means you're getting a real holiday weekend."

H'mm - has my spouse been talking to you? :-). Thanks for continuing to push this.