worleyph closed 7 years ago
Using pure MPI does require more memory, and trying to fit 1/4 degree on this many nodes, memory is one of the concerns? Hence what about a x4 or x8 configuration? I think @amametjanov may have experience with this on Mira (trying to find working configurations which use low thread counts but still fit into memory).
Thanks. I increased to 5400x1 (noHT), with exactly the same error. I think that @ndkeen indicated that he had run an F case with this decomposition on Cori. I'll try even larger when I get the chance, but will also continue debugging at 5400x1 (since it seems to run at least once per day). @rljacob , latest information is from a debug statement added to initd_ in m_GlobalSegMap.F90. Here
0000: NGSEG = 0
0000: m_GlobalSegMap::initp_: non-positive value of ngseg error, stat =0
so ngseg is zero (and not negative) after coming out of the loop:
    ngseg = 0
    do i=0,npes-1
       ngseg = ngseg + counts(i)
       if(i == 0) then
          displs(i) = 0
       else
          displs(i) = displs(i-1) + counts(i-1)
       endif
    end do
I assume that this means that counts(:) == 0, but I'll verify as well.
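For reference, here is a minimal Python sketch of the same accumulation (not the MCT source; the function name and sample inputs are made up for illustration). Because per-process segment counts cannot be negative, the sum can only come out as zero when every entry of counts is zero, which is the assumption stated above.

```python
def accumulate_segments(counts):
    """Mirror the Fortran loop: return (ngseg, displs) for per-process counts."""
    ngseg = 0
    displs = []
    for i, c in enumerate(counts):
        ngseg += c
        # displs(i) is the exclusive prefix sum of counts
        displs.append(0 if i == 0 else displs[i - 1] + counts[i - 1])
    return ngseg, displs


# Hypothetical inputs: a healthy decomposition and an all-zero one.
print(accumulate_segments([2, 0, 3]))  # (5, [0, 2, 2])
print(accumulate_segments([0, 0, 0]))  # (0, [0, 0, 0])
```

The second call reproduces the failing signature: ngseg is zero and every displacement is zero.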
@worleyph The maps for ne120np4_oRRS15to5 are largely untested, just as a warning. And I ended up with a map for a different resolution that was bad but the mapping tools gave no warning or error creating it. So just a heads up...
@jonbob, how would I determine whether this is the source of my problems? @rljacob , has anyone run this compset and grid resolution successfully?
@worleyph : we might have to build up to the full A_WCYCL compset. If you don't figure out the problem, maybe first make sure atm/lnd compsets work, and ocn/ice as well. And then we could try all the active components together...
Yes, I'd increase thread counts and also increase pio stride; it looks like a re-arranger problem.
No one has run this compset/resolution yet. @worleyph try just the F-case first on Titan. See https://acme-climate.atlassian.net/browse/CSG-163
@rljacob - already ran an F case on Titan (successfully), a couple of weeks ago.
I'm also getting the same error on Mira:
5218: m_GlobalSegMap::initp_: non-positive value of ngseg error, stat =0
5218: ***.MCT(MPEU)::die.: from m_GlobalSegMap::initp_()
5218: Abort(2) on node 5218 (rank 5218 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 2) - process 5218
Tried 3 different PE layouts and PIO settings and all show the ngseg error.
What is the ocn/ice only -compset and -res? Need to rule out inputdata problems.
One of the GSMaps created as part of the ocean-coupler interaction is getting bad data (ngseg=0). I don't think this has anything to do with PIO.
@amametjanov , since you are seeing this also, I assume that we can eliminate memory problems (if only because memory problems tend not to have the same signature on Mira and Cori).
@rljacob ,
One of the GSMaps created as part of the ocean-coupler interaction is getting bad data (ngseg=0). I don't think this has anything to do with PIO.
Does this imply a bad map then? Should I keep trying to debug this, or can this be addressed some other way? I've tracked it into the call to
lsize_ = AttrVect_lsize(sMat%data)
where AttrVect_lsize has
List_allocated(aV%iList) = .true.
associated(aV%iAttr) = .true.
size(aV%iAttr,2) = 0
List_allocated(aV%rList) = .true.
associated(aV%rAttr) = .true.
size(aV%rAttr,2) = 0
I'm trying to work backwards from the sMat for this call, and it is taking some time (waiting in the Cori queue).
Yes it likely implies a bad map.
And how do we figure this out? Is there a way to do this outside of running the model? It sounds like I am wasting my time continuing with my current approach.
Actually from the cpl.log you pasted it read the basic parameters of the map correctly:
(shr_mct_sMatReaddnc) * matrix dims src x dst : 777602 x 5778136
(shr_mct_sMatReaddnc) * number of non-zero elements: 92387502
But one of the nodes has no mapping data (lsize = 0). I'm not sure how that can happen.
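One way to look at a map outside the model is to check its index arrays against the header dimensions. The sketch below is only an illustration in plain Python: it assumes the destination and source indices of the nonzeros have already been read into lists (in a mapping file these are usually the `row` and `col` variables with sizes `n_a`, `n_b`, `n_s`, but treat those names as an assumption and check the actual file). `check_map` is a hypothetical helper, not part of any ACME or MCT tool. Since the header values above were read correctly, a check like this would only catch some kinds of bad maps.

```python
def check_map(n_a, n_b, rows, cols):
    """Return basic sanity facts about a sparse map given as index lists.

    n_a, n_b   : source and destination grid sizes
    rows, cols : 1-based destination and source index of each nonzero
    """
    if len(rows) != len(cols):
        raise ValueError("rows and cols must have the same length")
    bad_rows = sum(1 for r in rows if not 1 <= r <= n_b)
    bad_cols = sum(1 for c in cols if not 1 <= c <= n_a)
    covered = len({r for r in rows if 1 <= r <= n_b})
    return {
        "n_s": len(rows),
        "rows_out_of_range": bad_rows,
        "cols_out_of_range": bad_cols,
        "dst_cells_without_weights": n_b - covered,
    }


# Tiny made-up map: 3 source cells, 4 destination cells, 4 nonzeros.
report = check_map(n_a=3, n_b=4, rows=[1, 2, 2, 4], cols=[1, 1, 3, 2])
print(report)
```

For the map in this thread the corresponding inputs would be n_a = 777602 and n_b = 5778136, with 92387502 entries in each index list.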
My latest debug writes made the ocn/ice init error on Titan disappear. It then died in the same location as I saw on Cori and @amametjanov saw on Mira. The Titan PE layout was 2700x4. So, this is persistent across architectures.
But one of the nodes has no mapping data (lsize = 0). I'm not sure how that can happen.
What is a node here? lsize is zero for all processes for this map.
@worleyph let me take another look at these maps -- we had another one that I made around the same time turn out to be bad -- despite getting no errors or warnings from the tools that generated them
@jonbob, thanks. I'll keep poking, as a background activity.
@amametjanov:
What is the ocn/ice only -compset and -res? Need to rule out inputdata problems.
You can do:
-compset GMPAS -res T62_oQU120
To test ocn/ice only.
@worleyph - I think at the very least we have a bad domain file for the ocean. I'll try to regenerate it and see if I can get something rational. In the meantime, I don't think there's any point to continued testing.
@amametjanov - I know you're also trying to work on this resolution. I have not yet made any maps for the data models to oRRS15to5 -- so nothing like T62_oRRS15to5. I can do that if it would be helpful, but let me try to figure out this domain file issue first.
@jonbob, yeah, I'll wait for the resolution.
@worleyph - Can you please try again, but using domain.ocn.ne120np4_oRRS15to5.160427.nc instead of the one from February? It's in the inputdata repo as of right now. Or I can experiment with these files, if you can help me know the processor configuration you were playing with.
@jonbob, job is in the queue.
Correction - have to get it from the repository first. What is the full path name?
Thanks @worleyph . I'll keep looking at the other ones, but this one was definitely bad -- the ocean mask was 1 everywhere. I have no idea how the gen_domain tool could spit that out, but it did....
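A cheap guard against this kind of file is to summarize the mask before using it. The following is a minimal sketch, assuming the mask has already been read from the domain file into a flat list of 0/1 values; `mask_summary` is a hypothetical helper and not part of gen_domain. Whether an all-ones mask is actually wrong depends on the grid, so the flag only marks the file for a closer look.

```python
def mask_summary(mask):
    """Count ocean (1) and non-ocean points in a flattened domain mask."""
    n_ocean = sum(1 for m in mask if m == 1)
    n_other = len(mask) - n_ocean
    return {
        "n_ocean": n_ocean,
        "n_other": n_other,
        # True when the mask is non-empty and every point is marked ocean
        "all_ocean": len(mask) > 0 and n_other == 0,
    }


# Made-up masks: one mixed, one that is 1 everywhere.
print(mask_summary([1, 0, 1, 1]))
print(mask_summary([1, 1, 1, 1]))
```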
@jonbob , wasn't sufficient. Job died in the identical location with the identical error message.
Thanks @worleyph - that domain file was definitely bad, so let me keep looking
@worleyph - it still dies after reading the map_ne120np4_to_oRRS15to5_patch.160203.nc file? If you point me at a pe-layout, I can take on some of this testing. But my next sanity check would be to replace that patch file with the equivalent aave file and see if there's just an issue in that one map. I'll also try regenerating it, and maybe a bilin one as well...
Look in
/global/homes/w/worleyph/ACME/master/ACME/cime/scripts/A_WCYCL2000.ne120_oRRS15_corip1_peexpt
at
env_mach_pes.xml_failed
env_mach_pes.xml_failed2
env_mach_pes.xml_failed3
env_mach_pes.xml
They get progressively larger, and all generate the same error, so you might as well start with the smallest one? If there are memory problems with this, it will at least occur after the current failure location, generating a different error message. If you have any doubts, you can also go with the largest one (env_mach_pes.xml).
Thanks again, @worleyph - I'll let you know if I get anywhere
@worleyph - OK, I tested using the aave map instead of the patch map and it got past the point your runs have been dying at -- so there's a problem with that file. I'll work on generating a bilin map to test as well. In the meantime, the aave map is valid but we shouldn't use it for science -- mostly it just allows us to get to the next issue. So now the end of the cpl log looks like:
(prep_ice_merge) x2i%Fioo_q = = o2x%Fioo_q
(prep_ice_merge) x2i%Fioo_meltp = = o2x%Fioo_meltp
(prep_ice_merge) x2i%Fioo_frazil = = o2x%Fioo_frazil
(prep_ice_merge) x2i%Fixx_rofi = = (g2x%Figg_rofi + r2x%Firr_rofi)*flux_epbalfact
and the cesm log has this:
0000: MCT::mRouter::initp: GSMap indices not increasing...Will correct
0000: MCT::mRouter::initp: RGSMap indices not increasing...Will correct
0000: MCT::mRouter::initp: RGSMap indices not increasing...Will correct
0000: MCT::mRouter::initp: GSMap indices not increasing...Will correct
srun: error: nid00620: task 734: Aborted
I tried again and got a core file. The traceback looks like:
@worleyph I'll try building with your largest env_mach_pes file and see if it will get through the queues for a quick test overnight
@jonbob, tell me what I need to know to duplicate your success (and new failure). I'll try it as well.
@worleyph in the env_run file, I just replaced any instances of the patch mapping file with the corresponding aave map. I guess that one is obviously bad, so I'll make a bilinear version today and hope it's functional.
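The substitution described here can be sketched as a plain text replacement. This is only an illustration: the entry id and the aave file name in the sample are placeholders (the thread does not give them), and `swap_map_file` is a hypothetical helper rather than a CIME tool.

```python
def swap_map_file(xml_text, old_name, new_name):
    """Replace every occurrence of old_name with new_name; return (text, count)."""
    count = xml_text.count(old_name)
    return xml_text.replace(old_name, new_name), count


# Hypothetical snippet; the entry id and the aave file name are placeholders.
sample = '<entry id="EXAMPLE_MAPNAME" value="map_ne120np4_to_oRRS15to5_patch.160203.nc" />'
new_text, n = swap_map_file(
    sample,
    "map_ne120np4_to_oRRS15to5_patch.160203.nc",
    "map_ne120np4_to_oRRS15to5_aave.EXAMPLE.nc",
)
print(n)
print(new_text)
```

Returning the replacement count makes it easy to confirm that every instance of the patch map was actually found before rerunning.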
@worleyph - I've generated two new mapping files for ne120np4=>oRRS15to5. I'll upload them and test them before making them public -- at least hope they can run past initialization and to the same point our current setup gets to... If they're OK, I'll upload them to the svn repo.
@jonbob , may take a while for me to jump back in - Titan is behaving poorly for me at the moment and I have some other tasks that I need to focus on. Thanks for the info, and please continue to lead the activity.
@jonbob, thanks for pointing to the new mapping files. Trying them out on Mira. Are you getting the same error as in the previous traceback; what's the path to your run-dir?
@amametjanov My run directory is a bit of a mess -- I was testing different pe counts and both of the mapping files. But I'm working on cori at: /global/homes/j/jonbob/ACME/cime/scripts/SMS.ne120np4_oRRS15to5.A_WCYCL2000.corip1_intel.160427-110101 I'll try to check the permissions before I go today, so no guarantee you'll be able to see it all. I don't get a core file with every failure, so I'm also submitting a debug version of that same setup -- and I'll let you know what it points to.
The new mapping files appear to be working: the run went past the initialization where it was failing previously.
It still failed with a seg-fault, but at call t_startf('remap_Q_ppm') in components/homme/src/share/vertremap_mod_base.F90:527, and this is probably due to out-of-memory issues. I'm looking into a working configuration.
@amametjanov Any luck with runs over the weekend? I did get a corefile with one of my tests on cori, but pointing back to the atm as having the problem:
Does that make any sense to you?
Not yet, got another error at the same location with a different PE configuration on 2K nodes, trying on 4K nodes.
The stack-trace is similar to yours:
remap_q_ppm
/gpfs/mira-home/azamatm/repos/ACME-integration/components/homme/src/share/vertremap_mod_base.F90:527
remap1
/gpfs/mira-home/azamatm/repos/ACME-integration/components/homme/src/share/vertremap_mod_base.F90:107
vertical_remap
/gpfs/mira-home/azamatm/repos/ACME-integration/components/homme/src/share/prim_advection_mod_base.F90:2142
prim_run_subcycle
/gpfs/mira-home/azamatm/repos/ACME-integration/components/homme/src/share/prim_driver_mod.F90:1507
__dyn_comp_NMOD_dyn_run$$OL$$1
/gpfs/mira-home/azamatm/repos/ACME-integration/components/cam/src/dynamics/se/dyn_comp.F90:406
dyn_run
/gpfs/mira-home/azamatm/repos/ACME-integration/components/cam/src/dynamics/se/dyn_comp.F90:392
@amametjanov : I did get the A_WCYCL2000 ne120_oRRS15 to run last night on edison, using both the intel and gnu compilers. My tests were under debug mode and only ran a limited number of timesteps, but all components did initialize and run successfully. I'll try today in optimized mode, and work to get necessary model configuration changes into the scripts. I was using "next" from the repo, to pick up a fix to rtm...
@jonbob, would you advise waiting until you get the scripts updated, or can you tell me how to repeat the experiment with the current master or next? Thanks.
@worleyph : I can point you to my modifications on edison, or just list the namelist changes and pe-layout, whichever is easier. And depending on whether or not you intend to work over this holiday weekend.
I tried to run this case on Cori and found a bug which is fixed in #903. I tried running it again on EOS with the bug fix and ran out of time. I have resubmitted it again on Cori and EOS to see if it runs there (debug flags with Intel compiler). @jonbob : Are you using the code post #903 fix?
@singhbalwinder : yes, I was running with next from yesterday
@jonbob , I'll wait until next week. I'll bother you again then. Thanks.
@worleyph sounds good -- I hope that means you're getting a real holiday weekend. I'm going to keep pushing a little, at least get it to run a 5-day smoke test successfully on a couple of different platforms.
@jonbob: "I hope that means you're getting a real holiday weekend."
H'mm - has my spouse been talking to you? :-). Thanks for continuing to push this.
I've been trying to find feasible PE layouts for
on Cori. I started with a small (1024x1, stacked, noHT) layout, which failed. I then tried (2048x1, stacked, noHT), and most recently 3600x1 for atmosphere, coupler, and land (3616x1) with the other components on their own compute nodes using a 2048x1 decomposition. Again this is all noHT:
I am getting the identical error for all three of these. From cesm.log:
cpl.log ends with
I'll try another increase in compute nodes for the ocean, but if anyone has any other suggestions, I'd appreciate it. Note that I (personally) do not have this compset working anyplace yet. On Titan there is a failure in ice or ocean initialization, so earlier than this in the execution.