Edison: coupled simulation died when creating restart files #854

Closed tangq closed 8 years ago

tangq commented 8 years ago

In the last couple of weeks, about 70% of my coupled simulations on Edison died in the middle of creating restart files (some components created their restart files successfully). The failure seems to occur randomly, and I only see it for coupled simulations, not atmosphere-only ones. Also, the model doesn't fail at the same place when rerunning an identical simulation, suggesting the cause of this issue might be I/O related.

Two examples (the script to reproduce the first example is attached: run_acme.alpha_20160419.A_WCYCL2000.edison.csh.txt):

$casedir: /scratch2/scratchdirs/tang30/ACME_simulations/20160419.A_WCYCL2000.ne30_oEC.edison.alpha4_00/case_scripts
$rundir: /scratch2/scratchdirs/tang30/ACME_simulations/20160419.A_WCYCL2000.ne30_oEC.edison.alpha4_00/run

$casedir: /scratch2/scratchdirs/tang30/ACME_simulations/20160401.A_WCYCL2000.ne30_oEC.edison.alpha4_00/case_scripts
$rundir: /scratch2/scratchdirs/tang30/ACME_simulations/20160401.A_WCYCL2000.ne30_oEC.edison.alpha4_00/run

golaz commented 8 years ago

Edison, the gift that keeps on giving...

mt5555 commented 8 years ago

From your run script, it doesn't look like you are setting PIO_NUMIOTASKS, which might mean you are using the defaults. The defaults usually use many more I/O tasks than is helpful, which can make the run less stable.
So one thing to check would be that the number of I/O tasks is reasonable, say 32 or 64.

jayeshkrishna commented 8 years ago

From the cpl log I see that the number of pio io tasks is 40 (env_run.xml sets stride to 24).

jayeshkrishna commented 8 years ago

I have seen similar errors when I left OCN threaded (which I assume is not the case here).

jayeshkrishna commented 8 years ago

Does the same case succeed (run without failures) in the remaining 30% of attempts? Looking at your runs above, not all of them have the same configuration. So my questions are: has the same case/configuration ever run successfully for you, and has the success rate improved since last week?

mt5555 commented 8 years ago

Since this is I/O related, I would try a much smaller PIO_NUMIOTASKS, around 8 or so. For 1/4 degree, I think 128 is a good number; at 1 degree the files are 16x smaller, suggesting 8 is reasonable.

tangq commented 8 years ago

@jayeshkrishna , sometimes runs from the same configuration are successful. I have only done one test this week (the first case above), so I don't have a large enough sample to say whether the problem has improved, but that first try failed.

@mt5555 , I'll try PIO_NUMIOTASKS=8 for the same case.

tangq commented 8 years ago

@mt5555 , I don't see PIO_NUMIOTASKS in env_run.xml. Do you mean PIO_NUMTASKS?

mt5555 commented 8 years ago

Yes, sorry about that.

In the ACME v0 days, if you set PIO_NUMTASKS without also setting PIO_STRIDE, the PIO_NUMTASKS value would get reset back to the defaults. So you'll have to check the log files to make sure it is really taking effect.
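
For reference, a quick way to check what PIO actually used is to search the coupler log; this is a minimal sketch, assuming the log sits in the run directory referred to as $rundir above and that PIO echoes its task/stride settings there (the exact wording of those lines varies by version):

 # Inspect the PIO summary near the top of the most recent coupler log
 cd $rundir
 grep -i pio cpl.log.* | head -40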

tangq commented 8 years ago

So, which settings do I need to change? Just PIO_NUMTASKS=8, or PIO_STRIDE as well? I am not familiar with these settings and usually don't change them.

jayeshkrishna commented 8 years ago

Please change both. (Note that the stride is preserved even if the number of I/O tasks is wrong for a component.)

worleyph commented 8 years ago

@jayeshkrishna , can @tangq just set PIO_STRIDE to -1 and let PIO_NUMTASKS determine the stride? (Is that how -1 works in this case?)

mt5555 commented 8 years ago

Hopefully this is fixed by now, but in ACME v0 that would only work if all components used the same number of MPI tasks. If a component used fewer than the total available MPI tasks, a bug in the logic would force that component back to the system defaults for stride and numtasks.

jayeshkrishna commented 8 years ago

Setting PIO_NUMTASKS alone should work (it is supposed to), and if it doesn't I can fix it. AFAICR, PIO resets the settings if the PIO_NUMTASKS/PIO_STRIDE combination is not valid for a component.

worleyph commented 8 years ago

@jayeshkrishna , my concern is that the default layout is setting stride to 24, not to -1. Does @tangq need to reset stride to -1?

jayeshkrishna commented 8 years ago

@tangq, please set PIO_STRIDE=-1 and PIO_NUMTASKS=8 and check the logs to make sure it is working as you expect. If not, I will fix it for you (I have always set both the tasks and the stride in my experiments, just out of old habit). This should work (no PIO reset to default settings) as long as all components have at least 8 MPI tasks.
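
For reference, a minimal sketch of those two changes, assuming the classic xmlchange form used by the scripts of this era and run from $casedir (the exact invocation may differ by tag):

 # Use 8 I/O tasks and let PIO derive the stride per component
 ./xmlchange -file env_run.xml -id PIO_STRIDE -val -1
 ./xmlchange -file env_run.xml -id PIO_NUMTASKS -val 8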

mt5555 commented 8 years ago

@jayeshkrishna : hopefully this is fixed now, but for the record, in the CESM1.2 days, with PIO_NUMTASKS=8 and PIO_STRIDE=-1 the stride would be computed for every component from the total number of MPI tasks rather than from the number of MPI tasks used by that component, producing bad values if a component ran on a subset of the MPI tasks.

jayeshkrishna commented 8 years ago

This behaviour is definitely a bug, and I can try some cases later today to reproduce it. I haven't changed this logic, so if the bug existed in CESM 1.2 it might still be there in the current version of the code (@tangq : you might be better off setting both the stride and numtasks, as I do for my runs, since Mark encountered a bug when setting only numtasks with an older version of the code).

worleyph commented 8 years ago

@jayeshkrishna , should @tangq set both stride and numtasks for each component separately, or will the global settings be sufficient?

jayeshkrishna commented 8 years ago

The global settings should work (if all the components run with the same number of tasks; otherwise the stride is preserved).

tangq commented 8 years ago

I used the same case:
$casedir: /scratch2/scratchdirs/tang30/ACME_simulations/20160419.A_WCYCL2000.ne30_oEC.edison.alpha4_00/case_scripts
$rundir: /scratch2/scratchdirs/tang30/ACME_simulations/20160419.A_WCYCL2000.ne30_oEC.edison.alpha4_00/run

and changed both PIO_NUMTASKS=8 and PIO_STRIDE=8 for a 5-day test run writing restart files every day. The same problem occurred on the 4th day, but restart files were created successfully on the 2nd and 3rd days.

jayeshkrishna commented 8 years ago

I am looking into this; it could be a bug in PIO or OCN. (The last time I saw a similar issue, OCN was multithreaded.)

tangq commented 8 years ago

Thanks @jayeshkrishna . I also noticed that the model can die while different components are creating their restart files. In the latest test only one restart file was created, but in the second case at the top of this page more than one restart file was created. This may imply that more than one component triggers the problem.

jayeshkrishna commented 8 years ago

@tangq : I am currently trying to recreate this issue. Do you have a list of commands to set up the case (I looked at the attached script, but it is too long and I would like to set the case up manually)? I followed the steps below to run my case, and that case ran successfully (did not crash).

worleyph commented 8 years ago

@jayeshkrishna , will you be using scratch2? It looks like @tangq is using scratch2 for the run directory. You'd hope this would not matter, but it is a variable that should be examined.

jayeshkrishna commented 8 years ago

Thanks @worleyph , my $SCRATCH points to scratch2 (run dir is in scratch2).

mt5555 commented 8 years ago

@jayeshkrishna : NUMTASKS=8 was something I suggested, based on the (possibly outdated) experience that using fewer I/O nodes is more reliable. NUMTASKS=8 did seem more reliable (25% failure rate instead of 70%), but you might want the higher values to get more failures to debug :-)

jayeshkrishna commented 8 years ago

Thanks. Yes, I want to start with a config as close to @tangq 's as possible before changing the case details.

tangq commented 8 years ago

@jayeshkrishna , do you want to use the same PE layout as my run? If so, you'll need "-pecount S" when creating the case. I was not aware of bug #861 , so I didn't change CLM_BLDNML_OPTS. Apart from these two changes, you should be able to reproduce my run with the steps listed above.

jayeshkrishna commented 8 years ago

Thanks. The bug I mentioned ( #861 ) seems to have been introduced after the tag you are working on, so it would not have affected you.

bishtgautam commented 8 years ago

#861 would cause the model to crash during land initialization. At this point, I don't know of any valid compset + resolution + existing land IC file combination where #861 manifests itself.

jayeshkrishna commented 8 years ago

@tangq : Another run with "v1.0.0-alpha.4-67-gd02c96d" and "-pecount S" ran successfully. Can you try running a case manually (without using the scripts, i.e. using the steps I mentioned above plus "-pecount S") with v1.0.0-alpha.4-67-gd02c96d (or the latest master) and see if you get the same error?

jayeshkrishna commented 8 years ago

@tangq : For the first case (20160401.A_WCYCL2000.ne30_oEC.edison.alpha4_00) provided in the issue description, the most likely cause of the failure is an error in OCN. Below is the last line in /scratch2/scratchdirs/tang30/ACME_simulations/20160401.A_WCYCL2000.ne30_oEC.edison.alpha4_00/run/ocn.log.160403-220048:

 ERROR: mpas_get_timeInterval cannnot return time interval information for an interval containing months and years without a startTimeIn argument.

However, for the latest case (20160419.A_WCYCL2000.ne30_oEC.edison) I don't see that error in the OCN logs. This case appears to have failed while doing timestep 0001-01-03_23:30:00 (the last line in /scratch2/scratchdirs/tang30/ACME_simulations/20160419.A_WCYCL2000.ne30_oEC.edison.alpha4_00/run/ocn.log.160421-112443). From the logs it looks like OCN had 3 more timesteps to perform before doing I/O.

Initially I thought the error could be in PIO, because both failed cesm logs have PIO debug statements toward the end of the file (OCN trying to write an array of characters, possibly xtime) and the failure appears to happen during restart writes. But I am not sure of that anymore.

@douglasjacobsen : Please take a look at this issue and let me know your comments.

tangq commented 8 years ago

@jayeshkrishna , it seems that the version "v1.0.0-alpha.4-67-gd02c96d" does NOT have the restart problem.

I did two tests writing out daily restart files (one using the single run script, the other using the manual steps). Both tests successfully wrote 20 restarts without a single failure. So this probably means that the restart problem doesn't exist in this code version, that the run script does not cause the problem, and that some code change between "v1.0.0-alpha.4-67-gd02c96d" and "v1.0.0-alpha.4-52-g8c0a13d" fixes it.

douglasjacobsen commented 8 years ago

@jayeshkrishna That error is typically non-fatal. Most of the time you can safely ignore it.

PeterCaldwell commented 8 years ago

@tangq just told me that the restart issue has reappeared in the most recent v1.0.0-alpha.5 tag, checked out 2016/04/28. His most recent coupled simulations (which bundle both 1850 and 2000 conditions) failed to write restarts for ocn and cpl during year 34, then failed again in the same place when he restarted the simulations. He has a 3rd attempt to restart waiting in the queue. Note that he completed 34 years of simulation, with restarts written every year at first and more recently every 6 months, before this issue showed up... so our choices are that Edison was having a bout of good health that did not last, that this is a random event with a low probability of occurrence, or that the issue is related to the model state. In any case, Qi's statement above that this problem has been fixed is incorrect.

Information about these runs is available at https://acme-climate.atlassian.net/wiki/display/SIM/20160428.A_WCYCL2000.ne30_oEC.edison.alpha5_00 for year 2000 and https://acme-climate.atlassian.net/wiki/display/SIM/20160428.A_WCYCL1850.ne30_oEC.edison.alpha5_00 for 1850. Note that Qi did not set PIO_NUMTASKS in the run script for either of these simulations, and the cpl logs show a whopping 200 PIO tasks for some components and 225 for others. Should we always be setting PIO_NUMTASKS for our runs? Is there a rule of thumb to use for this?

jayeshkrishna commented 8 years ago

@PeterCaldwell : A PIO_NUMTASKS of 200+ seems high; let me look at the case and get back to you. I thought the scripts (tangq's scripts) set the tasks and strides to lower values. We can add specific rules in config_pes for this case if needed, so it runs out of the box with a specific number of I/O tasks. AFAICT there is no clear rule of thumb for setting these PIO parameters for an entire simulation (try various options and see what works for the simulation as a whole), but keeping the number of I/O tasks <= 100 has worked in the past. Let me look at the case files before giving you more information on setting the PIO parameters. @ndkeen : an Edison intermittent I/O issue that might be of interest to you.

ndkeen commented 8 years ago

I can't seem to find an error message. How do we know something went wrong?

PeterCaldwell commented 8 years ago

"How do we know something went wrong?" - sorry, I forgot to mention that Qi said the runs stalled out for several hours on the rewrite step and he killed the jobs. That's why there are no error messages. I should also mention that I'm reporting this issue on behalf of Qi because Qi is at a conference with bad wifi and is leaving later today(?) for China. Because I'm the messenger rather than the actual complainant, my ability to answer questions may be limited...

jayeshkrishna commented 8 years ago

@PeterCaldwell : Do you have the path to the case directory (Is it /scratch2/scratchdirs/tang30/ACME_simulations/20160428.A_WCYCL2000.ne30_oEC.edison.alpha5_00/case_scripts/)?

ndkeen commented 8 years ago

Could I just have instructions on how to reproduce this? Preferably the simplest/smallest/shortest problem that exhibits the failure.

PeterCaldwell commented 8 years ago

@jayeshkrishna - yes, you have the right case directory. Sorry that the webpages refer to the "short-term archive" location when they mean the root work directory.

PeterCaldwell commented 8 years ago

@ndkeen - I suspect the essential ingredients of a test case would be:

  1. a coupled run rather than an atmos-only run
  2. writing restarts frequently to maximize opportunities for failure
  3. doing heavy IO at the same time that other codes are also doing lots of IO.

I've attached a version of Qi's run script which I modified to run only 20 days and to write restarts and atm history files every day. I set it to run in the debug queue (grep for debug_queue). I don't know how to change the frequency of history files for the other components, but making them daily might help as well. Changing the frequency of writing initial-condition files to daily might also be helpful, but I don't know how to do that off the top of my head either. I'm currently building this test case, so I can't promise it works yet.

Things to tinker with: Can you reduce the frequency of failure by decreasing PIO_NUMTASKS? Can you increase the probability of failure by running several copies of this code concurrently on nodes which share PIO bandwidth?

run_acme.alpha_20160505.A_WCYCL2000.edison.RestartTest.csh.txt

ndkeen commented 8 years ago

I ran the script that Peter provided. I didn't change anything. It ran for 20 days on Edison. How can I tell whether something went wrong? Here are the per-day timestamps; each day's dt looks reasonable compared to the others.

 tStamp_write: model date =    10102       0 wall clock = 2016-05-05 15:22:03 avg dt =    97.75 dt =    97.75
 tStamp_write: model date =    10103       0 wall clock = 2016-05-05 15:23:15 avg dt =    84.73 dt =    71.71
 tStamp_write: model date =    10104       0 wall clock = 2016-05-05 15:24:28 avg dt =    80.79 dt =    72.91
 tStamp_write: model date =    10105       0 wall clock = 2016-05-05 15:25:44 avg dt =    79.73 dt =    76.52
 tStamp_write: model date =    10106       0 wall clock = 2016-05-05 15:26:59 avg dt =    78.64 dt =    74.28
 tStamp_write: model date =    10107       0 wall clock = 2016-05-05 15:28:10 avg dt =    77.39 dt =    71.18
 tStamp_write: model date =    10108       0 wall clock = 2016-05-05 15:29:24 avg dt =    76.96 dt =    74.33
 tStamp_write: model date =    10109       0 wall clock = 2016-05-05 15:30:41 avg dt =    76.97 dt =    77.06
 tStamp_write: model date =    10110       0 wall clock = 2016-05-05 15:31:54 avg dt =    76.53 dt =    73.01
 tStamp_write: model date =    10111       0 wall clock = 2016-05-05 15:33:12 avg dt =    76.62 dt =    77.47
 tStamp_write: model date =    10112       0 wall clock = 2016-05-05 15:34:25 avg dt =    76.31 dt =    73.19
 tStamp_write: model date =    10113       0 wall clock = 2016-05-05 15:35:39 avg dt =    76.12 dt =    74.04
 tStamp_write: model date =    10114       0 wall clock = 2016-05-05 15:36:57 avg dt =    76.31 dt =    78.61
 tStamp_write: model date =    10115       0 wall clock = 2016-05-05 15:38:16 avg dt =    76.48 dt =    78.70
 tStamp_write: model date =    10116       0 wall clock = 2016-05-05 15:39:34 avg dt =    76.55 dt =    77.55
 tStamp_write: model date =    10117       0 wall clock = 2016-05-05 15:40:48 avg dt =    76.40 dt =    74.14
 tStamp_write: model date =    10118       0 wall clock = 2016-05-05 15:42:01 avg dt =    76.21 dt =    73.11
 tStamp_write: model date =    10119       0 wall clock = 2016-05-05 15:43:17 avg dt =    76.22 dt =    76.33
 tStamp_write: model date =    10120       0 wall clock = 2016-05-05 15:44:39 avg dt =    76.48 dt =    81.24
 tStamp_write: model date =    10121       0 wall clock = 2016-05-05 15:45:57 avg dt =    76.56 dt =    78.08

PeterCaldwell commented 8 years ago

You would know if the issue we are trying to debug was occurring because the run would stall out for hours during one of these write steps. You didn't see such behavior, so you didn't capture the problem.

Remember that Qi's run had written O(80) restart files before any bad behavior showed up... so it is unsurprising that this first test case failed to reproduce the problem. This problem is stochastic in nature, so it is hard to say how long you need to run before you reproduce it. I chose a 20-day simulation in order to make use of the debug queue, but perhaps you should try running for a few hundred days (for example by using the auto-resubmit feature to chain together a bunch of short runs in the debug queue, as sketched below)? You could also try increasing the number of PIO tasks, since folks hypothesize that having too many tasks trying to write could be a source of trouble. Creating a situation where several instances of this code are writing restart files all at the same time may also accelerate your time-to-failure, since another hypothesis is that heavy I/O usage by other simultaneously running codes is causing this slowness.
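
For the resubmit approach, here is a minimal sketch using the standard env_run.xml controls; it assumes the usual behavior where the run script flips CONTINUE_RUN to TRUE after the first segment, and the exact invocation may differ by tag:

 # 20-day segments that fit in the debug queue, restarts every day,
 # resubmitted 10 times to accumulate a few hundred days in total
 ./xmlchange -file env_run.xml -id STOP_OPTION -val ndays
 ./xmlchange -file env_run.xml -id STOP_N -val 20
 ./xmlchange -file env_run.xml -id REST_OPTION -val ndays
 ./xmlchange -file env_run.xml -id REST_N -val 1
 ./xmlchange -file env_run.xml -id RESUBMIT -val 10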

Debugging stochastic problems is annoying...

bishtgautam commented 8 years ago

A while back, a few of my simulations with land-only compsets were running out of memory when restart files were written frequently. This was due to a memory leak. The issue was reported to NCAR (see http://bugs.cgd.ucar.edu/show_bug.cgi?id=2029) and was fixed in the land model, but maybe a similar memory leak is occurring in other components.

Relevant comment from @billsacks on the bug discussion:

From the mct documentation for OrderedPoints:

"Points is allocated here. The calling process is responsible for deallocating the space."

tangq commented 8 years ago

My last job submission failed the first time it wrote restarts. It also stalled in the ocn and drv components (the restart files for these two components were not created). @ndkeen - one way to increase the chance of reproducing this problem is to bundle more runs together. In my run, the 1850 and 2000 cases are bundled, so they write restarts at the same time (doubling the I/O).

@PeterCaldwell - until this problem is fixed, we can set the restart frequencies and run lengths differently for the two runs so they do not write restarts at the same time, to minimize the probability of hitting this problem. For example, make one case a 14-year run writing restarts every 7 years and the other a 15-year run with 5-year restarts, then swap the settings for the two cases in the next submission (we have to waste some allocation to avoid writing restarts together at the end of the runs); see the sketch below. Hopefully this can help extend the runs further.
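
A sketch of that staggering for the first submission, assuming the era's xmlchange syntax and run from each case's own $casedir (swap the two sets of values for the next submission):

 # Year-2000 case: 14-year run, restarts every 7 years
 ./xmlchange -file env_run.xml -id STOP_OPTION -val nyears
 ./xmlchange -file env_run.xml -id STOP_N -val 14
 ./xmlchange -file env_run.xml -id REST_OPTION -val nyears
 ./xmlchange -file env_run.xml -id REST_N -val 7

 # 1850 case: 15-year run, restarts every 5 years
 ./xmlchange -file env_run.xml -id STOP_OPTION -val nyears
 ./xmlchange -file env_run.xml -id STOP_N -val 15
 ./xmlchange -file env_run.xml -id REST_OPTION -val nyears
 ./xmlchange -file env_run.xml -id REST_N -val 5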

PeterCaldwell commented 8 years ago

@tangq 's last note about restarts failing in the same spot 3x in a row makes me think this isn't a stochastic problem and is instead due to the model falling into a bad state from which it can't write restarts. Unfortunately, the failures are in ocn and drv and I'm not an expert with these components. Some tasks which might help get to the bottom of this:

  1. Could we get someone from the ocn and cpl teams to look at whatever output we have from the model in the months before this error and assure us the model hasn't gone into a weird state?
  2. @tangq - can you restart the model one last time, but instead of writing restarts every 6 mo, try to write the restart every day? My hope(?) is that the model will fail on this restart too, which would provide @ndkeen with an efficient test case to tinker with. I'd suggest starting by just running the yr 2000 case instead of bundling 1850 and 2000 in order to reduce confusion and core hours. If that case doesn't fail, then try again bundled.
  3. @ndkeen - Can you copy over the files from Qi's run and try to run it yourself (with restart frequency increased to daily)?
ndkeen commented 8 years ago

I don't know how to take a case directory and run it myself. Do you have a create_newcase command I can try? Apparently there is a create_clone command, but I haven't tried it and am not sure how to tell if it's doing what we want.

jayeshkrishna commented 8 years ago

create_clone is supposed to (and was created to) work for your case. If it does not, we need to fix it.
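
For reference, a minimal sketch of the usual invocation from the scripts directory; the new case path here is just a placeholder, and the exact flags may differ by tag:

 # Clone Qi's case into a new case directory, then set up, build, and submit as usual
 ./create_clone -case /scratch2/scratchdirs/$USER/20160428.restart_debug \
     -clone /scratch2/scratchdirs/tang30/ACME_simulations/20160428.A_WCYCL2000.ne30_oEC.edison.alpha5_00/case_scripts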