worleyph commented 8 years ago

This, hopefully, is a transitory issue. However, the current discussion is buried across a few different Confluence pages and in e-mails with NERSC help (though we have yet to receive a response from NERSC).

Others have seen different, but all bad, performance issues. The initial indication was that I/O was very slow (which has happened on Edison before), but I am seeing everything being very slow.

Examples:

-compset FC5AV1C-01 -res ne30_oEC

a) 1 day for a 58 node PE layout (April 2):

  CPL:INIT                          40.992325
  CPL:RUN_LOOP                      53.270420
    CPL:ATM_RUN  -      48    -     44.915367     2.218216     0.782414

then today (April 5)

  CPL:INIT                          985.659424
  CPL:RUN_LOOP                     1360.697510
    CPL:ATM_RUN  -      48    -     988.505676    37.936245    12.435209

b) 2 days for a 684 node PE layout (April 3)

  CPL:INIT                         149.158356
  CPL:RUN_LOOP                      27.896242
    CPL:ATM_RUN  -      96    -     22.574593     2.544883     0.158755

then today

  CPL:INIT                           786.385925
  CPL:RUN_LOOP                      2449.291016
    CPL:ATM_RUN  -      96    -     1506.443604    39.947269     2.608297

This is very persistent (and seems to be getting worse over the last 2 days, if anything). I can provide more details, and will look through the data to see if anything pops up of interest. In any case, Edison is so slow now that it is practically unusable at the moment.

ndkeen commented 8 years ago

I've tried several more jobs (longer running [~50 days], smaller number of cores, different version of the code, with/without profiling) and still can't seem to get a "slow day".

worleyph commented 8 years ago

And I can't get a fast day. @ndkeen , are you using master? Where are you building? Where are you running? Please point me to one of your "fast" cases and I will try to duplicate exactly what you did. My well-defined slow/fast dates could imply that something changed in master, but @goalz also tried with an old executable, I believe, and did not recover "fast" performance. (I don't see any commits to master around that time that would make sense, but that is not conclusive.)

worleyph commented 8 years ago

I repeated one of @ndkeen 's fast cases - same file system for source, build, and run, fresh checkout, his pe layout, etc. and was still 10 times slower. I then did a diff of the case directories, and the only differences were that I used the debug partition (deliberate, and hopefully innocuous) and I had

7) darshan/2.3.1

loaded, and he did not. This was not deliberate on my part, and seems to be the default. I am rerunning with darshan unloaded (not recompiling yet), and then will recompile without darshan. I don't remember when darshan "happens".

ndkeen commented 8 years ago

Yes, I considered this last night as well and ran an experiment WITH darshan loaded, but did not see any differences in the performance. So I don't want to suggest someone try an experiment unless I can verify it is important. The darshan module is collecting IO data when it is loaded. It collects the data for general stats used by NERSC. I always turn it off as it can only slow things down. If someone wanted to try, you can simply unload it from your env and re-run. Check software_environment.txt to verify it is gone.

It could still be that I'm simply not running long enough. Last night I submitted a job to run for 500 days and increased the walltime, but it is still in Q.

Pat, can you point me to the directory where you ran?

worleyph commented 8 years ago

My runs are all 5 day or less, so length of run is not the issue. My "slowness" shows up immediately, and without fail.

@ndkeen, you also had some modules loaded that I did not, but most (all?) of these appear irrelevant.

3) eswrap/1.1.0-1.020200.1130.0 8) git/2.4.6 9) mercurial/3.2.4 10) ncl/6.1.1 11) python_base/2.7.9 12) numpy/1.9.2 13) scipy/0.15.1 14) matplotlib/1.4.3 15) ipython/3.1.0 16) python/2.7.9

ndkeen commented 8 years ago

Hmm, well looking at the data from Golaz, I see that there are many days that are "normal" performance, then there can sharp slow days. It would be good to make sure we are looking at the same perf data to indicate when it is slow or not. I wanted to see the tStamp lines in your cpl.log file.

tangq commented 8 years ago

@ndkeen , @golaz , @worleyph , I changed the max number of tasks per node to 24 for my slow simulations (153 nodes) shown in the first figure above. This run is fast, completed 3 years and 1 month in 14.5 hours - that is 5.1 SYPD. A throughput of 5.8 SYPD is expected with MAX_TASKS_PER_NODE=48 (https://acme-climate.atlassian.net/wiki/display/CH/PE+layouts+for+faster+throughput+with+low-res+v1+alpha+coupled).

It is still running. Let's see how far it goes.

$casedir: /global/homes/t/tang30/scratch2/ACME_simulations/20160401.A_WCYCL2000.ne30_oEC.edison.alpha4_00H/case_scripts

$rundir: /global/homes/t/tang30/scratch2/ACME_simulations/20160401.A_WCYCL2000.ne30_oEC.edison.alpha4_00H/run

worleyph commented 8 years ago

I changed the max number of tasks per node to 24

I've been running both ways (24 and 48), and haven't noticed much of a difference in slow vs. fast.

tangq commented 8 years ago

@worleyph , does that mean that MAX_TASKS_PER_NODE doesn't impact hyper-threading? Otherwise, we should see difference.

Is my fast run just being lucky?

worleyph commented 8 years ago

does that mean that MAX_TASKS_PER_NODE doesn't impact hyper-threading?

It does impact hyper-threading, and there is a difference, but at most a factor of two - not this factor of ten or more that defines slow vs. fast runs. I think that you are just being lucky. In any case, both with and with hyperthreading were fast before sometime starting this weekend (in my experience - I don't know when slow runs first started appearing for @golaz or @cameronsmith1 or you or ...).

ndkeen commented 8 years ago

I see that Golaz reported above that this 48->24 change did not impact his results, and it also seems like Pat has tried. I have made all of my runs with 24 as this is what we are going to (checked into next now).

I think it is odd that Pat seems to see slow days immediately and every time, while others do not.

What is the smallest size (and num cores) that has been run that demonstrates the problem? We could try running a couple of 24-hour runs with small sizes. If they ran at the same time, then we could make a plot that shows exactly when slowdowns are happening. If a slowdown occurs at the 7:05pm for both runs, it would point to a system problem or an "aggressor app". We could even gather up all the data we have now over all users and do a scatter plot -- date/time on x-axis, and number of seconds to simulate a day on Y axis.

tangq commented 8 years ago

My slow runs first occurred in middle March. Some of the slow runs hung for a long time or died.

ndkeen commented 8 years ago

Hi @tangq, just to clarify, when you judge if a run is slow or not, you are looking at the tStamp lines in the cpl.log, correct? And the "normal" value of dt would be dependent on the resolution, PE layout, threads, etc, however, we are concerned about this value jumping from a normal value to a large value (~10x) during simulation. ?

worleyph commented 8 years ago

I had not realized that. "Funny" that the frequency of slow runs has increased. Perhaps this is some new job or user who is trashing the system when they run, and is running more often now. I did not think that the Aries interconnect was particularly sensitive in that way.

tangq commented 8 years ago

Hi @ndkeen , it is "yes" to all your questions. Note that for the fast run (red line in the second figure above), some abnormally slow days can happen. But slow days are more frequent in the slow run.

ndkeen commented 8 years ago

Pat: How do you know your runs are slow?

worleyph commented 8 years ago

For the most part, I am rerunning my prior benchmarks, from a few weeks ago when I was looking at PE layouts for the atmosphere and coupled experiment group. So, there are some expectations. Also, ACME is essentially determinstic algorithmically (being fully explicit except in the land ice) - the realtive cost of each timestep is (or can be) known. Only significant variability should be in I/O and MPI communication overhead (which also should be deterministic, but not with interference from other users).

ndkeen commented 8 years ago

Pat, can you allow me to access your run directory or tell me the values of your tStamp lines in cpl.log?

worleyph commented 8 years ago

Done. /scratch1/scratchdirs/worleyph/acme_scratch/FC5AV1C-L.ne30_oEC.edison_intel_ndkA

cameronsmith1 commented 8 years ago

There seem to be three aspects that people are referring to as 'slowdowns': 1) The time for an individual day (or month) can spike high (and this variability is often clustered). 2) The time for each day (or month) is steady, but much higher than a previous run. 3) The time for each day (or month) is steady, then suddenly jumps to a new steady rate during the run. Chris's upper plot from yesterday (above) shows all three issues. I have been assuming these are related to a single cause, but that doesn't have to be the case. I heard about the problem about 3 weeks ago, and then I experienced it when I started using Edison about 2 weeks ago.

worleyph commented 8 years ago

Update - I've run three times with the same compset and PE layout as @ndkeen , once with darshan loaded , once with darshan unloaded but with the original executable, and once with darshan unloaded when rebuilding and running. (Note that the executable size was different in this last instance.) All three runs had almost identical (slow) performance. Solution? Have @ndkeen run all jobs on Edison. (I'm not sure that I am kidding either. Perhaps @ndkeen should take over for at least one of @golaz 's runs, just to see what happens, and to get some work done.) P.S. - Nice summary @cameronsmith1

cameronsmith1 commented 8 years ago

This may be a coincidence, but my simulation overnight died complaining about memory corruption:

* glibc detected * /scratch2/scratchdirs/pjc/ACME_simulations/FC5AV1C-01_fix.FC5AV1C-01.ne30_ne30.edison.FC5AV1C-01_114nodes_p3fix_nooutput/build/cesm.exe: free(): corrupted unsorted chunks: 0x0000000011d76010 ***

Could this be related to the file corruption reported by Po-Lun and Phil Rasch?

Separately, I was trying out the sacct command, and I looked at the swap pages being used. I don't have much experience with what is normal, but it seemed to me that the swapping was large and variable between the nodes for my recent runs.

rljacob commented 8 years ago

My email says there will be a full-system reservation of Edison starting this Sunday and they "apologize for the short notice and are grateful for your patience in this time leading up to SC16 paper deadlines!" That SC16 deadline might explain why this is happening now, suggest one or more purposefully bad-behaving apps are responsible, and that things might get better after the deadline Sunday.

jayeshkrishna commented 8 years ago

I had been getting the "glibc corruption error" that @cameronsmith1 noted for all my 684_node + "-compset FC5AV1C-01 -res ne30_oEC" runs the past week on Edison (This is the error that I mentioned in confluence page related to this Issue - https://acme-climate.atlassian.net/wiki/pages/viewpage.action?pageId=57311734 ). However I was able to get other tests (unrelated to this issue) run on Edison last week without any errors (Are these errors possibly related to some partition - disk/compute - allocated to jobs on Edison?). Also, all the runs that succeeded for me on Edison last week were run on the debug queue (The crashed ones were run on regular queue).

ndkeen commented 8 years ago

Unfortunately, I still have no good news. I have continued to try experiments in an attempt to resolve or at least repeat the problem. I typically only run for 5,10,or 50 days. I ran one job for 424 days (headed to 500) and I see one blip in the timing (see figure). I can explain what I've tried so far, but I need to find a way to reproduce the problem in a semi-consistent way or at least much quicker than this. As the job size grows, the chances of getting in the queue diminish... yada yada.

timing The above plot was from data produced by job 322054.

Certainly if the reports from others above regarding memory and file corruption are related to this slowdown issue, then we (I) can attack it from that angle as well. I just have not seen it yet.

I have 3 experiments in the queue now and am thinking of others to try.

ndkeen commented 8 years ago

This is the command I use to create the case: ./create_newcase -case FC5AV1C-L.ne30_oEC.edison_intel_58 -compset FC5AV1C-L -res ne30_oEC -mach edison -compiler intel -project acme

And then I have been grabbing the 10 and 58 node env_mach_pes from: https://acme-climate.atlassian.net/wiki/pages/viewpage.action?pageId=52429184

However, I modify them to use 24 cores per node, ie change the 48's to 24 in this: <entry id="MAX_TASKS_PER_NODE" value="48" /> <entry id="PES_PER_NODE" value="48" />

(This means that effectively my 2 different sizes of runs on edison are 20 and 116 nodes )

Assumptions: I assume this is what Pat been trying as well. Pat seems to have all slow days, every run. It is true that what Golaz has been running is different, but I assume the case Pat and I are running is capturing the same problem. I also assume it is OK to run the create_new_case above, but then copy over the "10 node env_mach_pes" file.

a) Since it seems that every day is slow for Pat, I was hoping he (or someone) can repeat this again? Perhaps change env_run to go for 10 days instead of 5? Or longer, just need more data, but don't want to sit in queue... Also, if you could capture the contents of the environment that would be great. In env_mach_specific there is a line module list >& software_environment.txt and you could add env >>& software_environment.txt

b) In attempt to get more smaller jobs thru the Q, does someone know what I can change to use even fewer nodes? From the env_mach_pes.xml file, I see:

`'''

'''`

Can I read this as: there are 480 cores used total, on 450 of them ATM is running, on 288 of them LAND is running, etc.. ? That is there is overlap on some nodes, but not others.
It's not clear how to adjust these. Can I simply reduce the values of all NTASKS and rerun cesm_step (after cleaning)? This may already be near the memory limit, I'm sure.

Again, I have several longer experiments in the Q now.

golaz commented 8 years ago

@ndkeen, looks like you were running the atmosphere only model while @tangq and I were running the coupled model (A_WCYCL2000 compset).

helenhe40 commented 8 years ago

I checked user reports on Edison slowness from last week, and it seems these reports came from CESM, ACME, and WRF runs, but not other applications. The main assumption from NERSC now is there are some aggressor applications running in the past 2 weeks or so that affect these highly network sensitive type of applications.

Last spring when large CESM performance variations were reported on Edison, we were able to identify the aggressor application to be "DGDFT" that has the characteristics of large MPI_Alltoall spanning multiple groups while using small pages. A general recommendation from Cray "to use large pages for all applications and reboot nodes when memory fragmentation reaches certain threshold level" is not practical for us since the reboot is not fast enough.

The plan is to check against applications running on Edison at the same time when slow runs were reported. From Pat, we have the following job ids:

JobId=311413 was slow JobId=310901 was slow JobId=310747 was slow JobId=303364 was slow

JobId=302649 was fast

Others can add more job ids to the above list please.

Another possible cause of the slowness may come from memory allocation on the nodes being gated by Lustre operations. This is seen in another application this morning for a large data set. We will be checking this possibility by monitoring CESM jobs running in real time.

I have submitted a job (job id: 329405) in the queue using the default PE layout that comes with the "master" version that I just got from github this afternoon, freshly built executable. "env_mach_pes.xml" has the following:

The SLURM run script has:

SBATCH --job-name=FC5AV1C-L.ne30_oEC.edison_intel_58

SBATCH --nodes=80

SBATCH --ntasks-per-node=12

SBATCH --output=FC5AV1C-L.ne30_oEC.edison_intel_58

SBATCH --exclusive

SBATCH --time=01:15:00

SBATCH --partition=regular

SBATCH --account=acme

qx( srun --label -n 960 -c 4 $config{'EXEROOT'}/cesm.exe >> cesm.log.$LID 2>&1 );

Could someone please enable me to access the wiki pages at https://acme-climate.atlassian.net so I can grab the 58 and 684 node layouts as Pat suggested? I will do more tests, and also forward Noel's job ids in the queue to Edison system people.

Helen

On Mon, Apr 11, 2016 at 1:30 PM, noel notifications@github.com wrote:

This is the command I use to create the case: ./create_newcase -case FC5AV1C-L.ne30_oEC.edison_intel_58 -compset FC5AV1C-L -res ne30_oEC -mach edison -compiler intel -project acme

And then I have been grabbing the 10 and 58 node env_mach_pes from:

https://acme-climate.atlassian.net/wiki/pages/viewpage.action?pageId=52429184

However, I modify them to use 24 cores per node, ie change the 48's to 24 in this:

(This means that effectively my 2 different sizes of runs on edison are 20 and 116 nodes )

Assumptions: I assume this is what Pat been trying as well. Pat seems to have all slow days, every run. It is true that what Golaz has been running is different, but I assume the case Pat and I are running is capturing the same problem. I also assume it is OK to run the create_new_case above, but then copy over the "10 node env_mach_pes" file.

a) Since it seems that every day is slow for Pat, I was hoping he (or someone) can repeat this again? Perhaps change env_run to go for 10 days instead of 5? Or longer, just need more data, but don't want to sit in queue...

b) In attempt to get more smaller jobs thru the Q, does someone know what I can change to use even fewer nodes? From the env_mach_pes.xml file, I see:

Can I read this as: there are 480 cores used total, on 450 of them ATM is running, on 288 of them LAND is running, etc.. ? That is there is overlap on some nodes, but not others.

It's not clear how to adjust these. Can I simply reduce the values of all NTASKS and rerun cesm_step (after cleaning)? This may already be near the memory limit, I'm sure.

Again, I have several longer experiments in the Q now.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/ACME-Climate/ACME/issues/820#issuecomment-208545135

tangq commented 8 years ago

My slow job IDs: 261879 296587 302963

The only fast job I have: 301305

ndkeen commented 8 years ago

I have been running various experiments, but it looks like I still only have 1 abnormally slow day

job 322054 ran 424 days shows only 1 day with erratic timing

All other jobs seem to be OK: job 313667 ran 163 days job 321950 ran 50 days job 322036 ran 50 days job 326021 ran 5 days job 325483 ran 5 job 320763 ran 43 days job 322041 ran 42 days job 313455 ran 5 days job 329489 ran 5 days

several jobs that finished normally, show that the last day was more expensive (additional writing)?

helenhe40 commented 8 years ago

Thanks for the job ids you provided.

With the default PE layout, using 80 nodes, run for 5 days, it does not seem to be slow for me? On Called Recurse Wallclock max min UTR Overhead CPL:INIT - 2 - 60.130451 59.116833 1.013621 0.000000 CPL:RUN_LOOP - 240 - 1925.859009 81.392860 0.886180 0.000015 CPL:ATM_RUN - 240 - 732.215576 69.053459 0.620120 0.000015

I have the 58 nodes layout job in the queue now.

Helen

tangq commented 8 years ago

@ndkeen , @helenhe40 , it seems that you both are testing with the F (atm only) compset. Can you try a coupled run? My recent slow run can be reproduced with the attached script.

I am also testing the F compset and the throughput seems normal (it completed 3 months) so far.

run_acme.alpha_20160401.A_WCYCL2000.edison_half.txt

helenhe40 commented 8 years ago

The new job in the queue I have has the default env_mach_pes.xml swapped with the 58 node layout.

What else do I need to do to change the ATM only run to a coupled run? Do I need to create a new case using "-compset FC5AV1C-01 -res ne30_oEC" instead of the compset provided by Pat's example below? ./create_newcase -case FC5AV1C-L.ne30_oEC.edison_intel_58 -compset FC5AV1C-L -res ne30_oEC -mach edison -compiler intel -project acme

Helen

tangq commented 8 years ago

@helenhe40 , you'll need -compset A_WCYCL2000 -res ne30_oEC to create a coupled simulation. Or you can use my script above.

helenhe40 commented 8 years ago

Will do. Thanks!

With -compset A_WCYCL2000 -res ne30_oEC, do I just use the default PE layout, or do you suggest to use some other layouts such as the 58 or 864 node layout?

Helen

On Mon, Apr 11, 2016 at 5:35 PM, tangq notifications@github.com wrote:

@helenhe40 https://github.com/helenhe40 , you'll need -compset A_WCYCL2000 -res ne30_oEC to create a coupled simulation. Or you can use my script above.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/ACME-Climate/ACME/issues/820#issuecomment-208641123

tangq commented 8 years ago

I used the attached PE layout (153 nodes) created by Pat, which is expected to have a throughput of 5.8 SYPD. You will need to replace the env_mach_pes.xml with this file before cesm_setup. The PE layout is already hard-coded in my script.

env_mach_pes.xml.txt

cameronsmith1 commented 8 years ago

Hi Helen, In case it isn't obvious, the script @tangq provided above (run_acme.alpha_20160401.A_WCYCL2000.edison_half.txt) is an executable shell script. It looks like he has set it up so it should run out-of-the-box.

I currently have 3 simulations going:

321445: 684 nodes, running at half the expected speed (8 SYPD rather than 16 SYPD).

321448 & 321447 are running very very slowly, but I had messed around with the IO settings, which is likely the cause of the slow speed, and I will probably just kill them.

mt5555 commented 8 years ago

random thought: @ndkeen specifically disabled hyperthreading. I wonder if everyone else is using the default hyperthreading, and that could explain why @ndkeen cant reproduce the slowness?

cameronsmith1 commented 8 years ago

If @helenhe40 is correct, then perhaps the step function in the blue timing curve of @golaz (2016-04-04, 40nodes) indicates a specific time the conflicting code started running. @helenhe40 , will it be possible to identify which simulations started at about the wall-clock time @golaz sees the step change?

cameronsmith1 commented 8 years ago

For my 684 node runs, Pat's layout has

so if I understand it correctly, this should not be using hyperthreading (correct?). I have definitely been having problems with my runs, although some of them haven't been too terrible (ie, only 2x slowdown compared to Pat's timed throughput).

cameronsmith1 commented 8 years ago

Hi @helenhe40, you asked about accessing the Confluence page with the PE layout files. You appear to have an account on the ACME Confluence system. Are you able to log in and see any of our Confluence pages?

helenhe40 commented 8 years ago

@cameronsmith1 Yes, I have access to the Confluence page now, and was able to get Pat's 58 node layout file.

I have set up a run with "-compset FC5AV1C-01 -res ne30_oEC" since Pat reported slowness with this. Is it a coupled run?

I also tried to set up another case using @tangq suggestion of "-compset A_WCYCL2000 -res ne30_oEC", and the env_mach_pes.xml.txt he attached of using 153 nodes.

However, the model build failed due to the following error:

cp: cannot stat `/scratch1/scratchdirs/yunhe/ACME/cime/..//components/mpas-cice/model/src/*': No such file or directory

It looks like the "master" version of ACME code does not have mpas-cice source codes.

I will try to use @tangq "run_acme.alpha_20160401.A_WCYCL2000.edison_half.txt" directly tomorrow.

@cameronsmith1 @tangq

If @helenhe40 is correct, then perhaps the step function in the blue timing curve of >@golaz (2016-04-04, 40nodes) indicates a specific time the conflicting code started >running. @helenhe40 , will it be possible to identify which simulations started at about the >wall-clock time @golaz sees the step change? Could you please provide more information on the job id, and the time stamp of the explicit step change, such as: tStamp_write: model date = 10723 0 wall clock = 2016-04-05 15:21:25 avg dt = 376.15 dt = 510.71 tStamp_write: model date = 10724 0 wall clock = 2016-04-05 15:29:34 avg dt = 376.70 dt = 489.11

worleyph commented 8 years ago

@helenhe40 , if it is any help, I also capture the jobs that are running at the same time as the ACME jobs. For some of my slow and fast jobs in the previous list, look in

 /project/projectdirs/acme/performance_archive/worleyph/FC5AV1C-L.ne30_oEC.edison_intel_684/

(slow) 310901: in ./160405-094751 310747: in ./160405-083743 303364: in ./160404-082735

(fast) 302649: in ./160403-021924

The full sqs -f list just prior to the job start is in a file called sqsf.NNN-MMM.gz. There is also squeue output (in squeue.NNN-MMM.gz) for when the job just started running (probably more useful). And there is also sqs -w output in the file called sqs2.NNN-MMM.gz (so similar data to the squeue ouput).

Then in the ./checkpoints subdirectory are periodic calls to sqs -w (called sqsw.NNN-MMM.[seconds until job expires]), which can also be used to examine what jobs are running concurrently throughout the run, usually every 15 minutes.

We have this for all ACME runs, not just mine. Please look at these and determine if you would like to see more.

douglasjacobsen commented 8 years ago

@helenhe40 You need to git submodule update --init from the top level, before you build your case.

More specifically, I'd guess you're not using the hooks repository in your clone, which means you need to manually manage the submodules. (See https://acme-climate.atlassian.net/wiki/display/Docs/Development+Quick+Guide for more info)

worleyph commented 8 years ago

@helenhe40 , note that we also record all of the PEs allocated to each job, and what process is running on each. This is also in the checkpoints subdirectory, named cesm.log.NNN-MMM.[timestamp] (output from setting

 setenv MPICH_CPUMASK_DISPLAY 1

before each run).

tangq commented 8 years ago

@mt5555 , in my coupled run over the weekend, I changed MAX_TASKS_PER_NODE from 48 to 24 (which disables hyperthreading?). The first 5-year simulation was fast, but died when generating restart files by the end of 5th year. However, the restart simulation for the same run was slow.

golaz commented 8 years ago

@helenhe40, the jobid corresponding to the blue line in the plot above is 201420. The transition between fast and slow occurred around 2016-04-04 21:20.

@mt5555, I also tried changing MAX_TASKS_PER_NODE from 48 to 24 but did not see any impact on performance.

ndkeen commented 8 years ago

It may be that we have 2 different issues here.

Issue A) My understanding was that the problem @worleyph suggested is the one I should be trying to address. Which is:

`./create_newcase -case FC5AV1C-L.ne30_oEC.edison_intel_58 -compset FC5AV1C-L -res ne30_oEC -mach edison -compiler intel -project acme Get the 10 node env_mach_pes from: https://acme-climate.atlassian.net/wiki/pages/viewpage.action?pageId=52429184 Modify them to use 24 cores per node, ie change the 48's to 24 to this:

`

For this problem, @worleyph claims he sees every job he submits yields ONLY slow days. I have run hundreds of days (including one running right now over 200 days so far) and simply can't repeat this. I've seen 1 day out of ~1000 was abnornally slow, which I would otherwise ignore. It looks like @helenhe40 is also unable to reproduce the behaviour that @worleyph is experiencing.

Issue B) I see that @tangq @golaz @cameronsmith1 are running a coupled problem with a script and are suggesting we try this. I just tried the following:

./create_newcase -case coupled01 -compset A_WCYCL2000 -res ne30_oEC -mach edison -compiler intel -project acme
Get the env_mach_pes.xml from @tangq above and make the same modification changing 48->24

This submits a job that is requesting 306 nodes.

For Issue A, we should either get to the bottom of why 3 users are seeing different behavior or suggest @worleyph debug as it is incredibly easier to debug something that is consistent and happens with even a small job in the debug queue. If this ATM-only run really is having the same problems as the coupled, then great -- we have something smaller to debug.

If Issue B is another problem, then is it OK that I first try using a bare create_newcase as described above? As opposed to using your script?

Note that regarding the ATM-only run, I have also tried a debug version and did not see any issues. I also ran the same problem on cori (with Intel16) both opt & debug and do not see anything.

worleyph commented 8 years ago

@ndkeen, please note that I also saw consistent slow performance with the A_WCYCL2000 case . See my initial posting. @golaz (and others) saw slow performance from the A_WCYCL2000 compset as well, and @cameronsmith1 (and others) saw slow performance from the FC5AV1C-01 or FC5AV1C-L compset.

The others are running much more than I am (more often and for much longer durations). Since even old executables have shown slower performance at times, I think that it is almost 100% sure that this is an external, runtime, issue. We have provided (or can provide) LOTS of information for NERSC to determine whether there is another user who is triggering this problem for us.

I'll keep running jobs that you suggest (have one in the queue), to see if I can get fast performance again. Hopefully in the meantime NERSC can identify another user or an unreported system software change that might be the source of the. Since you are running fine, the latter hypothesis is clearly not true. Since no one running slowly has been mucking with there build and run environments, I also doubt that this is the root of the problem.

ndkeen commented 8 years ago

@worleyph I can not find where you commented on running the A_WCYCL2000 case. I went back to the confluence page and tried re-reading that -- it is a bit confusing, and I can't seem to figure what exactly was tried. I also don't see where @cameronsmith1 also had trouble with the ATM-only case.

I'm not convinced that these issues are external.

As it is going so slow for me to try experiments, in the meantime, I'd like to get more information about what has already been tried. You stated that even old executables show performance problems -- can someone explain this again?

E3SM-Project / E3SM

(very) slow performance on Edison #820

SBATCH --job-name=FC5AV1C-L.ne30_oEC.edison_intel_58

SBATCH --nodes=80

SBATCH --ntasks-per-node=12

SBATCH --output=FC5AV1C-L.ne30_oEC.edison_intel_58

SBATCH --exclusive

SBATCH --time=01:15:00

SBATCH --partition=regular

SBATCH --account=acme