E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM
Other
336 stars 338 forks source link

Fixes to provenance/timing saving #1201

Closed jgfouca closed 7 years ago

jgfouca commented 7 years ago

1) Timing directory in RUNDIR needs to have lid appended 2) Create timing.$LID.saved 3) Do not include full directory hierarchy in tar files 4) Copy all timing files to performance archive 5) Restore support for mach_syslog 6) In the case timing subdirectory, stop compressing the performance summary file and start compressing the raw global statistics timing data. 7) For create_test, if save_timing is false, be sure to set SAVE_TIMING to False

Fixes #1177

[BFB]

worleyph commented 7 years ago

@jgfouca

Timing directory in RUNDIR needs to have lid appended

Check

Create timing.$LID.saved

Check. However, this file is also copied to the performance_archive, and it serves no purpose there. Could you please NOT copy it there?

Do not include full directory hierarchy in tar files

Check

Copy all timing files to performance archive

Check

Restore support for mach_syslog

Not working yet.

One problem is that the current mach_syslog assumes that the model run log is called, cesm.log, not acme.log. This is hardwired. I can change this to be an input parameter, but you will need to add this argument to the call from the run script. I don't know where this is. Where should I look? Also, does this make sense? I assume that CEMS model runs do not produce an acme.log file, but prehaps CESM does not care to support mach_syslog, in which case there is no reason to add this generality.

Also, the name of the performance_archive checkpoint subdirectory is incorrect. Looking at the output of some instrumentation I added, it looks that $dir is being passed in as checkpoints.$lid/checkpoints , which does not exist. I assume that it should be checkpoints.$lid .

In the case timing subdirectory, stop compressing the performance summary file and
start compressing the raw global statistics timing data.

Check

For create_test, if save_timing is false, be sure to set SAVE_TIMING to False

Someone else will need to test this.

Thanks.

When the above issues are addressed, I'll look at this again, to check more carefully if the current implementation matches that in ACME/CIME2.

jgfouca commented 7 years ago

@worleyph

Could you please NOT copy it there?

Fixed.

... mach_syslog ...

I've fixed the issue with the log names. CESM is not using this, so we can just assume acme logs. I also fixed checkpoint argument. I'm completely unfamiliar with mach_syslog so I have no idea if it's broken in other ways due to the CIME upgrade.

worleyph commented 7 years ago

@jgfouca , thanks. mach_syslog is working now.

A couple of remaining issues (on Titan). I'll test on the other systems once the following are taken care of.

a) Permissions are wrong. All files should be group and world readable, and directories should also be group writable.

For example: CIME2

 -rw-r--r-- 1 worley worley    896 Dec  1 18:51 3096778.OU.161201-183002.gz
 -rw-r--r-- 1 worley worley    525 Dec  1 18:29 build_environment.161201-183002.gz
 drwsrwsr-x 2 worley cli112   4096 Dec  1 18:51 CaseDocs.161201-183002
 -rw-r--r-- 1 worley worley   2502 Dec  1 18:51 CaseStatus.161201-183002.gz
 -rw-r--r-- 1 worley worley 127066 Dec  1 18:51 cesm.log.161201-183002.gz
 -rw-r--r-- 1 worley worley   2172 Dec  1 18:51 cesm_timing.ACME_WCYCL2000.ne30_oEC_pgi_mastercheck_b.161201-183002.gz
 -rw-r--r-- 1 worley worley  18609 Dec  1 18:51 cesm_timing_stats.161201-183002.gz
 drwsrwsr-x 2 worley cli112   4096 Dec  1 18:51 checkpoints

and CIME5

 -rw------- 1 worley cli112   1073 Jan  4 11:13 3160568.OU.170104-105055.gz
 -rw------- 1 worley cli112 265359 Jan  4 11:13 acme.log.170104-105055.gz
 -rw------- 1 worley cli112   2121 Jan  4 11:13 acme_timing.A_WCYCL2000_ne30_oEC_PR1201_pgiacc2.170104-105055.gz
 -rw------- 1 worley cli112  15829 Jan  4 11:13 acme_timing_stats.170104-105055.gz
 -rw------- 1 worley cli112   7536 Jan  4 11:13 build_environment.txt.170104-105055.gz
 drwx--S--- 2 worley cli112   4096 Jan  4 11:14 CaseDocs.170104-105055
 -rw------- 1 worley cli112    164 Jan  4 11:13 CaseStatus.170104-105055.gz
 drwx--S--- 2 worley cli112  24576 Jan  4 13:29 checkpoints.170104-105055
 ...

The same needs to be true in subdirectories as well (i.e. CaseDocs.170104-105055 and heckpoints.170104-105055).

NOTE - having group being cli112 instead of worley is good - thanks for this.

b) Some of the timestamps have been reset??? In mach_syslog, files are copied to the checkpoints subdirectory with '-p', and these timestamps are preserved through subsequent manipulation in CIME2. Something in CIME5 is causing the timestamps to be updated. It is unclear from these experiments whether timestamps are being reset outside of ./checkpoints/ as well (e.g. in the CaseDocs subdirectory) - I'll investigate further if it is not obvious to you. Note that these timestamps provide useful performance and provenance information.

In general, if files are being copied or renamed, could we use '--preserve=timestamps' to keep the timestamps? We should change mach_syslog to use '--preserve=timestamps' also. If you are willing to do this, that will be great. Otherwise I'll try it out once we figure out why timestamps are being changed in the checkpoints subdirectory, as I won't be able to check whether it is doing the right thing until then.

c) Question: The 'top level' has build_environment.txt.$lid.gz . In CaseDocs.$lid there is also a (new) file software_environment.txt.%lid.gz . Are these identical? I know that we will soon be adding run_environment.txt.$lid.gz, but I do not believe that this is in there yet, correct?

jgfouca commented 7 years ago

@worleyph

Permissions are wrong.

OK, I will set the umask at the beginning of the run.

Some of the timestamps have been reset

Ah, this is probably because I'm using shutil.copy instead of cp -p. I can fix.

Question: The 'top level' has build_environment.txt.$lid.gz . In CaseDocs.$lid there is also a (new) file software_environment.txt.%lid.gz . Are these identical? I know that we will soon be adding run_environment.txt.$lid.gz, but I do not believe that this is in there yet, correct?

I think one comes from case.setup and one comes from case.build. run_env is in CIME5.2 but not yet in ACME.

worleyph commented 7 years ago

Note, the performance_archive structure could be improved. Don't want to hold this PR back, and I am not sure what coordination is needed with CESM (if any). At the moment I have added things to CaseDocs that do not appear there in the user's case directory. We should probable have a BuildDocs subdirectory for this extra information, as well as some things currently at the top level like build_environment and GIT_DESCRIBE and SourceMods. We could even have a RunDocs, but that is everything else, so may not add any utility.

worleyph commented 7 years ago

I have no experience with setting umask globally (if that is what your comment meant), but I'm guessing that cp -p might override this, and that you want to use '--preserve=timestamps' instead?

rljacob commented 7 years ago

Is " performance_archive structure could be improved" a new feature? Yes lets save that for a future PR. This PR should get the beta0 version of performance archiving working in cime5 as best as it can.

worleyph commented 7 years ago

@rljacob , yes, sorry to have even brought this up.

jgfouca commented 7 years ago

@worleyph I think I've addressed your concerns

worleyph commented 7 years ago

@jgfouca , the timestamps still are not being preserved in the checkpoints subdirectory. Since many of these are created in the checkpoints directory by mach_syslog, some postprocessing is changing the timestamp? gzip doesn't do this when using /usr/bin/gzip. Are you using a python utlity to compress these? Is there anything else going on with CIME5 and the checkpoints subdirectory?

Thanks.

jgfouca commented 7 years ago

@worleyph yes, I'm using a python gzipper. I'll change it to preserve the timestamp of the original file.

jgfouca commented 7 years ago

@worleyph OK try now. Sorry for the trial-and-error approach we've had to take with this PR; it's hard to test the correctness of this stuff.

worleyph commented 7 years ago

@jgfouca ,

OK try now.

Titan is backed up today ... may take awhile to get runs through. I don't want to start testing on other systems until I verify that everything is working okay on Titan (for my sanity's sake), but I expect everything to work fine this time.

Sorry for the trial-and-error approach we've had to take with this PR; it's hard to test the correctness of this stuff.

I very much appreciate your patience with this process - it is what I expected. I was able to be much more involved, working with Jeff Johnson, for the port to CIME2, but that was not an option this time. The 'fix' to the permissions issue, even though seemingly minor, will remove a major headache going forward. I hope that it works as expected at NERSC, but will verify as soon as I get the chance.

worleyph commented 7 years ago

Everything seems to be working fine on Titan. I'll start trying out the other platforms.

worleyph commented 7 years ago

@jgfouca , testing on cori-haswell, job died with:

 ERROR: Command: 'git describe > /project/projectdirs/acme/performance_archive/worleyph/A_WCYCL2000_ne30_oEC_PR1201_corihaswell3/170110-193504/GIT_DESCRIBE.170110-193504' failed with error 'fatal: Not a git repository (or any parent up to mount point /global/u2/w/worleyph/ACME/jgfouca/cime/perf_data_fixes/ACME/cime)
 Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).' from dir '/global/u2/w/worleyph/ACME/jgfouca/cime/perf_data_fixes/ACME/cime'
worleyph commented 7 years ago

@jgfouca , testing on Edison looks good, except for one minor anomaly. The last files copied into the checkpoint subdirectory in performance_archive were not gzipped. I am guessing that the mach_syslog was not killed until after the files in this subdirectory were gzipped, so a few more copied over. Don't know if the kill was simply delayed, and this was a rare event, or whether the order of operations is different in CIME5 than in CIME2. Since it works on Titan, guess that it is the former, but would like for you to confirm.

jgfouca commented 7 years ago

@worleyph this problem has been fixed by a later branch. The git describe problem I mean.

worleyph commented 7 years ago

@jgfouca , Can I back port this (so that I can test) your branch? What would I look for? (Is this a CIME branch or an ACME branch?)

jgfouca commented 7 years ago

@worleyph I will rebase this branch, just a moment.

jgfouca commented 7 years ago

@worleyph Done.

worleyph commented 7 years ago

@jgfouca , cori-haswell does not create a checkpoints subdirectory (so there is also nothing in it, and can't test mach_syslog until the subdirectory is created).

There are also some other missing items ... sqsw, sqsf, sqsf_jobid, and squeuef .

Thanks.

worleyph commented 7 years ago

Looks like the name change from corip1 to cori-haswell is not in there yet? Perhaps this is would be a second PR, but I'd really like this PR to get everything working on the current target platforms, which include cori-haswell I believe.

ndkeen commented 7 years ago

No, my branch for allowing ACME to run on cori-haswell and cori-knl has not yet made it. I'd prefer to get it in asap, but it was suggested I get all the tests to pass first.

worleyph commented 7 years ago

@ndkeen, okay - then you'll need to update provenance.py as well. And you'll want me to test the provenance capture capability. @jgfouca , would you prefer to ignore cori-haswell (and cori-knl) for now or go ahead and get this part working.

rljacob commented 7 years ago

@ndkeen since the testing is taking a while, go ahead and post the PR for cori-haswell and cori-knl.

jgfouca commented 7 years ago

OK, I will rebase this branch onto master once Noel's branch has been merged.

worleyph commented 7 years ago

@jgfouca , it appears the that 'kill mach_syslog' logic in provenance.py is not working on Edison. I added some logger.warning calls:

# Kill mach_syslog
logger.warning("PW: Before kill syslog")
job_id = _get_batch_job_id_for_syslog(case)
if job_id is not None:
    logger.warning("PW: In kill syslog")
    syslog_jobid_path = os.path.join(rundir, "syslog_jobid", ".%s" % job_id)
if os.path.exists(syslog_jobid_path):
        logger.warning("PW: In kill syslog part 2")
        try:
            with open(syslog_jobid_path, "r") as fd:
                syslog_jobid = int(fd.read().strip())
            os.kill(syslog_jobid, signal.SIGTERM)
    logger.warning("PW: Killed syslog")
        except (ValueError, OSError) as e:
            logger.warning("Failed to kill syslog: %s" % e)
        finally:
            os.remove(syslog_jobid_path)

# copy/tar timings
logger.warning("PW: After kill syslog")

and output was

 2017-01-12 06:19:01 MODEL EXECUTION HAS FINISHED
 PW: Before kill syslog
 PW: In kill syslog
 PW: After kill syslog
 check for resubmit

so it looks like the 'if os.path.exists(syslog_jobid_path):' test is failing. I'll check whether the same issue occurs on TItan.

Note that the explicit kill is due to a past problem (can't remember the system) in which the background job kept the job scripts running until time ran out, and sometimes was left as an orphan that caused performance issues on the job submission node (as these accumulate over time).

jgfouca commented 7 years ago

@worleyph I don't understand the history of the syslog stuff. Does the lack of syslog_jobid_path mean that it does not need to be killed? At first glance, that is how I would interprety this code. Did you see that syslog needed to be killed but wasn't?

jgfouca commented 7 years ago

@worleyph on an unrelated note. I've pushed a commit to this branch to address the concern you raised on slack regarding batch job output files.

worleyph commented 7 years ago

@jgfoucar.

Did you see that syslog needed to be killed but wasn't?

Yes. I'll look back at the CIME2 implementation, and but may need you to walk me through the CIME5 implementation to understand why they are different (if they are).

worleyph commented 7 years ago

Original code was

 if ($app_jobid != 'unknown') then

   set syslog_jobid = 0
   if (-e $run_dir/syslog_jobid.$app_jobid) then
     set syslog_jobid = `cat $run_dir/syslog_jobid.$app_jobid`
     /bin/rm -f $run_dir/syslog_jobid.$app_jobid
   endif

   if ($syslog_jobid != 0) then
     if { kill -0 $syslog_jobid } then
       kill $syslog_jobid
     endif
   endif

 endif

The new code is

# Kill mach_syslog                                                             
job_id = _get_batch_job_id_for_syslog(case)
if job_id is not None:
    syslog_jobid_path = os.path.join(rundir, "syslog_jobid", ".%s" % job_id)
    if os.path.exists(syslog_jobid_path):
        try:
            with open(syslog_jobid_path, "r") as fd:
                syslog_jobid = int(fd.read().strip())
            os.kill(syslog_jobid, signal.SIGTERM)
        except (ValueError, OSError) as e:
            logger.warning("Failed to kill syslog: %s" % e)
        finally:
            os.remove(syslog_jobid_path)

So, how would I add a logger.warning to print out the syslog_jobid_path, to see why it is not finding the syslog_jobid file? Something like

   logger.warning("syslog_jobid_path= %s" %syslog_jobid_path)

?

Note that this writing out the syslog_jobid and then reading it back in was necessitated by the approach taken to bolt this onto the CIME2 design, introducing pre and post run scripts with and without ACME customizations. (It used to all be in the run script, and information was not lost between the mach_syslog spawn and the need to kill it) If you can capture and preserve the mach_syslog id in a less hacky way, that would be preferred.

worleyph commented 7 years ago

If you can capture and preserve the mach_syslog id in a less hacky way, that would be preferred.

Sorry @rljacob - future customization, unless we can't get the current implementation working.

mfdeakin-sandia commented 7 years ago

You probably want logger.info("syslog_jobid_path= %s" %syslog_jobid_path) as logger.warning is off by default

jgfouca commented 7 years ago

@mfdeakin-sandia incorrect, logger.debug is off by default, logger.warning is even higher level than logger.info.

worleyph commented 7 years ago

/scratch1 is down on Edison at the moment - I'll test tomorrow when it comes back. I'll look at Titan in the meantime.

worleyph commented 7 years ago

@jgfouca , question about your latest commit ... you added a timestamp, changing

 self.output_error_path = self.job_id

to

 self.output_error_path = self.job_id + get_timestamp()

The problem on Edison (and Cori?) is that the job_id was missing from the job output file name. Don't know if this is the same as the output_error_path. The job_id is what I was requesting, unless it is not available on Edison? Timestamp will generate unique files, but seems a little less useful than jobid.

worleyph commented 7 years ago

@jgfouca , what happens in

  os.kill(syslog_jobid, signal.SIGTERM)

if syslog_jobid is not a valid PID, either because it already ended/died or is someone else's process? This was an issue on Mira for the csh implementation, which is why I wrapped it with

 if { kill -0 $syslog_jobid } then

Will python continue on without killing the postprocessing script?

worleyph commented 7 years ago

@jgfouca , just printed out syslog_jobid_path for an example run on Titan (shortened to be readable)

 syslog_jobid_path= xxx/run/syslog_jobid/.3173421

via

 logger.warning("syslog_jobid_path= %s" %syslog_jobid_path)

Note the error. Needed to be xxx/run/syslog_jobid.3173421 .

jgfouca commented 7 years ago

@worleyph we don't have the jobid yet when this gets set up.

jgfouca commented 7 years ago

@worleyph we might need a more sophisticated fix. I'll look into your other issues as well.

worleyph commented 7 years ago

@jgfouca : "@worleyph we don't have the jobid yet when this gets set up." I don't know what we are talking about here. The bug was simply the addition of a '/' when building the string syslog_jobid_path. Can't tell where this is coming from since I do not know what os.path.join is supposed to be doing.

jgfouca commented 7 years ago

@worleyph, I've added the jobid to slurm output files. I didn't realize how easy this was (just needed %j in the output name). I've revert the timestamp hack.

jgfouca commented 7 years ago

@worleyph I've fixed the syslog_jobid path. See if things work now.

jgfouca commented 7 years ago

@worleyph , the branch has been rebased and should be ready to test.

worleyph commented 7 years ago

Thank you. I'll give it a whirl.

rljacob commented 7 years ago

Can someone summarize which machines this has been verified on? I know Titan is ok. What else?

rljacob commented 7 years ago

PR #1222 was just now merged to master so this probably needs to be rebased again to work on cori.

jgfouca commented 7 years ago

@rljacob Done.

worleyph commented 7 years ago

@jgfouca , titan experiments worked; edison experiments worked.

For Cori, provenance.py still refers to corip1, and the syslog.corip1 needs to be renamed as syslog.cori-haswell and syslog.cori-knl . I'll make these changes locally and try out on cori-haswell. Could you please make the same updates to this PR (after I report back that this was sufficient)?

worleyph commented 7 years ago

@jgfouca , cori-haswell worked, once I made the indicated fixes to provenance.py and renamed syslog.corip1. However, cori-knl experiment died with the error:

 ERROR: Command: 'git describe > /project/projectdirs/acme/performance_archive/worleyph/A_WCYCL2000_ne30_oEC_PR1201_cori-knl1/170120-205944/GIT_DESCRIBE.170120-205944' failed with error '/bin/sh: git: command not found' from dir '/global/u2/w/worleyph/ACME/jgfouca/cime/perf_data_fixes/ACME'