Closed jgfouca closed 7 years ago
@jgfouca
Timing directory in RUNDIR needs to have lid appended
Check
Create timing.$LID.saved
Check. However, this file is also copied to the performance_archive, and it serves no purpose there. Could you please NOT copy it there?
Do not include full directory hierarchy in tar files
Check
Copy all timing files to performance archive
Check
Restore support for mach_syslog
Not working yet.
One problem is that the current mach_syslog assumes that the model run log is called cesm.log, not acme.log. This is hardwired. I can change this to be an input parameter, but you will need to add this argument to the call from the run script. I don't know where this is. Where should I look? Also, does this make sense? I assume that CESM model runs do not produce an acme.log file, but perhaps CESM does not care to support mach_syslog, in which case there is no reason to add this generality.
Also, the name of the performance_archive checkpoint subdirectory is incorrect. Looking at the output of some instrumentation I added, it looks like $dir is being passed in as checkpoints.$lid/checkpoints, which does not exist. I assume that it should be checkpoints.$lid .
In the case timing subdirectory, stop compressing the performance summary file and
start compressing the raw global statistics timing data.
Check
For create_test, if save_timing is false, be sure to set SAVE_TIMING to False
Someone else will need to test this.
Thanks.
When the above issues are addressed, I'll look at this again, to check more carefully if the current implementation matches that in ACME/CIME2.
@worleyph
Could you please NOT copy it there?
Fixed.
... mach_syslog ...
I've fixed the issue with the log names. CESM is not using this, so we can just assume acme logs. I also fixed checkpoint argument. I'm completely unfamiliar with mach_syslog so I have no idea if it's broken in other ways due to the CIME upgrade.
@jgfouca , thanks. mach_syslog is working now.
A couple of remaining issues (on Titan). I'll test on the other systems once the following are taken care of.
a) Permissions are wrong. All files should be group and world readable, and directories should also be group writable.
For example: CIME2
-rw-r--r-- 1 worley worley 896 Dec 1 18:51 3096778.OU.161201-183002.gz
-rw-r--r-- 1 worley worley 525 Dec 1 18:29 build_environment.161201-183002.gz
drwsrwsr-x 2 worley cli112 4096 Dec 1 18:51 CaseDocs.161201-183002
-rw-r--r-- 1 worley worley 2502 Dec 1 18:51 CaseStatus.161201-183002.gz
-rw-r--r-- 1 worley worley 127066 Dec 1 18:51 cesm.log.161201-183002.gz
-rw-r--r-- 1 worley worley 2172 Dec 1 18:51 cesm_timing.ACME_WCYCL2000.ne30_oEC_pgi_mastercheck_b.161201-183002.gz
-rw-r--r-- 1 worley worley 18609 Dec 1 18:51 cesm_timing_stats.161201-183002.gz
drwsrwsr-x 2 worley cli112 4096 Dec 1 18:51 checkpoints
and CIME5
-rw------- 1 worley cli112 1073 Jan 4 11:13 3160568.OU.170104-105055.gz
-rw------- 1 worley cli112 265359 Jan 4 11:13 acme.log.170104-105055.gz
-rw------- 1 worley cli112 2121 Jan 4 11:13 acme_timing.A_WCYCL2000_ne30_oEC_PR1201_pgiacc2.170104-105055.gz
-rw------- 1 worley cli112 15829 Jan 4 11:13 acme_timing_stats.170104-105055.gz
-rw------- 1 worley cli112 7536 Jan 4 11:13 build_environment.txt.170104-105055.gz
drwx--S--- 2 worley cli112 4096 Jan 4 11:14 CaseDocs.170104-105055
-rw------- 1 worley cli112 164 Jan 4 11:13 CaseStatus.170104-105055.gz
drwx--S--- 2 worley cli112 24576 Jan 4 13:29 checkpoints.170104-105055
...
The same needs to be true in subdirectories as well (i.e. CaseDocs.170104-105055 and checkpoints.170104-105055).
NOTE - having group being cli112 instead of worley is good - thanks for this.
b) Some of the timestamps have been reset??? In mach_syslog, files are copied to the checkpoints subdirectory with '-p', and these timestamps are preserved through subsequent manipulation in CIME2. Something in CIME5 is causing the timestamps to be updated. It is unclear from these experiments whether timestamps are being reset outside of ./checkpoints/ as well (e.g. in the CaseDocs subdirectory) - I'll investigate further if it is not obvious to you. Note that these timestamps provide useful performance and provenance information.
In general, if files are being copied or renamed, could we use '--preserve=timestamps' to keep the timestamps? We should change mach_syslog to use '--preserve=timestamps' also. If you are willing to do this, that will be great. Otherwise I'll try it out once we figure out why timestamps are being changed in the checkpoints subdirectory, as I won't be able to check whether it is doing the right thing until then.
c) Question: The 'top level' has build_environment.txt.$lid.gz . In CaseDocs.$lid there is also a (new) file software_environment.txt.$lid.gz . Are these identical? I know that we will soon be adding run_environment.txt.$lid.gz, but I do not believe that this is in there yet, correct?
@worleyph
Permissions are wrong.
OK, I will set the umask at the beginning of the run.
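For reference, a minimal sketch of what a process-wide umask change does; the 0o002 value and file name here are illustrative, not the actual CIME code:

```python
import os

# Setting the process umask affects every file and directory created
# afterwards. A umask of 0o002 clears only group write from the default
# creation mode, so new files come out rw-rw-r-- (0o664) and new
# directories rwxrwxr-x (0o775): group-writable and world-readable.
old_mask = os.umask(0o002)

with open("example.txt", "w") as f:
    f.write("hello\n")

mode = os.stat("example.txt").st_mode & 0o777
print(oct(mode))  # typically 0o664 under umask 0o002

os.umask(old_mask)  # restore the previous umask
```

Note that the umask only applies to files the run itself creates; files copied with cp -p keep the source file's mode bits.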
Some of the timestamps have been reset
Ah, this is probably because I'm using shutil.copy instead of cp -p. I can fix.
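For reference, the difference is that shutil.copy copies only the file data, while shutil.copy2 (the Python analogue of cp -p) also copies the metadata, including timestamps. A small sketch with illustrative file names:

```python
import os
import shutil
import time

# Make a source file whose modification time is one hour in the past.
with open("src.txt", "w") as f:
    f.write("data\n")
an_hour_ago = time.time() - 3600
os.utime("src.txt", (an_hour_ago, an_hour_ago))

shutil.copy("src.txt", "copy_plain.txt")  # metadata NOT copied: mtime is "now"
shutil.copy2("src.txt", "copy_meta.txt")  # metadata copied too, like cp -p

print(os.path.getmtime("copy_plain.txt") - an_hour_ago)  # roughly 3600 seconds
print(os.path.getmtime("copy_meta.txt") - an_hour_ago)   # roughly 0
```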
Question: The 'top level' has build_environment.txt.$lid.gz . In CaseDocs.$lid there is also a (new) file software_environment.txt.$lid.gz . Are these identical? I know that we will soon be adding run_environment.txt.$lid.gz, but I do not believe that this is in there yet, correct?
I think one comes from case.setup and one comes from case.build. run_env is in CIME5.2 but not yet in ACME.
Note, the performance_archive structure could be improved. Don't want to hold this PR back, and I am not sure what coordination is needed with CESM (if any). At the moment I have added things to CaseDocs that do not appear there in the user's case directory. We should probably have a BuildDocs subdirectory for this extra information, as well as for some things currently at the top level like build_environment and GIT_DESCRIBE and SourceMods. We could even have a RunDocs, but that is everything else, so it may not add any utility.
I have no experience with setting umask globally (if that is what your comment meant), but I'm guessing that cp -p might override this, and that you want to use '--preserve=timestamps' instead?
Is "performance_archive structure could be improved" a new feature? Yes, let's save that for a future PR. This PR should get the beta0 version of performance archiving working in cime5 as best as it can.
@rljacob , yes, sorry to have even brought this up.
@worleyph I think I've addressed your concerns
@jgfouca , the timestamps still are not being preserved in the checkpoints subdirectory. Since many of these are created in the checkpoints directory by mach_syslog, some postprocessing is changing the timestamp? gzip doesn't do this when using /usr/bin/gzip. Are you using a python utility to compress these? Is there anything else going on with CIME5 and the checkpoints subdirectory?
Thanks.
@worleyph yes, I'm using a python gzipper. I'll change it to preserve the timestamp of the original file.
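As a point of reference, a Python gzipper can preserve the timestamp by stamping the .gz file with the source file's times after compressing, which is what /usr/bin/gzip does. A hedged sketch with a hypothetical helper name:

```python
import gzip
import os
import shutil

def gzip_preserving_mtime(path):
    """Compress `path` to `path + '.gz'` and carry the original file's
    access/modification times over to the compressed file.
    (Hypothetical helper for illustration, not the actual CIME code.)"""
    st = os.stat(path)
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    # Stamp the .gz with the source timestamps, like /usr/bin/gzip.
    os.utime(gz_path, (st.st_atime, st.st_mtime))
    return gz_path
```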
@worleyph OK try now. Sorry for the trial-and-error approach we've had to take with this PR; it's hard to test the correctness of this stuff.
@jgfouca ,
OK try now.
Titan is backed up today ... may take awhile to get runs through. I don't want to start testing on other systems until I verify that everything is working okay on Titan (for my sanity's sake), but I expect everything to work fine this time.
Sorry for the trial-and-error approach we've had to take with this PR; it's hard to test the correctness of this stuff.
I very much appreciate your patience with this process - it is what I expected. I was able to be much more involved, working with Jeff Johnson, for the port to CIME2, but that was not an option this time. The 'fix' to the permissions issue, even though seemingly minor, will remove a major headache going forward. I hope that it works as expected at NERSC, but will verify as soon as I get the chance.
Everything seems to be working fine on Titan. I'll start trying out the other platforms.
@jgfouca , testing on cori-haswell, job died with:
ERROR: Command: 'git describe > /project/projectdirs/acme/performance_archive/worleyph/A_WCYCL2000_ne30_oEC_PR1201_corihaswell3/170110-193504/GIT_DESCRIBE.170110-193504' failed with error 'fatal: Not a git repository (or any parent up to mount point /global/u2/w/worleyph/ACME/jgfouca/cime/perf_data_fixes/ACME/cime)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).' from dir '/global/u2/w/worleyph/ACME/jgfouca/cime/perf_data_fixes/ACME/cime'
@jgfouca , testing on Edison looks good, except for one minor anomaly. The last files copied into the checkpoint subdirectory in performance_archive were not gzipped. I am guessing that the mach_syslog was not killed until after the files in this subdirectory were gzipped, so a few more copied over. Don't know if the kill was simply delayed, and this was a rare event, or whether the order of operations is different in CIME5 than in CIME2. Since it works on Titan, guess that it is the former, but would like for you to confirm.
@worleyph this problem has been fixed by a later branch. The git describe problem I mean.
@jgfouca , can I back port your branch (so that I can test)? What would I look for? (Is this a CIME branch or an ACME branch?)
@worleyph I will rebase this branch, just a moment.
@worleyph Done.
@jgfouca , cori-haswell does not create a checkpoints subdirectory (so there is also nothing in it, and can't test mach_syslog until the subdirectory is created).
There are also some other missing items ... sqsw, sqsf, sqsf_jobid, and squeuef .
Thanks.
Looks like the name change from corip1 to cori-haswell is not in there yet? Perhaps this would be a second PR, but I'd really like this PR to get everything working on the current target platforms, which include cori-haswell I believe.
No, my branch for allowing ACME to run on cori-haswell and cori-knl has not yet made it. I'd prefer to get it in asap, but it was suggested I get all the tests to pass first.
@ndkeen, okay - then you'll need to update provenance.py as well. And you'll want me to test the provenance capture capability. @jgfouca , would you prefer to ignore cori-haswell (and cori-knl) for now or go ahead and get this part working?
@ndkeen since the testing is taking a while, go ahead and post the PR for cori-haswell and cori-knl.
OK, I will rebase this branch onto master once Noel's branch has been merged.
@jgfouca , it appears that the 'kill mach_syslog' logic in provenance.py is not working on Edison. I added some logger.warning calls:
# Kill mach_syslog
logger.warning("PW: Before kill syslog")
job_id = _get_batch_job_id_for_syslog(case)
if job_id is not None:
    logger.warning("PW: In kill syslog")
    syslog_jobid_path = os.path.join(rundir, "syslog_jobid", ".%s" % job_id)
    if os.path.exists(syslog_jobid_path):
        logger.warning("PW: In kill syslog part 2")
        try:
            with open(syslog_jobid_path, "r") as fd:
                syslog_jobid = int(fd.read().strip())
            os.kill(syslog_jobid, signal.SIGTERM)
            logger.warning("PW: Killed syslog")
        except (ValueError, OSError) as e:
            logger.warning("Failed to kill syslog: %s" % e)
        finally:
            os.remove(syslog_jobid_path)

# copy/tar timings
logger.warning("PW: After kill syslog")
and output was
2017-01-12 06:19:01 MODEL EXECUTION HAS FINISHED
PW: Before kill syslog
PW: In kill syslog
PW: After kill syslog
check for resubmit
so it looks like the 'if os.path.exists(syslog_jobid_path):' test is failing. I'll check whether the same issue occurs on Titan.
Note that the explicit kill is due to a past problem (can't remember the system) in which the background job kept the job scripts running until time ran out, and sometimes was left as an orphan that caused performance issues on the job submission node (as these accumulate over time).
@worleyph I don't understand the history of the syslog stuff. Does the lack of syslog_jobid_path mean that it does not need to be killed? At first glance, that is how I would interpret this code. Did you see that syslog needed to be killed but wasn't?
@worleyph on an unrelated note. I've pushed a commit to this branch to address the concern you raised on slack regarding batch job output files.
@jgfouca ,
Did you see that syslog needed to be killed but wasn't?
Yes. I'll look back at the CIME2 implementation, and may need you to walk me through the CIME5 implementation to understand why they are different (if they are).
Original code was
if ($app_jobid != 'unknown') then
    set syslog_jobid = 0
    if (-e $run_dir/syslog_jobid.$app_jobid) then
        set syslog_jobid = `cat $run_dir/syslog_jobid.$app_jobid`
        /bin/rm -f $run_dir/syslog_jobid.$app_jobid
    endif
    if ($syslog_jobid != 0) then
        if { kill -0 $syslog_jobid } then
            kill $syslog_jobid
        endif
    endif
endif
The new code is
# Kill mach_syslog
job_id = _get_batch_job_id_for_syslog(case)
if job_id is not None:
    syslog_jobid_path = os.path.join(rundir, "syslog_jobid", ".%s" % job_id)
    if os.path.exists(syslog_jobid_path):
        try:
            with open(syslog_jobid_path, "r") as fd:
                syslog_jobid = int(fd.read().strip())
            os.kill(syslog_jobid, signal.SIGTERM)
        except (ValueError, OSError) as e:
            logger.warning("Failed to kill syslog: %s" % e)
        finally:
            os.remove(syslog_jobid_path)
So, how would I add a logger.warning to print out the syslog_jobid_path, to see why it is not finding the syslog_jobid file? Something like
logger.warning("syslog_jobid_path= %s" %syslog_jobid_path)
?
Note that this writing out of the syslog_jobid and then reading it back in was necessitated by the approach taken to bolt this onto the CIME2 design, introducing pre- and post-run scripts with and without ACME customizations. (It used to all be in the run script, and information was not lost between the mach_syslog spawn and the need to kill it.) If you can capture and preserve the mach_syslog id in a less hacky way, that would be preferred.
If you can capture and preserve the mach_syslog id in a less hacky way, that would be preferred.
Sorry @rljacob - future customization, unless we can't get the current implementation working.
You probably want logger.info("syslog_jobid_path= %s" %syslog_jobid_path) as logger.warning is off by default
@mfdeakin-sandia incorrect, logger.debug is off by default, logger.warning is even higher level than logger.info.
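For reference, Python's standard logging levels are ordered DEBUG < INFO < WARNING < ERROR < CRITICAL, and a logger with no explicit configuration inherits the root logger's default WARNING threshold, so logger.warning calls are emitted while logger.info and logger.debug are suppressed:

```python
import logging

# The numeric level constants encode the severity ordering.
print(logging.DEBUG < logging.INFO < logging.WARNING < logging.ERROR)  # True

# A fresh, unconfigured logger falls back to the root logger's level,
# which defaults to WARNING.
logger = logging.getLogger("demo")
print(logging.getLevelName(logger.getEffectiveLevel()))  # WARNING
print(logger.isEnabledFor(logging.WARNING))  # True
print(logger.isEnabledFor(logging.INFO))     # False
```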
/scratch1 is down on Edison at the moment - I'll test tomorrow when it comes back. I'll look at Titan in the meantime.
@jgfouca , question about your latest commit ... you added a timestamp, changing
self.output_error_path = self.job_id
to
self.output_error_path = self.job_id + get_timestamp()
The problem on Edison (and Cori?) is that the job_id was missing from the job output file name. Don't know if this is the same as the output_error_path. The job_id is what I was requesting, unless it is not available on Edison? Timestamp will generate unique files, but seems a little less useful than jobid.
@jgfouca , what happens in
os.kill(syslog_jobid, signal.SIGTERM)
if syslog_jobid is not a valid PID, either because it already ended/died or is someone else's process? This was an issue on Mira for the csh implementation, which is why I wrapped it with
if { kill -0 $syslog_jobid } then
Will python continue on without killing the postprocessing script?
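For reference, os.kill on a PID that no longer exists raises OSError (errno ESRCH) rather than continuing silently, so the Python analogue of the csh kill -0 guard is to catch that exception, which the except (ValueError, OSError) clause in the new code already does. A sketch with a hypothetical helper:

```python
import errno
import os
import signal

def kill_if_alive(pid):
    """Hypothetical helper: send SIGTERM to `pid`, tolerating a stale PID.
    Returns True if the signal was delivered, False if the process is gone
    (ESRCH) or belongs to someone else (EPERM); re-raises anything else."""
    try:
        os.kill(pid, signal.SIGTERM)
        return True
    except OSError as e:
        if e.errno in (errno.ESRCH, errno.EPERM):
            return False
        raise
```

Because the exception is caught, the postprocessing script continues on either way; only the return value differs.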
@jgfouca , just printed out syslog_jobid_path for an example run on Titan (shortened to be readable)
syslog_jobid_path= xxx/run/syslog_jobid/.3173421
via
logger.warning("syslog_jobid_path= %s" %syslog_jobid_path)
Note the error: it needed to be xxx/run/syslog_jobid.3173421 .
@worleyph we don't have the jobid yet when this gets set up.
@worleyph we might need a more sophisticated fix. I'll look into your other issues as well.
@jgfouca : "@worleyph we don't have the jobid yet when this gets set up." I don't know what we are talking about here. The bug was simply the addition of a '/' when building the string syslog_jobid_path. Can't tell where this is coming from since I do not know what os.path.join is supposed to be doing.
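For reference, os.path.join inserts a path separator between each of its arguments, which is exactly where the stray '/' came from; building the file name as a single component avoids it:

```python
import os.path

rundir = "xxx/run"
job_id = "3173421"

# os.path.join puts a '/' between each argument it receives, so passing
# "syslog_jobid" and ".3173421" as separate components produces a
# subdirectory containing a hidden file:
buggy = os.path.join(rundir, "syslog_jobid", ".%s" % job_id)
print(buggy)  # xxx/run/syslog_jobid/.3173421

# Building the file name as one component gives the intended path:
fixed = os.path.join(rundir, "syslog_jobid.%s" % job_id)
print(fixed)  # xxx/run/syslog_jobid.3173421
```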
@worleyph, I've added the jobid to slurm output files. I didn't realize how easy this was (just needed %j in the output name). I've reverted the timestamp hack.
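For reference, a sketch of what the %j pattern looks like in a Slurm batch header; the job name and file names here are illustrative:

```shell
#!/bin/bash
# %j expands to the numeric Slurm job id when the output/error files are
# created, so each run gets uniquely named files without resorting to a
# timestamp.
#SBATCH --job-name=acme_run
#SBATCH --output=acme_run.o%j
#SBATCH --error=acme_run.e%j
```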
@worleyph I've fixed the syslog_jobid path. See if things work now.
@worleyph , the branch has been rebased and should be ready to test.
Thank you. I'll give it a whirl.
Can someone summarize which machines this has been verified on? I know Titan is ok. What else?
PR #1222 was just now merged to master so this probably needs to be rebased again to work on cori.
@rljacob Done.
@jgfouca , Titan experiments worked; Edison experiments worked.
For Cori, provenance.py still refers to corip1, and the syslog.corip1 needs to be renamed as syslog.cori-haswell and syslog.cori-knl . I'll make these changes locally and try out on cori-haswell. Could you please make the same updates to this PR (after I report back that this was sufficient)?
@jgfouca , cori-haswell worked, once I made the indicated fixes to provenance.py and renamed syslog.corip1. However, cori-knl experiment died with the error:
ERROR: Command: 'git describe > /project/projectdirs/acme/performance_archive/worleyph/A_WCYCL2000_ne30_oEC_PR1201_cori-knl1/170120-205944/GIT_DESCRIBE.170120-205944' failed with error '/bin/sh: git: command not found' from dir '/global/u2/w/worleyph/ACME/jgfouca/cime/perf_data_fixes/ACME'
1) Timing directory in RUNDIR needs to have lid appended
2) Create timing.$LID.saved
3) Do not include full directory hierarchy in tar files
4) Copy all timing files to performance archive
5) Restore support for mach_syslog
6) In the case timing subdirectory, stop compressing the performance summary file and start compressing the raw global statistics timing data.
7) For create_test, if save_timing is false, be sure to set SAVE_TIMING to False
Fixes #1177
[BFB]