@drphilmarshall , no, the run 3.4 visits are not yet included. The 15 time-outs (to date) are, with one exception, all in Run 3.3. It is likely that some fraction of the currently running 611 jobs will also time-out (some were resubmitted after timing out). There are a total of 21,970 visits in the current workflow definition. My best guess is that the list of time-outs will look something like this:
Twinkles run | # time-outs | Total visits |
---|---|---|
Run 3.1 | 4 | 1508 |
Run 3.1b | 2 | 329 |
Run 3.2 | ? | 2104 |
Run 3.3 | 16+? | 18029 |
Run 3.4 | n/a | 2109 |
I am not aware of any significant resources that allow unlimited run time. There is an "idle" queue in batch, but those jobs are expected to checkpoint if needed. There are a few public interactive machines, but a running phoSim would be discovered by 'ranger' after a day and nasty emails would result; it would also be anti-social to run a bunch of long-running phoSim jobs on public interactive machines. There are desktop machines scattered about, but probably not enough of them, and their owners would not be pleased to share their machines with resource-hungry phoSim jobs. I think we must wait for multi-threading and/or checkpointing.
Good, thanks Tom! When all the jobs are either complete or timed out, let's report to John on the full list of "long jobs" then, and see where Team PhoSim are with the multithreading. Perhaps we could be beta testers!
The first batch of visits for Run 3 (Runs 3.1, 3.1b, 3.2, and 3.3) is essentially complete. [There is a single active run, but it is a redo of a previously successful job, resubmitted only to recreate the instanceCatalog/SEDs that were mistakenly blown away by the SLAC scratch disk space cleaner-upper.]
The first few visits have been reprocessed with the new multi-threaded phoSim (v3.6.0), as reported in issue #420 . Please consider helping to validate the output from these test runs, which can be compared directly with the Twinkles Run 3 data.
Here's a summary plot of the CPU time vs. moon altitude for the Run 3 visits. The color coding is by filter, and the symbol size relates to the phase of the Moon. The plot has 21911 points; 59 of the Run 3 visits reached the CPU time limit and are not shown.
I don't think that I see any surprises relative to Twinkles Run 1. The fraction of visits near the CPU time limit is very much smaller now, because visits with predicted CPU times greater than 100 hours were not run. It would be interesting to understand the visits that took a lot of CPU time even though the Moon was down.
For reference here is the equivalent plot for Twinkles Run 1, when it was almost complete. The individual points were made larger in this plot because it is less crowded.
@rbiswas4, I've posted a csv file with metadata for the Run 3 runs along with predicted and actual CPU times here. For the 59 visits that reached the 120-hour CPU time limit, the entry in the 'cpu' column is -999. The 'host' column contains a string identifying the class of batch host computer. (The CPU time predictions are for the 'fell' class.)
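A minimal sketch of how that file could be inspected, assuming it loads cleanly with pandas. Only the 'cpu' column (with -999 for timed-out visits) and the 'host' column are described above; the file name and the 'moonalt', 'moonphase', and 'filter' column names (and CPU seconds as the unit) are assumptions for illustration:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file name; 'cpu' (-999 if the 120-hour limit was hit) and
# 'host' are described above, the other column names are assumed.
df = pd.read_csv("twinkles_run3_cpu_times.csv")

timed_out = df["cpu"] == -999
ok = df[~timed_out]

# Visits that were expensive even with the Moon below the horizon
# (hypothetical 50-hour cut, assuming 'cpu' is in seconds).
moon_down_slow = ok[(ok["moonalt"] < 0) & (ok["cpu"] > 50 * 3600)]
print(len(ok), "completed;", int(timed_out.sum()), "timed out;",
      len(moon_down_slow), "slow visits with the Moon down")

# CPU time vs. Moon altitude, one colour per filter, symbol size ~ Moon phase.
fig, ax = plt.subplots()
for band, grp in ok.groupby("filter"):
    ax.scatter(grp["moonalt"], grp["cpu"] / 3600.0,
               s=10 + 40 * grp["moonphase"] / 100.0, alpha=0.5, label=band)
ax.set_xlabel("Moon altitude (deg)")
ax.set_ylabel("CPU time (hours)")
ax.legend(title="filter")
plt.show()
```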
There's more to look at, but here's the comparison of actual and predicted (fell) CPU times. The points are color coded by batch host type: bullet (orange), deft (green), dole (cyan), fell (purple), hequ (red), kiso (blue). The line y = x applies to the purple points. I have a few empirical scale factors based on repeated Run 1 visits; for example, the lower line (scaled by a factor of 1.71) is my guess, based on Run 1, of what the hequ (red) times should be.
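A sketch of how those per-host scale factors could be estimated from the same table; 'cpu_pred' is an assumed column name for the predicted (fell) CPU time:

```python
import pandas as pd

df = pd.read_csv("twinkles_run3_cpu_times.csv")   # same hypothetical file as above
ok = df[df["cpu"] != -999]                        # drop the timed-out visits

# Median ratio of actual CPU time to the predicted 'fell' time, per host class.
scale = (ok["cpu"] / ok["cpu_pred"]).groupby(ok["host"]).median()
print(scale)   # expect ~1.0 for fell and ~1.7 for hequ if the Run 1 guess holds
```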
@TomGlanzman @sethdigel I'm going to close this, since Tom has emerged victorious from the bulk of the Run 3 processing and the long jobs have their own issue elsewhere :-) ( #420 ). This thread has some good plots in it for a phoSim workflow note; we should be able to find them by searching the issues, though.
This is where news of the Twinkles Run 3 phoSim production will appear.
First news item:
The first 40 Run 3 test runs are happening now. The SLAC Pipeline task is TW-phoSim-r3, and progress may be monitored at this link. At this moment, 19 of 40 are complete; the remaining runs should finish by Monday morning.
This task includes the latest dynamically generated instanceCatalogs and SED files from Rahul. It also makes use of the SLAC Lustre high-performance file system, in an effort to avoid the I/O issues observed in Run 1 (due to phoSim's creation of its /work directory).
These first 40 are tests; it would not be unreasonable to redo them if a configuration or other issue arises. Data may be searched in the data catalog: http://srs.slac.stanford.edu/DataCatalog/folder.jsp?folder=16123472, or found directly at SLAC under /nfs/farm/g/desc/u1/Pipeline-tasks/TW-phoSim-r3/phosim_output. From there, dig down through the subdirectories to the "output" directory and you will find two files per visit: lsst_e_<visit#>_fN_E000.fits.gz and the associated centroid file.
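As a rough illustration, something like the following could be used to walk that output area and count the two files per visit; the per-visit subdirectory layout and the centroid file naming pattern are assumptions, so adjust the glob patterns to the actual structure:

```python
from pathlib import Path

# Top of the Run 3 output area at SLAC (path quoted above).
top = Path("/nfs/farm/g/desc/u1/Pipeline-tasks/TW-phoSim-r3/phosim_output")

for out_dir in sorted(top.glob("*/output")):      # assumed layout: <visit>/output
    eimages   = list(out_dir.glob("lsst_e_*_E000.fits.gz"))
    centroids = list(out_dir.glob("centroid_*"))  # assumed centroid file prefix
    print(out_dir.parent.name, len(eimages), "eimage(s),",
          len(centroids), "centroid file(s)")
```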
Please post operational questions/concerns here.