Steady running over the past 24 hours. A large block of SLURM jobs timed out yesterday morning after 48 hours, dropping production from over 1,000 to only ~320 nodes, but production has since ramped back up to >600 nodes.
07:45 status:
10:05 An urgent patch was made to one of the workflow scripts to allow one automatic retry of a phoSim step (trim, raytrace, e2adc). This was done because there is a low level of transient failures when reading the focalplanelayout.txt file (a phoSim config file), which seems to be correlated with a new node being added to the workflow. Because 'raytrace' and 'e2adc' occur within the same workflow step, a failure in 'e2adc' requires the 'raytrace' step to be repeated. The retry sleeps for 30 seconds, then tries the same command again. A second phoSim step failure will cause the entire workflow step to fail (as it does now after a single failure).
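A minimal sketch of that retry behavior (the names and structure here are hypothetical; the actual workflow script differs in detail):

```python
import subprocess
import time

def run_phosim_step(cmd, retry_delay=30):
    """Run one phoSim step (e.g., trim, raytrace, e2adc) with a single
    automatic retry after a short sleep; a second failure fails the step."""
    rc = subprocess.call(cmd)
    if rc != 0:
        time.sleep(retry_delay)    # wait out the transient focalplanelayout.txt read failure
        rc = subprocess.call(cmd)  # one retry only; if this also fails, the workflow step fails
    return rc
```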
Not a great 24 hours, but not due to a lack of resources. Some hiccups due to script changes: a recommended change to query machine memory unexpectedly caused a bash error, which in turn caused a block of failures until the change was pulled out. Bah.
08:05 status:
09:00 - production halted for workflow repair & testing
17:30 - production restarted (Pilot jobs released from user hold).
All systems running smoothly. Yesterday's "maintenance day" appears to have been successful. Cori memory is now routinely recorded and the transient read error of 'focalplanelayout.txt' will now be automatically retried.
09:45 status:
A smooth 24 hours, but a problem associated with the SLAC workflow engine has appeared: a large number of jobs (~30k) are stuck in a "QUEUED" state and are not being released for processing. Experts notified.
13:40 status:
18:00: The QUEUED problem mentioned above appears to have been solved.
A smooth 24 hours, which saw a record number of Cori nodes, however briefly, involved with this project: 2,050 -- about 20% of the entire Cori KNL machine.
14:10 status:
Note: Assuming the DC2 Run 1.2 phoSim run has been the overwhelming bulk of our work on Cori over the past couple of weeks, an updated guess at the amount of NERSC time needed to complete is now more like 17M [hours used so far] / 56% [fraction complete] = 30M hours. We may get a 'refund' of 2M hours due to the Friday-the-13th meltdown. We also lost some efficiency during the "maintenance day" earlier in the week. Hopefully we can beat the 30M hours.
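That projection is just a linear extrapolation; spelled out (values from the paragraph above):

```python
hours_used = 17e6          # NERSC hours consumed so far
fraction_complete = 0.56   # fraction of Run 1.2p sensor-visits finished
projected_total = hours_used / fraction_complete   # ~30.4M hours
print("Projected total: ~%.0fM hours" % (projected_total / 1e6))
```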
As expected, a large number of Pilot jobs timed out Saturday night, representing the large block of Cori nodes obtained Thursday evening which ran for 48 hours. Interestingly, a large chunk of the terminated jobs are long-running, meaning they cannot simply be rolled back, as they would run out of time yet again. So they are being added to the ever-growing collection that must be dealt with using a NERSC "reservation", once the shorter jobs have completed. Given that NERSC wants a 4-day lead time (or more) for a reservation and that the supply of shorter jobs is dwindling, I will be making a reservation request tomorrow after consulting with Wahid, our current NERSC contact.
09:25 status:
Yesterday, our NERSC usage passed a milestone of sorts: our usage now exceeds the "uniform usage" projection, i.e., the amount of allocation that would have been used had consumption been distributed evenly over the allocation year. (Note that Run 1.2p began on 5 Apr 2018.)
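For concreteness, a sketch of the "uniform usage" curve being compared against (the function and the example values are illustrative, not NERSC's actual accounting tool):

```python
from datetime import date

def uniform_usage(total_allocation, year_start, year_end, today):
    """Hours that would be consumed by `today` if the allocation were
    spent evenly over the allocation year (illustrative only)."""
    elapsed = (today - year_start).days
    span = (year_end - year_start).days
    return total_allocation * float(elapsed) / span

# Hypothetical example: a 60M-hour allocation, queried in late April.
print(uniform_usage(60e6, date(2018, 1, 9), date(2019, 1, 8), date(2018, 4, 23)))
```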
10:06 update: Here is the current status of Run 1.2p visits broken down by field and band:
Band | Survey | #Visits | #Visits complete |
---|---|---|---|
u | WFD | 67 | 57 |
g | WFD | 91 | 91 |
r | WFD | 245 | 168 |
i | WFD | 223 | 98 |
z | WFD | 247 | 71 |
y | WFD | 252 | 10 |
u | uDDF | 192 | 112 |
g | uDDF | 138 | 45 |
r | uDDF | 138 | 9 |
i | uDDF | 137 | 24 |
z | uDDF | 136 | 49 |
y | uDDF | 135 | 47 |
- | TOTAL | 2001 | 781 |
NERSC is performing maintenance on the 'projecta' group disk storage system today from 0800-1400. Because this space is explicitly called out as a requirement for all phoSim jobs, no new Pilot jobs have been dispatched since Saturday morning. As a result of this outage, Run 1.2p is temporarily halted.
NOTE: There are no more 'pending' sensor-visits; all 158,766 sensor-visits have now been attempted at least once. There remains a large number of 'terminated' sensor-visits: some (>30k jobs) ran for >48 hours, and the remainder terminated when their Pilot jobs ran out of time in the SLURM queue.
08:20 status:
Hi Tom. Great; can you help me parse one thing?
Since you say that all visits have now been attempted and 63% are complete, that means ~40% of the sensor-visits needed to be repeated. What fraction of the 30M hours was "wasted" in this way? I assume it is a smaller number since many jobs never really started, etc. Do we know what that "CPU efficiency" number actually is?
Tom's RESPONSE:
To clarify, Run 1.2p has consumed ~20M NERSC hours to date (~10M hours had been charged to our allocation prior to the commencement of Run 1.2p). My guess is that the inefficiency is somewhere between 25% and 40%. To date, ~250k jobs have been rolled back -- for all reasons (e.g., time-out terminations, file system failures, machine crashes, intentional kills). I do not have a good tool for providing a more precise answer at this time.
@cwwalter Just to clarify, the 30M is all of the allocation that we have used so far, not just Run 1.2p.
After yesterday's storage outage, 500 nodes started up, and they were able to accommodate all 15,590 remaining rolled-back jobs (i.e., jobs that terminated due to Pilot time-out but had not yet run for 48 hours). The scale of the "extra long" jobs is now known: ~44,000 (about 28% of the total number of sensor-visits). I have just started negotiating with Wahid at NERSC for a >48-hour Cori-KNL reservation to handle these jobs.
07:55 status:
DC2 Run 1.2p has reached a pause. All sensor-visits that can complete within 48 hours (the maximum queue time at NERSC) have completed. The remaining sensor-visits will be attempted during a special 7-day 'reservation' scheduled to commence next Tuesday, 1 May 2018 at 10:00 Pacific. With luck (depending on how long these sensor-visits actually take), we could complete Run 1.2p during this reservation. If not, we will need to schedule a follow-on mop-up reservation.
Here are the numbers:
Next news update Tuesday, 1 May 2018.
Snapshot of the "blue" DC2-Run 1.2p status page, where statistics on the long-running sensor-visits may be found (timed-out jobs are tallied in the "TERMINATED" column):
2018-04-26 08:30:03.770866 Pacific
(Note: jobs in the "QUEUED" state are not tallied, and the "ALL" column is currently unreliable)
=======Task and (batch) Process-step Rollup==============================
+-----------+-----------+-----------+-----------+----------+--------------+-------+
| WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|-----------+-----------+-----------+-----------+----------+--------------+-------|
| 0 | 0 | 0 | 151809 | 0 | 53980 | 0 |
+-----------+-----------+-----------+-----------+----------+--------------+-------+
=======Task Rollup===============================================
+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| setupVisit | 0 | 0 | 0 | 2001 | 0 | 0 | 0 |
| RunTrim | 0 | 0 | 0 | 44022 | 0 | 0 | 0 |
| RunRaytrace | 0 | 0 | 0 | 104786 | 0 | 53980 | 0 |
| finishVisit | 0 | 0 | 0 | 1000 | 0 | 0 | 0 |
+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
======================================================
Total rolled back process steps = 249645
=======Detailed Task Summary======================================
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-WFD-u | setupVisit | 0 | 0 | 0 | 67 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-u | RunTrim | 0 | 0 | 0 | 1474 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-u | RunRaytrace | 0 | 0 | 0 | 4473 | 0 | 348 | 0 |
| DC2-R1-2p-WFD-u | finishVisit | 0 | 0 | 0 | 58 | 0 | 0 | 0 |
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-WFD-g | setupVisit | 0 | 0 | 0 | 91 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-g | RunTrim | 0 | 0 | 0 | 2002 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-g | RunRaytrace | 0 | 0 | 0 | 6118 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-g | finishVisit | 0 | 0 | 0 | 91 | 0 | 0 | 0 |
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-WFD-r | setupVisit | 0 | 0 | 0 | 245 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-r | RunTrim | 0 | 0 | 0 | 5390 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-r | RunRaytrace | 0 | 0 | 0 | 15912 | 0 | 2562 | 0 |
| DC2-R1-2p-WFD-r | finishVisit | 0 | 0 | 0 | 185 | 0 | 0 | 0 |
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-WFD-i | setupVisit | 0 | 0 | 0 | 223 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-i | RunTrim | 0 | 0 | 0 | 4906 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-i | RunRaytrace | 0 | 0 | 0 | 9469 | 0 | 6774 | 0 |
| DC2-R1-2p-WFD-i | finishVisit | 0 | 0 | 0 | 102 | 0 | 0 | 0 |
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-WFD-z | setupVisit | 0 | 0 | 0 | 247 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-z | RunTrim | 0 | 0 | 0 | 5434 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-z | RunRaytrace | 0 | 0 | 0 | 7880 | 0 | 10219 | 0 |
| DC2-R1-2p-WFD-z | finishVisit | 0 | 0 | 0 | 82 | 0 | 0 | 0 |
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-WFD-y | setupVisit | 0 | 0 | 0 | 252 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-y | RunTrim | 0 | 0 | 0 | 5544 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-y | RunRaytrace | 0 | 0 | 0 | 2122 | 0 | 15903 | 0 |
| DC2-R1-2p-WFD-y | finishVisit | 0 | 0 | 0 | 17 | 0 | 0 | 0 |
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-uDDF-u | setupVisit | 0 | 0 | 0 | 192 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-u | RunTrim | 0 | 0 | 0 | 4224 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-u | RunRaytrace | 0 | 0 | 0 | 16593 | 0 | 270 | 0 |
| DC2-R1-2p-uDDF-u | finishVisit | 0 | 0 | 0 | 154 | 0 | 0 | 0 |
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-uDDF-g | setupVisit | 0 | 0 | 0 | 138 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-g | RunTrim | 0 | 0 | 0 | 3036 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-g | RunRaytrace | 0 | 0 | 0 | 7901 | 0 | 4234 | 0 |
| DC2-R1-2p-uDDF-g | finishVisit | 0 | 0 | 0 | 72 | 0 | 0 | 0 |
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-uDDF-r | setupVisit | 0 | 0 | 0 | 138 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-r | RunTrim | 0 | 0 | 0 | 3036 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-r | RunRaytrace | 0 | 0 | 0 | 8227 | 0 | 3902 | 0 |
| DC2-R1-2p-uDDF-r | finishVisit | 0 | 0 | 0 | 50 | 0 | 0 | 0 |
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-uDDF-i | setupVisit | 0 | 0 | 0 | 137 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-i | RunTrim | 0 | 0 | 0 | 3014 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-i | RunRaytrace | 0 | 0 | 0 | 8996 | 0 | 3047 | 0 |
| DC2-R1-2p-uDDF-i | finishVisit | 0 | 0 | 0 | 63 | 0 | 0 | 0 |
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-uDDF-z | setupVisit | 0 | 0 | 0 | 136 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-z | RunTrim | 0 | 0 | 0 | 2992 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-z | RunRaytrace | 0 | 0 | 0 | 8896 | 0 | 3034 | 0 |
| DC2-R1-2p-uDDF-z | finishVisit | 0 | 0 | 0 | 55 | 0 | 0 | 0 |
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-uDDF-y | setupVisit | 0 | 0 | 0 | 135 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-y | RunTrim | 0 | 0 | 0 | 2970 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-y | RunRaytrace | 0 | 0 | 0 | 8199 | 0 | 3687 | 0 |
| DC2-R1-2p-uDDF-y | finishVisit | 0 | 0 | 0 | 71 | 0 | 0 | 0 |
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
Elapsed time for workflow database queries = 25.947135 sec
[Later - Oops, I was a little optimistic with the above screen dump. When all the dust settled (due in part to an operator error on my part), there are actually only 907 fully complete visits.]
Updated numbers:
2018-04-27 08:50:03.515735 Pacific
(Note: jobs in the "QUEUED" state are not tallied, and the "ALL" column is currently unreliable)
=======Task and (batch) Process-step Rollup==============================
+-----------+-----------+-----------+-----------+----------+--------------+-------+
| WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|-----------+-----------+-----------+-----------+----------+--------------+-------|
| 0 | 60642 | 0 | 145054 | 0 | 0 | 0 |
+-----------+-----------+-----------+-----------+----------+--------------+-------+
=======Task Rollup===============================================
+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| setupVisit | 0 | 0 | 0 | 2001 | 0 | 0 | 0 |
| RunTrim | 0 | 0 | 0 | 44022 | 0 | 0 | 0 |
| RunRaytrace | 0 | 60642 | 0 | 98124 | 0 | 0 | 0 |
| finishVisit | 0 | 0 | 0 | 907 | 0 | 0 | 0 |
+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
======================================================
Total rolled back process steps = 310380
=======Detailed Task Summary======================================
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-WFD-u | setupVisit | 0 | 0 | 0 | 67 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-u | RunTrim | 0 | 0 | 0 | 1474 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-u | RunRaytrace | 0 | 348 | 0 | 4473 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-u | finishVisit | 0 | 0 | 0 | 58 | 0 | 0 | 0 |
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-WFD-g | setupVisit | 0 | 0 | 0 | 91 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-g | RunTrim | 0 | 0 | 0 | 2002 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-g | RunRaytrace | 0 | 0 | 0 | 6118 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-g | finishVisit | 0 | 0 | 0 | 91 | 0 | 0 | 0 |
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-WFD-r | setupVisit | 0 | 0 | 0 | 245 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-r | RunTrim | 0 | 0 | 0 | 5390 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-r | RunRaytrace | 0 | 2562 | 0 | 15912 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-r | finishVisit | 0 | 0 | 0 | 185 | 0 | 0 | 0 |
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-WFD-i | setupVisit | 0 | 0 | 0 | 223 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-i | RunTrim | 0 | 0 | 0 | 4906 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-i | RunRaytrace | 0 | 6774 | 0 | 9469 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-i | finishVisit | 0 | 0 | 0 | 102 | 0 | 0 | 0 |
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-WFD-z | setupVisit | 0 | 0 | 0 | 247 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-z | RunTrim | 0 | 0 | 0 | 5434 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-z | RunRaytrace | 0 | 15350 | 0 | 2749 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-z | finishVisit | 0 | 0 | 0 | 4 | 0 | 0 | 0 |
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-WFD-y | setupVisit | 0 | 0 | 0 | 252 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-y | RunTrim | 0 | 0 | 0 | 5544 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-y | RunRaytrace | 0 | 17434 | 0 | 591 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-y | finishVisit | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-uDDF-u | setupVisit | 0 | 0 | 0 | 192 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-u | RunTrim | 0 | 0 | 0 | 4224 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-u | RunRaytrace | 0 | 270 | 0 | 16593 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-u | finishVisit | 0 | 0 | 0 | 154 | 0 | 0 | 0 |
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-uDDF-g | setupVisit | 0 | 0 | 0 | 138 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-g | RunTrim | 0 | 0 | 0 | 3036 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-g | RunRaytrace | 0 | 4234 | 0 | 7901 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-g | finishVisit | 0 | 0 | 0 | 72 | 0 | 0 | 0 |
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-uDDF-r | setupVisit | 0 | 0 | 0 | 138 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-r | RunTrim | 0 | 0 | 0 | 3036 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-r | RunRaytrace | 0 | 3902 | 0 | 8227 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-r | finishVisit | 0 | 0 | 0 | 50 | 0 | 0 | 0 |
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-uDDF-i | setupVisit | 0 | 0 | 0 | 137 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-i | RunTrim | 0 | 0 | 0 | 3014 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-i | RunRaytrace | 0 | 3047 | 0 | 8996 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-i | finishVisit | 0 | 0 | 0 | 63 | 0 | 0 | 0 |
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-uDDF-z | setupVisit | 0 | 0 | 0 | 136 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-z | RunTrim | 0 | 0 | 0 | 2992 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-z | RunRaytrace | 0 | 3034 | 0 | 8896 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-z | finishVisit | 0 | 0 | 0 | 55 | 0 | 0 | 0 |
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-uDDF-y | setupVisit | 0 | 0 | 0 | 135 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-y | RunTrim | 0 | 0 | 0 | 2970 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-y | RunRaytrace | 0 | 3687 | 0 | 8199 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-y | finishVisit | 0 | 0 | 0 | 71 | 0 | 0 | 0 |
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
Elapsed time for workflow database queries = 30.031743 sec
A random data point: looking at a single (random) WFD y-band run that consumed ~41 hours of clock time, here is the roll-up, by photon source category, of the number of photons and the elapsed time to simulate them.
Type | Total photons | Photon fraction | Total time (s) | % time |
---|---|---|---|---|
Astro Objects | 655,711,741 | 2.66E-04 | 3443.15 | 2.34% |
Zodiacal | 71,261,379,555 | 0.03 | 15020.15 | 10.21% |
Moon-Mie | 422,991,262,780 | 0.17 | 16034.5 | 10.90% |
Moon-Rayleigh | 1,342,127,786,118 | 0.54 | 28192.24 | 19.16% |
Airglow | 603,463,483,218 | 0.24 | 78302.19 | 53.22% |
Cosmic Rays | 23,524,903,800 | 0.01 | 6142.38 | 4.17% |
Total | 2,464,024,527,212 | 1.00 | 147134.61 | 100% |
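The fractional columns follow directly from the totals; a quick sanity check (values copied from the table above):

```python
rows = [  # (source, photons, seconds)
    ("Astro Objects",      655711741,  3443.15),
    ("Zodiacal",         71261379555, 15020.15),
    ("Moon-Mie",        422991262780, 16034.50),
    ("Moon-Rayleigh", 1342127786118,  28192.24),
    ("Airglow",        603463483218,  78302.19),
    ("Cosmic Rays",     23524903800,   6142.38),
]
tot_phot = sum(n for _, n, _ in rows)
tot_time = sum(t for _, _, t in rows)
for name, n, t in rows:
    print("%-14s frac_phot=%.2e  frac_time=%5.2f%%"
          % (name, n / float(tot_phot), 100.0 * t / tot_time))
```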
A ~7-day, 720-node reservation on Cori-KNL begins today. During this reservation, all remaining sensor-visits will be attempted.
$ scontrol show res desc-run1.2p
ReservationName=desc-run1.2p
StartTime=2018-05-01T10:00:00
EndTime=2018-05-08T06:59:00
Duration=6-20:59:00
Nodes=nid[02688-02703,02708-02763,02768-02769,02771-02827,02836-02891,02896-02924,02926-02959,02964-03002,03004-03011,03013-03019,03024-03056,10743,12288-12456,12458-12599,12601-12671]
NodeCnt=720
CoreCnt=48960
Features=knl
PartitionName=regular
Flags=
TRES=cpu=195840
Users=glanzman,descpho,desc,descdm
Accounts=m1727
Licenses=(null)
State=INACTIVE
BurstBuffer=(null)
Watts=n/a
12:45 status: Off to a rough start. Immediately at 10:00, 700+ Pilot jobs began running. But it seems the default version of Python at NERSC moved from v2 to v3 very recently. The workflow Python is v2, so jobs began crashing in massive numbers - and very rapidly. Due to the load on the workflow engine, this disaster was not apparent until ~5 minutes into the run. By 10:17, the .py scripts had all been patched and put into production. Even now, almost 2-1/2 hours later, the workflow engine is still catching up. But on the positive side, there are >24k jobs running on >700 nodes.
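In hindsight, a guard of roughly this form at the top of each workflow script would have made the failure mode obvious immediately (a sketch, not the actual patch):

```python
import sys

# The workflow code is Python 2; fail fast and loudly if the default
# interpreter has silently moved to Python 3.
if sys.version_info[0] != 2:
    sys.exit("ERROR: workflow requires Python 2, found %d.%d -- "
             "load the correct python module before launching Pilots"
             % sys.version_info[:2])
```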
Jobs on all 720 nodes were running smoothly until this morning, when a cluster of terminated jobs suddenly appeared. After some investigation, the only clue from the Pilot is that 'srun' claimed an out-of-memory problem. (Note: I have never seen an indication of a memory leak with phoSim, and its memory footprint is very stable.) This problem affected exactly 10 nodes, or 671 raytrace jobs (as of 11:30 PDT). I have modified the Pilot parameters to allocate slightly less memory, to see whether that has any effect, and then resubmitted ten 1-node Pilot jobs.
11:40 status:
15:45 status: Another six pilots have spontaneously terminated. Cause is under investigation...
The long-running jobs continue to make progress and we are now clearly seeing positive results of running within a 7-day reservation.
The problem of Pilots crashing is now suspected to be memory exhaustion on the KNL nodes. Ongoing discussion as to whether this is memory consumption by user code (workflow Pilot and/or phoSim raytrace), or a system process, or some other problem on the afflicted node. Some memory monitoring has been installed in the workflow. To date, approx. 6% of the Pilot jobs have crashed.
A second problem arose when the 'descpho' pseudo-account's sub-allocation became exhausted, preventing any new jobs from being submitted. Thanks to Richard for quickly fixing!
13:55 status:
Smooth running since Friday. We are now slightly less than two days from the end of the 7-day reservation, and it is possible to make a prediction. Of the >60,000 sensor-visits in the queue last Tuesday morning, approx. 3,900 have not yet started to run, so it is highly unlikely they will complete before the upcoming Tuesday cut-off. This represents a lower limit on the estimate of uncompleted jobs, as some fraction of the currently running jobs may not complete.
07:10 status:
15:10 update: A new round of spontaneous Pilot crashes has started. At this time, six (6) Pilot jobs have been killed in the past hour or so.
20:20 update: The latest round of Pilot failures, which are continuing, appear to be correlated to communications problems between the Pilot and the jobControlDaemon. Experts notified.
The 7-day/700-node reservation is beginning to wind down in preparation for tomorrow morning's completion. A summary of the entire reservation along with a post mortem of the three primary hiccups encountered over the past week will be reported in this log tomorrow.
09:00 update:
The 7-day/700-node reservation ended this morning. (This log entry will be updated later today.)
07:30 status:
Reservation summary:
- 60,642 (mostly long-running) sensor-visits remained in Run 1.2p prior to the reservation
- 50,754 sensor-visits completed during the reservation (84% of target)
There were basically four hiccups during this reservation that resulted in some lost time.
1. Python 3 was unintentionally being invoked at the start.
2. A cluster of "out-of-memory" conditions terminated 6% of the running Pilots.
3. The descpho account ran out of its sub-allocation.
4. Episodes of RMI communications problems occurred between the Pilot and the PilotJobControl daemon.
The first problem cropped up for unknown reasons, but the fix was trivial (a "module load" of the correct Python version). In retrospect, limited episodes of this symptom had occurred the previous week, but I did not understand the root cause at the time.
The second problem continues to be investigated via a NERSC ticket (although I am not hopeful we will ever gain a proper understanding of the root cause -- or the solution). Fortunately, this problem is episodic and at a fairly low level; it seems to disappear whenever the experts focus attention on it. I have added some memory monitoring to all phases of the operation, including the Pilot and all phoSim executables, which now report their memory high-water marks. Given the rock-steady memory use of phoSim raytrace (~2.2 GB) and the minimal memory consumption of the Pilot (<200 MB), I cannot understand how an aggregate memory consumption of ~71 GB could ever get near the ~94 GB limit on Cori-KNL. This remains a mystery.
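The high-water-mark reporting works along roughly these lines (a sketch, not the actual workflow code; assumes Linux, where ru_maxrss is reported in kilobytes):

```python
import resource

def log_memory_highwater(tag):
    """Print the peak resident set size of this process so far."""
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # kB on Linux
    print("%s: memory high-water mark = %.2f GB" % (tag, peak_kb / 1048576.0))

log_memory_highwater("raytrace wrapper")
```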
The third problem was quickly solved, once identified. In the future, the account's sub-allocation should be monitored via NERSC's "NIM" web application.
The fourth problem is being investigated by Tony Johnson. It is hoped that it can eventually be understood and fixed. During the investigation it was noticed that the PilotJobControl daemon was generating log messages at an extremely high rate. This is likely due to the unprecedentedly large number of simultaneously running workflow jobs (>24,000). Tony's time is 150% LSST-Camera, so getting his help on this will be a struggle.
Here is what the Run 1.2p NERSC/Cori utilization looks like so far:
Execution (wall clock) time for the time-consuming raytrace step is presented in 12 plots, one for each of the survey/band components of Run 1.2p. An interesting fact: there are jobs in the uDDF g-band that extend to ~10,000 minutes (~7 days), so there is no reason to expect that all remaining jobs would complete within even that period.
Recent work by Adrian Pope and Tom Uram at Argonne motivated the release of phoSim v3.7.14, which has improved thread scaling. Recall that DC2 production -- so far -- has been limited to 8 threads in the raytrace step due to internal code inefficiencies when using more threads [ref work last Fall by Glenn and myself]. As a performance and validation check, a single DC2 Run 1.2p visit has been run with the latest phoSim code. This note summarizes the tests.
The executive summary is that the code changes do, indeed, allow for efficient running of many threads, thereby enabling us to attempt recovering the ~10,000 remaining sensor-visits in DC2 Run 1.2p. There are some differences between the 8-thread (v3.7.9) and 54-thread (v3.7.14) images. Although these differences do not seem qualitatively significant, it would be good if others could take a more quantitative look.
**Performance**

The phoSim v3.7.14 code was built at NERSC for KNL using the Intel compiler, exactly as v3.7.9 was. The phoSim command files and the instanceCatalog generation were identical to earlier production running. The selected test visit was 7th in sequence for the WFD y-band survey. This particular visit contains 26 sensor-visits, of which nine (9) were terminated due to running out of time. Whereas the original DC2 run was configured such that each KNL node ran 34 instances of 8-thread raytrace, the v3.7.14 test was configured to run 5 instances of 54-thread raytrace. The visit (obsHistID) is 12449. The following table shows the elapsed clock time for the new and old runs (in hours), followed by the final status of the job step ("TERMINATED" means the Pilot job ran out of time), and finally the ratio of elapsed times.
Stream | v3.7.14 (hrs) | Status | v3.7.9 (hrs) | Status | v3.7.9 / v3.7.14 |
---|---|---|---|---|---|
6.13.0 | 4.49 | SUCCESS | 56.41 | SUCCESS | 12.56 |
6.13.1 | 4.87 | SUCCESS | 60.72 | SUCCESS | 12.47 |
6.13.2 | 5.20 | SUCCESS | 61.79 | SUCCESS | 11.89 |
6.14.0 | 6.87 | SUCCESS | 64.34 | SUCCESS | 9.37 |
6.18.0 | 7.04 | SUCCESS | 46.79 | TERMINATED | 6.65 |
6.18.1 | 7.73 | SUCCESS | 64.88 | SUCCESS | 8.39 |
6.18.2 | 6.71 | SUCCESS | 58.37 | SUCCESS | 8.70 |
6.18.3 | 6.62 | SUCCESS | 39.50 | TERMINATED | 5.97 |
6.18.4 | 11.52 | SUCCESS | 62.51 | SUCCESS | 5.43 |
6.18.5 | 6.79 | SUCCESS | 65.69 | SUCCESS | 9.68 |
6.18.6 | 11.46 | SUCCESS | 56.78 | SUCCESS | 4.95 |
6.18.7 | 7.21 | SUCCESS | 42.28 | TERMINATED | 5.86 |
6.18.8 | 11.36 | SUCCESS | 46.88 | TERMINATED | 4.13 |
6.19.0 | 5.52 | SUCCESS | 65.65 | SUCCESS | 11.89 |
6.19.1 | 7.03 | SUCCESS | 59.52 | SUCCESS | 8.47 |
6.19.2 | 6.69 | SUCCESS | 64.18 | SUCCESS | 9.60 |
6.19.3 | 7.36 | SUCCESS | 46.86 | TERMINATED | 6.37 |
6.19.4 | 6.73 | SUCCESS | 46.88 | TERMINATED | 6.97 |
6.19.5 | 11.51 | SUCCESS | 63.08 | SUCCESS | 5.48 |
6.19.6 | 6.60 | SUCCESS | 60.41 | SUCCESS | 9.16 |
6.19.7 | 5.75 | SUCCESS | 46.87 | TERMINATED | 8.16 |
6.20.0 | 11.52 | SUCCESS | 60.94 | SUCCESS | 5.29 |
6.20.1 | 7.56 | SUCCESS | 46.79 | TERMINATED | 6.19 |
6.20.2 | 6.46 | SUCCESS | 46.43 | TERMINATED | 7.18 |
6.20.3 | 6.36 | SUCCESS | 60.95 | SUCCESS | 9.59 |
6.20.4 | 4.71 | SUCCESS | 47.29 | SUCCESS | 10.03 |
Clearly, the new code successfully allows a large number of threads to be used efficiently. This should, if the images pass a final validation, allow us to complete the remaining ~10,000 or so sensor-visits within the normal NERSC 48-hour batch queue.
Interestingly, the elapsed-time ratio is far from constant, ranging from 4 to >12. It is known that random number sequences will differ when the threading changes (and, in fact, even when the threading remains the same, if >1 thread is used), but a factor of ~3 spread seems excessive. There is not yet an explanation for these differences.
**Comparison of results**

Two simple methods of comparison were carried out: a visual "blink" comparison of the two images in ds9, and a comparison of the FITS headers using the 'fdiff' tool. These tests were applied to a single sensor-visit: R43_S22.
The ds9 test involved setting the scale to be the same for the two images, fiddling with the color mapping, and then "blinking" them against each other. The qualitative result is that the basic sky images appear identical (there are occasional cosmic rays in one image but not the other, due to random-number differences). The overall sky brightness is slightly different. A view of the two images "tiled" is shown below.
The 'fdiff' tool showed the expected differences, but also some that may not be expected. A full listing of the header differences is presented below. Perhaps John or others can comment on the Zernike parameter differences: how they come about, and whether they are reasonable.
FDIFF V1.1.4
lsst_e_12449_f5_R43_S22_E000.fits.gz[0] < ../../../test-WFD-y/output/000006/lsst_e_12449_f5_R43_S22_E000.fits.gz[0] >
< card 65: AIRMASS = 1.03033619727997
> card 65: AIRMASS = 1.03033587642677
---
< card 71: AZIMUTH = 266.804918891532
> card 71: AZIMUTH = 266.804944703656
---
< card 446: BOD 0 3 = -2.733317E-05
> card 446: BOD 0 3 = 1.230041E-05
---
< card 447: BOD 0 4 = 1.69567E-06
> card 447: BOD 0 4 = 1.162446E-05
---
< card 448: BOD 0 5 = 1.481025E-05
> card 448: BOD 0 5 = 4.973081E-06
---
< card 452: BOD 1 3 = -3.077166E-05
> card 452: BOD 1 3 = -1.925894E-05
---
< card 453: BOD 1 4 = -4.57326E-05
> card 453: BOD 1 4 = 5.064447E-05
---
< card 454: BOD 1 5 = -9.247979E-05
> card 454: BOD 1 5 = 1.761588E-07
---
< card 458: BOD 2 3 = -2.733317E-05
> card 458: BOD 2 3 = 1.230041E-05
---
< card 459: BOD 2 4 = 1.69567E-06
> card 459: BOD 2 4 = 1.162446E-05
---
< card 460: BOD 2 5 = 1.481025E-05
> card 460: BOD 2 5 = 4.973081E-06
---
< card 633: BRANCH = '/global/common/software/lsst/cori-haswell-intel/phosim/v3.7.9'
> card 633: BRANCH = '/global/common/software/lsst/cori-haswell-intel/phosim/v3.7.14'
---
< card 15: CD1_1 = -5.29596186877207E-05
> card 15: CD1_1 = -5.29596138098411E-05
---
< card 16: CD1_2 = 1.67838774285813E-05
> card 16: CD1_2 = 1.67838928201687E-05
---
< card 17: CD2_1 = -1.67838774285813E-05
> card 17: CD2_1 = -1.67838928201687E-05
---
< card 18: CD2_2 = -5.29596186877207E-05
> card 18: CD2_2 = -5.29596138098411E-05
---
< card 611: FPFILE = '/global/common/software/lsst/cori-haswell-intel/phosim/v3.7.9/data/l'
> card 611: FPFILE = '/global/common/software/lsst/cori-haswell-intel/phosim/v3.7.14/data/'
---
< card 37: IMGDIR = '/global/common/software/lsst/cori-haswell-intel/phosim/v3.7.9/data/i'
> card 37: IMGDIR = '/global/common/software/lsst/cori-haswell-intel/phosim/v3.7.14/data/'
---
< card 68: ROTANGZ = 197.584366383143
> card 68: ROTANGZ = 197.584383034944
---
< card 36: SEDDIR = '/global/cscratch1/sd/descpho/Pipeline-tasks/DC2-R1-2p-WFD-y/000006/S'
> card 36: SEDDIR = '/global/cscratch1/sd/descpho/Pipeline-tasks/test-WFD-y/000006/SEDs'
---
< card 70: ZENITH = 13.9381256984265
> card 70: ZENITH = 13.9380538056427
---
< card 82: ZER 0 3 = -2.733317E-05
> card 82: ZER 0 3 = 1.230041E-05
---
< card 83: ZER 0 4 = 1.69567E-06
> card 83: ZER 0 4 = 1.162446E-05
---
< card 84: ZER 0 5 = 1.481025E-05
> card 84: ZER 0 5 = 4.973081E-06
---
< card 85: ZER 0 6 = 9.755061E-06
> card 85: ZER 0 6 = -1.523945E-05
---
< card 86: ZER 0 7 = 1.505774E-05
> card 86: ZER 0 7 = 6.204145E-06
---
< card 87: ZER 0 8 = 1.281214E-06
> card 87: ZER 0 8 = -1.903451E-05
---
< card 88: ZER 0 9 = -6.040004E-06
> card 88: ZER 0 9 = 1.590424E-06
---
< card 89: ZER 010 = 1.232056E-05
> card 89: ZER 010 = 6.291264E-06
---
< card 90: ZER 011 = 1.867872E-06
> card 90: ZER 011 = -5.081734E-07
---
< card 91: ZER 012 = 7.503871E-06
> card 91: ZER 012 = 7.349473E-06
---
< card 92: ZER 013 = -1.807692E-06
> card 92: ZER 013 = -1.243054E-06
---
< card 93: ZER 014 = 1.78905E-06
> card 93: ZER 014 = -1.022187E-05
---
< card 94: ZER 015 = 8.361695E-06
> card 94: ZER 015 = -2.125326E-06
---
< card 95: ZER 016 = -2.092621E-06
> card 95: ZER 016 = -6.186138E-06
---
< card 96: ZER 017 = -9.316427E-06
> card 96: ZER 017 = -1.349618E-07
---
< card 97: ZER 018 = 2.991919E-06
> card 97: ZER 018 = 1.25934E-05
---
< card 98: ZER 019 = 3.925723E-06
> card 98: ZER 019 = 1.487202E-06
---
< card 99: ZER 020 = -4.796312E-06
> card 99: ZER 020 = 2.770873E-06
---
< card 110: ZER 1 3 = -3.077166E-05
> card 110: ZER 1 3 = -1.925894E-05
---
< card 111: ZER 1 4 = -4.57326E-05
> card 111: ZER 1 4 = 5.064447E-05
---
< card 112: ZER 1 5 = -9.247979E-05
> card 112: ZER 1 5 = 1.761588E-07
---
< card 113: ZER 1 6 = 5.627657E-06
> card 113: ZER 1 6 = -3.831325E-06
---
< card 114: ZER 1 7 = -1.3229E-05
> card 114: ZER 1 7 = 5.332543E-07
---
< card 115: ZER 1 8 = -3.95235E-05
> card 115: ZER 1 8 = 1.436844E-06
---
< card 116: ZER 1 9 = 2.477007E-05
> card 116: ZER 1 9 = -1.273148E-05
---
< card 117: ZER 110 = -1.249233E-05
> card 117: ZER 110 = 9.565704E-06
---
< card 118: ZER 111 = 2.938097E-06
> card 118: ZER 111 = -2.556876E-06
---
< card 119: ZER 112 = -1.971055E-05
> card 119: ZER 112 = 9.10336E-06
---
< card 120: ZER 113 = 1.15437E-06
> card 120: ZER 113 = 1.93246E-06
---
< card 121: ZER 114 = 5.489848E-06
> card 121: ZER 114 = -4.240682E-06
---
< card 122: ZER 115 = -6.689914E-06
> card 122: ZER 115 = -7.73816E-07
---
< card 123: ZER 116 = -4.706865E-06
> card 123: ZER 116 = -3.688771E-06
---
< card 124: ZER 117 = -1.715983E-06
> card 124: ZER 117 = 4.391131E-06
---
< card 125: ZER 118 = 9.166808E-06
> card 125: ZER 118 = -1.177108E-05
---
< card 126: ZER 119 = -1.678203E-06
> card 126: ZER 119 = 4.99118E-06
---
< card 127: ZER 120 = 1.333635E-05
> card 127: ZER 120 = 1.473233E-05
---
< card 138: ZER 2 3 = -2.733317E-05
> card 138: ZER 2 3 = 1.230041E-05
---
< card 139: ZER 2 4 = 1.69567E-06
> card 139: ZER 2 4 = 1.162446E-05
---
< card 140: ZER 2 5 = 1.481025E-05
> card 140: ZER 2 5 = 4.973081E-06
---
< card 141: ZER 2 6 = 9.755061E-06
> card 141: ZER 2 6 = -1.523945E-05
---
< card 142: ZER 2 7 = 1.505774E-05
> card 142: ZER 2 7 = 6.204145E-06
---
< card 143: ZER 2 8 = 1.281214E-06
> card 143: ZER 2 8 = -1.903451E-05
---
< card 144: ZER 2 9 = -6.040004E-06
> card 144: ZER 2 9 = 1.590424E-06
---
< card 145: ZER 210 = 1.232056E-05
> card 145: ZER 210 = 6.291264E-06
---
< card 146: ZER 211 = 1.867872E-06
> card 146: ZER 211 = -5.081734E-07
---
< card 147: ZER 212 = 7.503871E-06
> card 147: ZER 212 = 7.349473E-06
---
< card 148: ZER 213 = -1.807692E-06
> card 148: ZER 213 = -1.243054E-06
---
< card 149: ZER 214 = 1.78905E-06
> card 149: ZER 214 = -1.022187E-05
---
< card 150: ZER 215 = 8.361695E-06
> card 150: ZER 215 = -2.125326E-06
---
< card 151: ZER 216 = -2.092621E-06
> card 151: ZER 216 = -6.186138E-06
---
< card 152: ZER 217 = -9.316427E-06
> card 152: ZER 217 = -1.349618E-07
---
< card 153: ZER 218 = 2.991919E-06
> card 153: ZER 218 = 1.25934E-05
---
< card 154: ZER 219 = 3.925723E-06
> card 154: ZER 219 = 1.487202E-06
---
< card 155: ZER 220 = -4.796312E-06
> card 155: ZER 220 = 2.770873E-06
---
These two data units are not identical.
For reference, the command file and instanceCatalog files are located here for the production and test runs:
/global/cscratch1/sd/descpho/Pipeline-tasks/DC2-R1-2p-WFD-y/000006
/global/cscratch1/sd/descpho/Pipeline-tasks/test-WFD-y/000006
The image files (both electron and amplifier) are located here:
/global/projecta/projectdirs/lsst/production/DC2/DC2-R1-2p-WFD-y/output/000006
/global/projecta/projectdirs/lsst/production/DC2/test-WFD-y/output/000006
yeah, those parameters will change if you use a different version, because it will be a new realization. so this is fine.
In the test of v3.7.14, were there always 5 instances of phosim running per KNL node, or would there possibly have been fewer than 5 instances running on the node toward the end when trying to finish up the last jobs? I was wondering if having fewer than 5 active phosims at some point in a sensor visit could make it look artificially fast.
John - I'm curious about the possible slight sky brightness variation. Would different realizations have slightly different shutter opening times? Slightly different realizations of the atmospheric effects that affect brightness?
Adrian, yes, there were times when there were fewer than five instances per node. The times quoted are elapsed clock time per instance. Are you suggesting that the presence of other instances running on the same node might slow down the processing?
I think there are several effects that could change the wall clock of a particular phosim instance depending on how much total work is active on a node.
When a compute node is doing more total work, there will be less opportunity to use turbo mode and boost the clock frequency of compute cores. This is probably a ~20% effect.
I'm not exactly sure how pthreads work with the compute node kernel, slurm, and affinity settings, but it is possible that pthreads are not "pinned" to a particular set of cores/HW-threads, and they may be allowed to migrate around the hardware resources. In this case, if there are fewer than 5 phosim instances running on a node, the pthreads from the remaining instances might be allowed to "spread out" and run a bit more quickly (though possibly less efficiently overall). I don't know as much about the runtime rules for pthreads on XC40/KNL as I do about OpenMP threads, so I'm trying to chat with Intel folks to learn more about how to control and measure this.
I'm also not exactly sure how the workflow works, but I had assumed that it tried to keep the same number of phosim instances active on a compute node until hitting the wall clock limit, e.g., when one phosim instance on a node finishes, the workflow finds the next sensor visit from a task queue and launches it on the compute node, so very little time is spent with fewer than the expected number of phosim instances active.
In my (simple) performance testing, I've been replicating the same phosim sensor visit N times on a single KNL node. So I'm trying to keep N phosim instances active for the same amount of wall clock, and trying to replicate roughly the same average throughput as my understanding of the workflow. I suspect there will be some difference in effective throughput during the single-threaded regions, but I'm hoping that difference is relatively small. I would expect the workflow to be slightly more efficient when the single-threaded region of one phosim instance overlaps with multi-threaded regions of the other phosim instances, as the multi-threaded regions of the other instances might get slightly more done when there are some extra resources available.
Folks, I'd like to move this discussion out of the log and into a new GitHub issue, #163, dedicated to phoSim v3.7.14.
yes, but there are dozens of other things too. when you switch to a new version it's as if you went into a brand new universe. the random parts will get re-realized.
we know this is a small effect, however, Adrian, because as we previously found, even going well over the number of available cores has a negligible overhead. so if you load 400 threads across multiple instances and there are only 288 cores per node, then you will get jobs completing a factor 288/400 slower than if you filled 288 exactly, so there isn't much penalty.
but we should do all tests with a single node fully loaded like last summer.
john
Final preparations are being made to begin reprocessing the 9,888 long-running (and timed-out) phosim (raytrace) jobs. This mid-course correction contains two changes:
Because the number of threads is deeply embedded in the workflow, several steps were required to update it. For posterity, those steps are summarized here.
Tom-
if you switch to a different version, you'll have to run all the jobs of the same visit over again.
john
Production of the remaining 9,888 long-running sensor-visits started up earlier today, using phoSim 3.7.14 with 54 threads/instance (and 5 instances/node). This is something of an experiment, to see how well a fully-loaded KNL node handles this configuration. CPU utilization looks great and memory consumption is <15 GB (out of 95 GB). But there have already been about a dozen mysterious failures (sig4/SIGILL and sig11/SIGSEGV), which I'm keeping an eye on. More news tomorrow.
Processing continues... Of the 9,888 long-running jobs, here is the status as of 14:35:
- Jobs successfully completed: 3,736 (38% of target)
- Jobs currently running: 1,752 (18%)
There continue to be rather bizarre crashes at the few-percent level. It turns out that all of these oddball failures occur on exactly four KNL nodes (nid05377, nid05594, nid02566, nid02620), so I've submitted a ticket with the NERSC folks.
In addition, there was a rash of 23 crashes at 11:13 this morning -- all associated with the perennial problem of reading the (phoSim config) focalplanelayout.txt file. This is being treated as a transient file-system issue, so these have been rolled back and I expect they may well succeed next time around.
(On the road, so this will be brief.) The first round of 48-hour jobs completed this morning, and the result of this first attempt was that >50% of the remaining sensor-visits completed successfully. Status as of 15:20:
Production continues - and with a refreshing lack of failures! Perhaps the reported machines have been taken offline... Status as of 20:50:
With only 2,319 sensor-visits remaining (1.5% of the total in Run 1.2p), production may complete tomorrow (Sunday).
The current round of DC2 Run 1.2p processing, the "long-running job campaign", is nearly over, with some mop-up and a relatively small number of extra-long-running jobs to deal with. Here is the status of the long-running job campaign as of 07:00 today:
During the long-running job campaign there were, as described previously, a number of abnormal crashes. The 25 remaining jobs were then restarted on Pilots that had <<48 hours remaining, and they did not complete. A set of fresh Pilots has now been submitted and is running to mop up these few jobs.
Overall, the status of Run 1.2p now looks like this:
Recovery of the remaining 17 visits will require additional workflow tweaks to allow for a larger number of threads. Not knowing how much longer these jobs need, my plan is to raise the number of threads from 54 (1/5 of a KNL node) to either 90 (1/3 of a KNL node) or to 135 (1/2 a KNL node), in both cases leaving a thread or two for the OS.
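For reference, a Cori KNL node exposes 272 hardware threads (68 cores x 4 hyperthreads), so each option in this plan fills the node while leaving a couple of threads free:

```python
HW_THREADS = 68 * 4  # Cori KNL: 68 cores x 4 hyperthreads = 272
for threads, instances in [(54, 5), (90, 3), (135, 2)]:
    used = threads * instances
    print("%3d threads x %d instances = %3d used, %d left for the OS"
          % (threads, instances, used, HW_THREADS - used))
```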
All "long-running" jobs are now complete. Only 588 "extra long-running" jobs (sensor-visits) remain as indicated in the table. These jobs exceeded 48-hours x 54 threads on Cori KNL.
Survey/Band | # sensor-visits remaining | # visits affected |
---|---|---|
WFD u-band | 334 | 4 |
WFD r-band | 37 | 1 |
WFD y-band | 216 | 3 |
uDDF g-band | 1 | 1 |
TOTAL | 588 | 9 |
Status as of 09:30:
Afternoon updates: 1) All 588 extra-long-running jobs are now running once again, this time with 90 threads (so 3 raytrace instances per KNL node). Hopefully this will take care of the outliers; we will find out in no more than 48 hours. 2) By popular request, all phoSim input files for DC2 Run 1.2p have been archived in NERSC project disk space. Find the data here:
/global/projecta/projectdirs/lsst/production/DC2/Run1.2p/phosimInput
Data are organized by workflow name (e.g., DC2-R1-2p-WFD-y) and a visit index. Each visit directory contains commands.txt (phoSim commands/physics overrides) plus instCat and SEDs directories.
Total space occupied by these files is 904 GB.
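For anyone wanting to poke at the archive, a few standard commands suffice (only the top-level path above is real; the rest is generic):

```bash
ARCHIVE=/global/projecta/projectdirs/lsst/production/DC2/Run1.2p/phosimInput

ls "$ARCHIVE"                                # one directory per workflow, e.g. DC2-R1-2p-WFD-y
du -sh "$ARCHIVE"/*                          # disk usage per workflow
find "$ARCHIVE" -name 'commands.txt' | head  # locate per-visit command files
```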
Saturday 2 June 2018 Update:
Shockingly, of the 588 extra-long-running phoSim (raytrace) steps started two days ago, exactly one successfully completed, and that one was likely a shorter-running job that sneaked onto the list. So what happened? As the 48-hour time limit approached, with no sign of jobs completing, I logged into a handful of KNL nodes to see what was happening. In all cases, the raytrace instances were running in a single thread, although I had previously confirmed that raytrace was running with all 90 threads early on. This means the remaining work was confined to one thread: a single bright source likely caused the extended run times, and increasing the number of threads will likely have no impact, other than to waste even more core hours.
It would be good to know exactly which catalog items are triggering these anomalously long run times. The log files are not much help; they all stall just before reporting the source summary. The phoSim input data (instance catalog, command file, SEDs) are all preserved. Perhaps @johnrpeterson could make a suggestion?
In other news (and this is very preliminary), I've resurrected my old checkpoint test scripts and tried the officially-installed (by NERSC) version of dmtcp (v3.0.0) on phoSim v3.7.14. While phoSim is still in its single-threaded phase, all works as it should, including the creation of checkpoint files on request. Restarting from the checkpoint files also seems to work, although current tests only exercise the mechanics of the system, not the detailed output. However, once phoSim enters its multi-threaded phase, requests to create a checkpoint appear to be totally ignored; no checkpoint files are created and phoSim continues merrily along its way. Bah! The symptom holds for my own builds of dmtcp as well, so it is unlikely to be a bad build. This is being investigated...
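For reference, the basic shape of the dmtcp test is sketched below; the raytrace invocation is a placeholder for the real phoSim step, and the port and file names are arbitrary:

```bash
# Start a DMTCP coordinator in the background on a chosen port.
dmtcp_coordinator --daemon --port 7779

# Launch the target process under DMTCP control.
dmtcp_launch --coord-port 7779 ./raytrace < raytrace.pars &

# Later, ask the coordinator to write checkpoint images (*.dmtcp).
dmtcp_command --coord-port 7779 --checkpoint

# After killing the job, resume from the generated restart script.
./dmtcp_restart_script.sh
```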
Tom-
i'd just check the brightest magnitude in the catalog. i thought we were using a magnitude cut (10?)?
john
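Following John's suggestion, a scan like the one below would surface the brightest entries in an archived instance catalog. This assumes the conventional phoSim layout in which 'object' lines carry magnorm in column 5; the file name is illustrative:

```bash
# List the five brightest (lowest-magnitude) sources in an instance catalog.
grep '^object' instCat/instance_catalog.txt \
  | sort -g -k5,5 \
  | head -5 \
  | awk '{print "id=" $2, "magnorm=" $5}'
```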
Monday 4 June 2018 Final Update:
The long-running jobs noted over the weekend were discovered to contain a very bright (M4.5) AGN. How this happened is being studied by the experts and will hopefully not recur in Run 2.0. Fixing the problem, either by removing the offending sources completely or by manually resetting their magnitudes to "10", seemed risky given the possible effects on downstream processing. Thus, it was decided that DC2 Run 1.2p phoSim production would be officially deemed complete.
The design goal for Run 1.2p is shown in the table at the very top of this issue. The "missing" visits are tabulated below:
Survey/Band | # sensor-visits unfinished | # visits affected |
---|---|---|
WFD u-band | 334 | 4 |
WFD r-band | 37 | 1 |
WFD y-band | 216 | 3 |
TOTAL | 587 | 8 |
After a thrilling 2-month phoSim production run, for Run 1.2p this is [The End](https://www.youtube.com/watch?v=JSUIQgEVDM4).
To be sure, this is THE BEGINNING for the analysis working groups. We hope that you will be able to use the results to design and test data access patterns and pipelines and will help with very important validation tasks. THANKS @TomGlanzman!
good job! and this makes sense now.
The storage footprint for DC2 Run 1.2p is as follows:
Space | Workflow |
---|---|
301G | DC2-R1-2p-WFD-g |
852G | DC2-R1-2p-WFD-i |
951G | DC2-R1-2p-WFD-r |
187G | DC2-R1-2p-WFD-u |
979G | DC2-R1-2p-WFD-y |
977G | DC2-R1-2p-WFD-z |
648G | DC2-R1-2p-uDDF-g |
630G | DC2-R1-2p-uDDF-i |
648G | DC2-R1-2p-uDDF-r |
714G | DC2-R1-2p-uDDF-u |
628G | DC2-R1-2p-uDDF-y |
633G | DC2-R1-2p-uDDF-z |
904G | phoSim inputs (instCat, SEDs) |
9052G | TOTAL |
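A table like the above can be regenerated with a short du loop; the base path matches the production area quoted elsewhere in this issue, but treat the sketch as illustrative:

```bash
BASE=/global/projecta/projectdirs/lsst/production/DC2

# Disk usage per workflow, formatted like the table above.
for wf in "$BASE"/DC2-R1-2p-*; do
  printf '%s | %s |\n' "$(du -sh "$wf" | cut -f1)" "$(basename "$wf")"
done

# Plus the archived phoSim inputs (instCat, SEDs).
du -sh "$BASE"/Run1.2p/phosimInput
```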
The clock-time distributions for the raytrace+e2adc phoSim steps, run with either 8 or 54 threads, are shown in the attached plots, one per workflow task (panel labels follow the task names, e.g., uDDF-y).
For the record, here is a snapshot of the final workflow status:
DC2 Run 1.2p summary
====================
2018-06-14 09:00:11.115332 Pacific
(Note: jobs in the "QUEUED" state are not tallied, and the "ALL" column is currently unreliable)
=======Task and (batch) Process-step Rollup==============================
+-----------+-----------+-----------+-----------+----------+--------------+-------+
| WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|-----------+-----------+-----------+-----------+----------+--------------+-------|
| 0 | 0 | 0 | 206195 | 0 | 587 | 0 |
+-----------+-----------+-----------+-----------+----------+--------------+-------+
=======Task Rollup===============================================
+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| setupVisit | 0 | 0 | 0 | 2001 | 0 | 0 | 0 |
| RunTrim | 0 | 0 | 0 | 44022 | 0 | 0 | 0 |
| RunRaytrace | 0 | 0 | 0 | 158179 | 0 | 587 | 0 |
| finishVisit | 0 | 0 | 0 | 1993 | 0 | 0 | 0 |
+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
======================================================
Total rolled back process steps = 368861
=======Detailed Task Summary======================================
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-WFD-u | setupVisit | 0 | 0 | 0 | 67 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-u | RunTrim | 0 | 0 | 0 | 1474 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-u | RunRaytrace | 0 | 0 | 0 | 4487 | 0 | 334 | 0 |
| DC2-R1-2p-WFD-u | finishVisit | 0 | 0 | 0 | 63 | 0 | 0 | 0 |
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-WFD-g | setupVisit | 0 | 0 | 0 | 91 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-g | RunTrim | 0 | 0 | 0 | 2002 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-g | RunRaytrace | 0 | 0 | 0 | 6118 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-g | finishVisit | 0 | 0 | 0 | 91 | 0 | 0 | 0 |
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-WFD-r | setupVisit | 0 | 0 | 0 | 245 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-r | RunTrim | 0 | 0 | 0 | 5390 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-r | RunRaytrace | 0 | 0 | 0 | 18437 | 0 | 37 | 0 |
| DC2-R1-2p-WFD-r | finishVisit | 0 | 0 | 0 | 244 | 0 | 0 | 0 |
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-WFD-i | setupVisit | 0 | 0 | 0 | 223 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-i | RunTrim | 0 | 0 | 0 | 4906 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-i | RunRaytrace | 0 | 0 | 0 | 16243 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-i | finishVisit | 0 | 0 | 0 | 223 | 0 | 0 | 0 |
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-WFD-z | setupVisit | 0 | 0 | 0 | 247 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-z | RunTrim | 0 | 0 | 0 | 5434 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-z | RunRaytrace | 0 | 0 | 0 | 18099 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-z | finishVisit | 0 | 0 | 0 | 247 | 0 | 0 | 0 |
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-WFD-y | setupVisit | 0 | 0 | 0 | 252 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-y | RunTrim | 0 | 0 | 0 | 5544 | 0 | 0 | 0 |
| DC2-R1-2p-WFD-y | RunRaytrace | 0 | 0 | 0 | 17809 | 0 | 216 | 0 |
| DC2-R1-2p-WFD-y | finishVisit | 0 | 0 | 0 | 249 | 0 | 0 | 0 |
+-----------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-uDDF-u | setupVisit | 0 | 0 | 0 | 192 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-u | RunTrim | 0 | 0 | 0 | 4224 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-u | RunRaytrace | 0 | 0 | 0 | 16863 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-u | finishVisit | 0 | 0 | 0 | 192 | 0 | 0 | 0 |
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-uDDF-g | setupVisit | 0 | 0 | 0 | 138 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-g | RunTrim | 0 | 0 | 0 | 3036 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-g | RunRaytrace | 0 | 0 | 0 | 12135 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-g | finishVisit | 0 | 0 | 0 | 138 | 0 | 0 | 0 |
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-uDDF-r | setupVisit | 0 | 0 | 0 | 138 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-r | RunTrim | 0 | 0 | 0 | 3036 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-r | RunRaytrace | 0 | 0 | 0 | 12129 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-r | finishVisit | 0 | 0 | 0 | 138 | 0 | 0 | 0 |
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-uDDF-i | setupVisit | 0 | 0 | 0 | 137 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-i | RunTrim | 0 | 0 | 0 | 3014 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-i | RunRaytrace | 0 | 0 | 0 | 12043 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-i | finishVisit | 0 | 0 | 0 | 137 | 0 | 0 | 0 |
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-uDDF-z | setupVisit | 0 | 0 | 0 | 136 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-z | RunTrim | 0 | 0 | 0 | 2992 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-z | RunRaytrace | 0 | 0 | 0 | 11930 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-z | finishVisit | 0 | 0 | 0 | 136 | 0 | 0 | 0 |
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
| Task | ProcessStep | WAITING | PENDING | RUNNING | SUCCESS | FAILED | TERMINATED | ALL |
|------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------|
| DC2-R1-2p-uDDF-y | setupVisit | 0 | 0 | 0 | 135 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-y | RunTrim | 0 | 0 | 0 | 2970 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-y | RunRaytrace | 0 | 0 | 0 | 11886 | 0 | 0 | 0 |
| DC2-R1-2p-uDDF-y | finishVisit | 0 | 0 | 0 | 135 | 0 | 0 | 0 |
+------------------+---------------+-----------+-----------+-----------+-----------+----------+--------------+-------+
Elapsed time for workflow database queries = 33.522016 sec
And finally, a plot of Cori utilization (# of Cori-KNL nodes as a function of time) and NERSC-hour consumption over the entire Run 1.2p period. Remember that a single Cori-KNL node contains 68 cores / 272 hardware threads, so each node could support up to ~34 instances of 8-threaded 'raytrace'.
[updated for Run 1.2p] This issue will be a log/diary of the operational progress of the protoDC2 phoSim image generation at NERSC. This issue is not intended to be a venue for discussing phoSim configuration (see, for example, #19, #33, #134, #140 and #163) or results. A few technical details about the workflow itself can be found here.
As data accumulate, you may find the image files in this directory tree (for the WFD field and r-filter):
/global/projecta/projectdirs/lsst/production/DC2/DC2-R1-2p-WFD-r/output
Each subdirectory corresponds to a single visit. The phoSim working directories (in $SCRATCH) are here (again, for the WFD field and r-filter):
/global/cscratch1/sd/descpho/Pipeline-tasks/DC2-R1-2p-WFD-r
and are similarly organized in subdirectories, one per visit. Real-time monitoring of the 12 workflows:
Each field (WFD and uDDF) and band (u,g,r,i,z,y) has a fixed number of visits per the following table.
Approx total sensor-visits = 158,766