Production has started, although it would not be surprising if a problem is discovered that forces a halt and a restart from the beginning. A last-minute decision: disable the phoSim amplifier file output due to problems with those files.
The NERSC queues are not behaving nicely. A single-node 24-hour job submitted on Monday morning is now scheduled to run tomorrow at the earliest. To jump-start the process, a 10-node 8-hour KNL Pilot was submitted this afternoon and, lo!, it has started. There are now >300 raytrace processes running.
A question arose earlier about how long it might take to complete the ~8000 visits in protoDC2. Given that changes were made in the phoSim command file just yesterday, we won't have good data until sufficient sensor-visits have completed. Hopefully tomorrow...
We're off and running!
Some success over the past 12 hours: 1865 sensor-visits completed (mostly on KNL) using the latest phoSim command file (ref #63). Based on these runtime performance statistics, a first estimate of total protoDC2 resource and time consumption may be made.
band | #visits |
---|---|
u | 534 |
g | 776 |
r | 1782 |
i | 1795 |
z | 1612 |
y | 1581 |
Total | 8080 |
Using average values:
8080 visits * 75 sensors/visit = 606,000 sensor-visits to simulate
Raytrace is run with 8 threads/instance and 34 instances/KNL node, completely filling the available 272 hardware threads
Total KNL node-hours = 606,000 sensor-visits * (250 min/sensor-visit) * (1 hr/60 min) * (8 threads) / (272 threads/node) = 74,265 node-hours = 3,094 node-days
Total NERSC-hours = 74,265 node-hours * (96 NERSC-hours/node-hour) = 7.1M NERSC-hours (this assumes no wasted processing -- which will not be the case)
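For reference, the estimate can be reproduced with a few lines of Python (a sketch of the arithmetic above only; the 250 min/sensor-visit figure is the current average, not a fixed parameter):

```python
# Back-of-the-envelope estimate of protoDC2 raytrace cost at NERSC,
# reproducing the numbers quoted above.
VISITS                = 8080   # total visits across all six bands
SENSORS_PER_VISIT     = 75     # average sensors simulated per visit
MIN_PER_SV            = 250    # average raytrace wall time per sensor-visit (minutes)
THREADS_PER_SV        = 8      # raytrace threads per instance
THREADS_PER_NODE      = 272    # hardware threads on a Cori KNL node
NERSC_HRS_PER_NODE_HR = 96     # KNL charge factor

sensor_visits = VISITS * SENSORS_PER_VISIT                      # 606,000
node_hours = sensor_visits * (MIN_PER_SV / 60.0) * THREADS_PER_SV / THREADS_PER_NODE
nersc_hours = node_hours * NERSC_HRS_PER_NODE_HR

print(f"sensor-visits : {sensor_visits:,}")
print(f"node-hours    : {node_hours:,.0f}  ({node_hours/24:,.0f} node-days)")
print(f"NERSC-hours   : {nersc_hours/1e6:.1f} M")
```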
This amount of processing could, under unrealistically ideal conditions, be performed in less than one week using ~1000 nodes. The trick will be keeping jobs running efficiently. The main challenges include:
Getting SLURM jobs to run
There is built-in inefficiency because raytrace instances still running when the SLURM job times out are killed and must be rerun from the beginning
SLURM queue dwell times rise quickly with requested job length, thus making shorter jobs preferable from a dispatch perspective, but exacerbating the terminating job inefficiency
The instanceCatalog generation is delicate and must be throttled so as not to overload the UW server, yet run quickly enough to keep the processing pipeline fed
The SLAC holiday power outage (26th-30th Dec) will impact production
I will be traveling and unable to focus 100% on this project over the coming 2 weeks.
The main r-filter workflow monitor is here. An experimental Pilot job monitor is here.
21:45 update - production is accelerating. All 1782 r-filter visits have been submitted. Current performance graphs for the raytrace step:
As of 07:30 PST, over 19,000 sensor-visits have been completed (representing ~3% of the anticipated 606k total sensor-visits in the current visit lists). The challenge has been keeping SLURM jobs running. Short jobs (8-10 hours) seem to start up within a few hours, but they suffer from serious inefficiency when they end -- taking many partially run raytrace jobs with them. Long jobs (24 hours) take many days in queue before they begin running.
When a block of Cori-KNL nodes does begin to run, it is often accompanied by a set of failed raytrace jobs, which fail when attempting to access one of the phoSim site or instrument files. These jobs fail almost immediately (hence a small impact on efficiency) and can easily be rolled back. My guess is that starting hundreds of jobs simultaneously puts a strain on the connection to the file system. An annoyance, but not (yet) serious. Side note: 50-node KNL jobs are the largest submitted so far. As experience accrues, larger jobs will be submitted.
The plan going forward will be to attempt keeping a mix of short-running and long-running jobs in the cori queues in the hopes of improving overall utilization. A plot of KNL usage vs. time is beginning to take form here: https://portal.nersc.gov/project/lsst/glanzman/graph3.html
13:15 UPDATE: Due to issues described here, all jobs have been cancelled or held pending resolution. Production will eventually be restarted from the beginning
@TomGlanzman Starting hundreds of jobs should not be a "strain on the connection to the file system." In principle, hundreds of jobs is nothing to worry about -- you should file a ticket with NERSC about this. Something is not working correctly.
Updating workflow to reflect changes/fixes since the December run.
Changes:
gcr-catalogs updated from GitHub (master), the pre-New Year version checked out, and the auxiliary file rebuilt:
```
git clone https://github.com/LSSTDESC/gcr-catalogs.git
cd gcr-catalogs
git checkout 204c504bd785fc9127a01c3c5f9a24640b3e7583   # pin to the pre-New Year version
cd GCRCatSimInterface/data
source /global/common/software/lsst/cori-haswell-gcc/stack/setup_w_2017_46_py3_gcc6.sh
setup lsst_sims
python get_sed_mags.py   # rebuild the auxiliary file
```
instanceCatalog generation option changed from '--descqa_cat_file proto-dc2_v2.1.1' to 'protoDC2'
All previous output from the December 2017 trial run (Run 1.0a) has been temporarily moved aside in preparation for deletion. Please let me know if it is necessary to preserve these data from this early attempt.
The initial new data (notionally, Run 1.0b) will be generated from workflows DC2-phoSim-2-r version 1.000 and DC2-phoSim-2-i version 1.000. The first visits in each of these two bands are running now. The visits simulated are the same as those used in December (though only r-band exists from the December run), so a comparison between the January and December images will be possible.
The exact amount of data to be produced is still under discussion, although 3-4 visits in all six bands has been put forward.
Run 1.0b jobs are running, but slowly. Individual sensor-visits are currently taking >240 min (clock time) in the RayTrace step (using 8 threads), so they cannot be run in NERSC's "qos=interactive" service :cry: . The average run time for this step (based on limited statistics) is ~255 minutes (rotten luck!) but with tails extending to 350 min. Therefore, these steps must be done using the normal (non-interactive) batch queue, which means waiting anywhere from many hours up to a week for jobs to start.
An easy way to monitor overall progress is with this Pipeline status page; the number in the "green check" column is the number of successfully completed visits for each filter.
For this run ("Run 1.0b"), five (5) visits for each of the six filters will be simulated, both to test the phoSim configuration, and the downstream pipeline. The new data are populating these directories in NERSC:/global/projecta/projectdirs/lsst/production/DC2:
DC2-phoSim-2-u/output
DC2-phoSim-2-g/output
DC2-phoSim-2-r/output
DC2-phoSim-2-i/output
DC2-phoSim-2-z/output
DC2-phoSim-2-y/output
For each band's 'output' directory, there is one sub-directory per visit. The visit sub-directory name, e.g., 000000, is an index representing its order in the visit catalog. The visitID can be obtained by looking into the visit directory files, e.g., DC2-phoSim-2-r/output/000000 contains:
-rw-rw----+ 1 descpho lsst 886095 Jan 17 09:47 centroid_lsst_e_158370_f1_R01_S02_E000.txt
-rw-rw----+ 1 descpho lsst 837341 Jan 17 09:51 centroid_lsst_e_158370_f1_R02_S22_E000.txt
-rw-rw----+ 1 descpho lsst 1082843 Jan 17 10:10 centroid_lsst_e_158370_f1_R13_S01_E000.txt
-rw-rw----+ 1 descpho lsst 826844 Jan 17 09:44 centroid_lsst_e_158370_f1_R13_S12_E000.txt
-rw-rw----+ 1 descpho lsst 851356 Jan 17 09:35 centroid_lsst_e_158370_f1_R24_S02_E000.txt
-rw-rw----+ 1 descpho lsst 25838119 Jan 17 09:47 lsst_e_158370_f1_R01_S02_E000.fits.gz
-rw-rw----+ 1 descpho lsst 25201205 Jan 17 09:51 lsst_e_158370_f1_R02_S22_E000.fits.gz
-rw-rw----+ 1 descpho lsst 25193942 Jan 17 10:10 lsst_e_158370_f1_R13_S01_E000.fits.gz
-rw-rw----+ 1 descpho lsst 25125607 Jan 17 09:44 lsst_e_158370_f1_R13_S12_E000.fits.gz
-rw-rw----+ 1 descpho lsst 25070875 Jan 17 09:35 lsst_e_158370_f1_R24_S02_E000.fits.gz
The file lsst_e_158370_f1_R01_S02_E000.fits.gz is an image file for visit 158370, using filter 1 ('r'), for sensor R01_S02, and 'snap' 000 (only one snap per visit for this project).
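For anyone scripting over these outputs, the naming convention just described can be unpacked with a small regular expression. A minimal sketch (it simply assumes all image files follow the lsst_e_&lt;visit&gt;_f&lt;filter&gt;_&lt;raft&gt;_&lt;sensor&gt;_E&lt;snap&gt; pattern shown above):

```python
import re

# Pattern for phoSim 'electron' image files, e.g. lsst_e_158370_f1_R01_S02_E000.fits.gz
PATTERN = re.compile(
    r"lsst_e_(?P<visit>\d+)_f(?P<filter>\d)_(?P<raft>R\d\d)_(?P<sensor>S\d\d)_E(?P<snap>\d{3})"
)

def parse_image_name(name):
    """Split a phoSim image filename into visit, filter index, raft, sensor, and snap."""
    m = PATTERN.search(name)
    if m is None:
        raise ValueError("unrecognized phoSim filename: " + name)
    return m.groupdict()

print(parse_image_name("lsst_e_158370_f1_R01_S02_E000.fits.gz"))
# -> {'visit': '158370', 'filter': '1', 'raft': 'R01', 'sensor': 'S02', 'snap': '000'}
```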
One may also compare the results of Run 1.0a (December) with these new data. The old data, r-band only, reside here: /global/projecta/projectdirs/lsst/production/DC2/old.Dec2017/DC2-phoSim-2-r/output, one visit per sub-directory, as above.
That is indeed unfortunate!
How about getting a reservation for this? I can send a quick email to Debbie and Peter to ask how quickly that can be set up (I got one within a week last year for a 6000 node reservation, so we should get something smaller much quicker).
Questions for you if this is viable: for how long would you need the reservation and for how many nodes? When would be a reasonable time for you to have the reservation (you would want to watch things carefully when they run so that the machine doesn't end up idling).
Please let me know if you think this is a good idea and we can start the process right now (well, almost right now).
Thanks, Katrin
A reservation is a possibility, although the lead time required for a reservation is comparable to the time spent waiting for batch jobs to run. At the moment, there are (surprisingly) 46 KNL nodes running, which will handle Run 1.0b (5 visits x 6 filters).
Hi Tom,
ok then. Usually the machines are much less busy in January because lots of new projects start and people are not immediately ready to go for full up runs. So not too surprising. But that's great that these are then available for testing already.
Much progress for Run 1.0b (five visits for each of six filters). As of 07:50 PST fully 83% of all sensor-visits have successfully completed. Run statistics are beginning to shape up and currently look like this:
Task | sensor-visits | mean clock time (Raytrace) |
---|---|---|
DC2-phoSim-2-u | 422/431 complete | 219 +/- 71 min |
DC2-phoSim-2-g | 509/512 complete | 214 +/- 37 min |
DC2-phoSim-2-r | 379/380 complete | 260 +/- 46 min |
DC2-phoSim-2-i | 224/224 complete | 234 +/- 62 min |
DC2-phoSim-2-z | 142/377 complete | 705 +/- 50 min |
DC2-phoSim-2-y | 336/500 complete | 356 +/- 124 min |
Total | 2012/2424 complete (83%) |
As of 16:25 PST, there are 31 sensor-visits still running. These stragglers are mostly in the z-band, with a couple in the y-band. These two bands are requiring significantly more CPU effort per visit than the other bands. The other four bands are complete (5 visits each).
The final sensor visits completed early Sunday morning (yesterday), so Run 1.0b is complete.
However: due to a speckling issue, a new version of phoSim has been released (v3.7.7). Heather has installed the new code at NERSC, and the plan is to reprocess the very first r-band visit, DC2-phoSim-2-r stream 000000, visitID 151687. The old (v3.7.6) data have been moved aside into this directory: /global/projecta/projectdirs/lsst/production/DC2/DC2-phoSim-2-r/output/000000.v3.7.6 while the new (v3.7.7) data will be placed here: /global/projecta/projectdirs/lsst/production/DC2/DC2-phoSim-2-r/output/000000
Jobs are running now with an ETA to completion around 20:00 Pacific this evening.
Note: there may be yet another phoSim code release as the root cause of the speckling is understood and fixed.
The test jobs (a single r-band visit) using phoSim v3.7.7 completed last evening but continue to show the speckling problem. The PhoSim team has now reproduced the problem and will advise when new code is available for testing.
In the mean time, continue preparations for Run 1.1...
(LATER) The PhoSim team has released v3.7.8, and Heather has installed it. Rolling back the first visit in the r-band (visitID 151687). The previous test output has been moved aside into: /global/projecta/projectdirs/lsst/production/DC2/DC2-phoSim-2-r/output/000000.v3.7.7 while the new (v3.7.8) data will be placed in: /global/projecta/projectdirs/lsst/production/DC2/DC2-phoSim-2-r/output/000000
The job started at 10:37, so it will likely be a few hours before the first sensor-visit rolls off the assembly line.
NOTE: I discovered that the method used to determine the type of processor in use (haswell vs. knl) started to fail just after the new allocation-year changes went into effect on January 9th. Thus, all Run 1.0b raytrace jobs that ran on KNL used code optimized for haswell. The only impact is execution time. This problem has now been fixed.
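For reference, one plausible way to make the haswell-vs-KNL determination at runtime (a sketch of the general idea only, not necessarily how the workflow scripts actually do it) is to inspect the CPU model string:

```python
def node_flavor(cpuinfo="/proc/cpuinfo"):
    """Guess the Cori node type from the CPU model name.

    KNL (Knights Landing) nodes report an 'Intel(R) Xeon Phi(TM)' model name,
    while the Haswell partition reports a regular Xeon model.
    """
    with open(cpuinfo) as f:
        text = f.read()
    return "knl" if "Xeon Phi" in text else "haswell"

print(node_flavor())  # e.g. 'knl' when run inside a KNL batch job
```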
Fixed, Tom. Please use phosim v3.7.8.
A test of phoSim v3.7.8 indicates the speckling problem has been solved; see https://github.com/LSSTDESC/DC2_Repo/issues/69#issuecomment-359990670 and #105. Thanks to the PhoSim team!
No production running today - only various tests (#101) and development (#82) in preparation for Run 1.1
Production for Run 1.1 (phosim) is imminent. Initial test of revamped instanceCatalog generation, dynamic SEDs, updated visit lists, and other config changes and bug fixes has been completed for a single visit in the r-band (WFD). Data products for this test are here:
/global/projecta/projectdirs/lsst/production/DC2/DC2-phoSim-3_WFD-r/output/000001
Note that this test is NOT the final configuration, as a final tag of DC2_repo and agreement on the phoSim "--fov" parameter are still needed -- so these data will be overwritten with production data in the near future.
Run 1.1 consists of the following visits.
Band | Survey | #Visits |
---|---|---|
u | WFD | 67 |
g | WFD | 91 |
r | WFD | 245 |
i | WFD | 223 |
z | WFD | 247 |
y | WFD | 252 |
u | uDDF | 192 |
g | uDDF | 138 |
r | uDDF | 138 |
i | uDDF | 137 |
z | uDDF | 136 |
y | uDDF | 135 |
- | TOTAL | 2001 |
A new DC2 phoSim test visit has been completed, obsHistID=181866 with 63 sensors simulated. Data products are here: /global/projecta/projectdirs/lsst/production/DC2/DC2-phoSim-3_WFD-r/output/000002 . The biggest change since yesterday's test was the change in the --fov parameter from 0.5 to 2.1 (degrees) during instanceCatalog generation. Please have a look.
Ducks are lined up, so production for Run 1.1 begins, starting with r-band WFD.
Production is ramping up. All 245 visits for r-band WFD have been submitted. Twenty-five for each of the remaining 11 configurations have also been submitted.
There have been a few failures: 1) the expected failures when a Pilot terminates (normally, after it times out); and 2) the familiar issue of phoSim failing while reading one of its many files during initialization. The latter arises when many phoSim instances attempt to start (nearly) simultaneously; a simple rollback solves the problem.
At 20:50 today, 425 sensor-visits have completed and 1904 are running. I expect the number of running to increase as new KNL Pilot jobs start to run.
Here is a link to view all 12 workflows on one page.
Addendum: A new failure is beginning to appear in the instanceCatalog generation. This has been reported here.
As the starting benchmark, our NERSC allocation now stands at:
19:51 2/20/2018 m1727 cpu balance = 94,572,858.2
Thanks @johannct, I actually provided trickleStream to Stephan some years ago, so I am quite familiar with its use. In this case, we do not yet know enough to use this tool effectively -- we are still discovering the limits of the instanceCatalog generation step. Also, given that DC2-phoSim consists of 12 independent pipeline tasks, trickleStream will need some work (and thought) to make it useful.
Production continues to ramp up. At the moment (09:00) there are 27 fully completed visits and 4128 completed sensor-visits (which includes both fully and partially completed visits). 1561 sensor-visits are currently running on 46 KNL nodes (about 34 phoSim instances/node). Another 200 nodes have been requested but are not yet running.
A couple of operation issues arose in the last 12 hours.
10:25 - Created a new DB staging area in $SCRATCH and made copies of the agn and minion (obssim) databases, then pointed the instanceCatalog generator at those copies (see the sketch after these entries). Let's see if that improves scalability...
12:32 - Updated the agn database after @danielsf modified the file to eliminate the creation of two temporary files (which were creating problems elsewhere).
16:10 - All remaining visits for the 12 visit categories have now been submitted. The program is: submit Pilots; wait; rollback; repeat. Currently, 48 visits (of 2001) have completed.
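Regarding the 10:25 entry above, the staging step amounts to copying the read-heavy SQLite files into $SCRATCH and pointing the generator at the copies. A minimal sketch (the source directory and database filenames below are placeholders, not the real paths):

```python
import os
import shutil

# Placeholder paths -- stand-ins for the real agn and minion (obssim) databases.
SRC_DIR = "/path/to/shared/DC2/databases"
DB_FILES = ["agn.db", "minion_obssim.db"]

staging = os.path.join(os.environ["SCRATCH"], "DBstaging")
os.makedirs(staging, exist_ok=True)

for db in DB_FILES:
    shutil.copy2(os.path.join(SRC_DIR, db), staging)
    print("staged", db, "->", staging)

# The instanceCatalog generator is then configured to read from 'staging'
# rather than from the shared project file system.
```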
As of 09:00 there are 66 fully completed visits (3% of Run 1.1's 2001 visits) with many more in progress. Cori utilization has been rather low the past 24 hours: a max of 46 nodes dropping down to a mere 8 for several hours this morning. About 1000 nodes have been requested, so we simply must wait for SLURM to let them run.
Good news: both issues opened yesterday have been solved and closed. In addition to the expected job terminations due to Pilot time-outs, a new failure is starting to appear: segmentation faults in phoSim's atmosphere creator, which seem correlated with many simultaneous instances (in the 30-50 range). This failure is transient (affected jobs generally succeed upon rollback) but bears watching.
Handy links (note some are updated only every few minutes):
As of 10:30 today, there are 76 fully completed visits, representing 4% of the total. Production has ground to a virtual halt as the small set of Pilot jobs that started running 2 days ago have all completed. Submitted Pilot jobs are simply not running! There are jobs submitted on the 19th that are still waiting for a chance to run. SLURM indicates that up to 170 nodes may become available in ~9.4 hours, but those estimates are largely unreliable. Perhaps there will be a surge of activity starting this evening. In the meantime, I am using various tricks to prepare as many waiting 'raytrace' steps as possible.
While waiting, I put together an experimental task monitoring page to give me a better global view of what the 12 tasks are up to. It is not very pretty but it's a start. (Hint: the good stuff is at the bottom.)
Thanks @TomGlanzman ! Sorry to hear we are not yet up to LSST data rates - but then I expect commissioning might be a bit like this too ;-)
While we are waiting for the conditions to improve, would you mind breaking down that 4% by band and survey please? Here's an extended version of the table you made further up the thread, is it easy for you to fill in (or automagically over-write) the table below? You can edit this message no problem, or just paste in your own version in a subsequent comment. I'm interested to see what kind of DM processing we could be doing. Thanks!
Band | Survey | Target # Visits | Completed # Visits | % complete |
---|---|---|---|---|
u | WFD | 67 | 0 | 0 |
g | WFD | 91 | 2 | 2 |
r | WFD | 245 | 67 | 27 |
i | WFD | 223 | 1 | .4 |
z | WFD | 247 | 4 | 1.6 |
y | WFD | 252 | 2 | .8 |
------- | --------- | --------------- | ------------------- | ------------- |
u | DDF | 192 | 0 | 0 |
g | DDF | 138 | 0 | 0 |
r | DDF | 138 | 0 | 0 |
i | DDF | 137 | 0 | 0 |
z | DDF | 136 | 0 | 0 |
y | DDF | 135 | 0 | 0 |
------- | --------- | --------------- | ------------------- | ------------- |
- | TOTAL | 2001 | 76 | 3.8 |
(Note from Tom: the values in the "Completed # Visits" column may be easily obtained from this workflow web page.)
Not much happening today -- Pilot jobs submitted last Monday still have not started. Consider:
JOBID ST USER NAME NODES REQUESTED USED SUBMIT QOS SCHEDULED_START FEATURES REASON
10393524 PD descpho phoSimK-20* 20 48:00:00 0:00 2018-02-19T19:31:24 regular_1 2018-02-26T19:40:00 knl&quad&cache Resources
10414515 PD descpho phoSimK-20* 20 48:00:00 0:00 2018-02-20T19:46:28 regular_1 2018-02-26T19:40:00 knl&quad&cache Resources
10414517 PD descpho phoSimK-20* 20 48:00:00 0:00 2018-02-20T19:46:29 regular_1 avail_in_~0.1_hrs knl&quad&cache Resources
10414518 PD descpho phoSimK-20* 20 48:00:00 0:00 2018-02-20T19:46:30 regular_1 avail_in_~0.1_hrs knl&quad&cache Resources
10414519 PD descpho phoSimK-20* 20 48:00:00 0:00 2018-02-20T19:46:31 regular_1 avail_in_~0.1_hrs knl&quad&cache Resources
10414520 PD descpho phoSimK-20* 20 48:00:00 0:00 2018-02-20T19:46:32 regular_1 avail_in_~0.1_hrs knl&quad&cache Resources
10414530 PD descpho phoSimK-50* 50 48:00:00 0:00 2018-02-20T19:48:43 regular_1 avail_in_~0.1_hrs knl&quad&cache Resources
Jobs submitted last Monday are currently scheduled to run next Monday!
In the meantime, I am continuing to run the catalog 'trim' bit manually on a KNL interactive node (max 4 hour limit).
More of the same (see yesterday's report). The priorities at NERSC seem completely given over to huge MPI jobs - at the expense of the type we need for DC2-phoSim production... :(
We need to bug someone at NERSC; the queueing system was always something I had complained about. I will send an email to Richard Gerber.
The first half of the day was a repeat of yesterday -- no activity. Then at 5 minutes 'til noon, the first large SLURM jobs began to run. At first, two jobs (each with 20 nodes), followed by another job of 20 nodes. With these 60 nodes, about 2000 single sensor-visits (raytrace) jobs are running.
The task summary page has been updated and is now a bit easier to read. The tables indicate the overall progress of the DC2-phoSim Run 1.1.
The first table sums up the overall number of workflow batch jobs.
The second table breaks this down according to the process step (setupVisit = instanceCatalog generation and phoSim initialization; RunTrim is the phoSim catalog trimming step; RunRayTrace is the time-consuming ray tracing for a single sensor for a single visit; finishVisit runs when all sensors for a particular visit have completed). Thus, Run 1.1 will be complete when there are 2001 successful 'finishVisit' steps.
The final set of tables gives the summary for each workflow task == each survey configuration, e.g., WFD r-band.
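Put another way, the completion criterion described above boils down to counting successful finishVisit steps. A trivial sketch (the step names are as listed above; the count itself would come from the workflow monitoring pages):

```python
# Run 1.1 pipeline steps, in processing order, as described above.
STEPS = ["setupVisit", "RunTrim", "RunRayTrace", "finishVisit"]

TARGET_VISITS = 2001  # WFD + uDDF visits across all six bands

def run_1_1_complete(successful_finish_visits):
    """Run 1.1 is done once every visit has a successful finishVisit step."""
    return successful_finish_visits >= TARGET_VISITS

print(run_1_1_complete(76))  # False -- roughly where production stood at this point
```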
By the end of the day, 170 KNL nodes were online with phoSim and each node has been reserved for 48 hours. Good news!
I sent an email to Richard Gerber yesterday -- let's see what he says.
Ok, Richard got back to me. He will get someone to talk to Tom; what they want is an estimate of the required throughput from us. You can provide that, right, Tom?
07:30 Finally! After a week's wait, a significant number of Cori resources are beginning to come online for DC2-phoSim. At this time, 8 Pilot jobs on 190 nodes (12,160 cores) are running, supporting ~6,400 raytrace instances (sensor-visits). There are currently 183 fully complete visits, or about 9% of the total.
13:45 Cori continues to deliver more nodes. At this point, 17 Pilot jobs on 370 nodes (23,680 cores) are running >7,100 raytrace instances.
An updated DC2 phoSim monitoring page is now available (it auto updates every 6 min)
Some evidence of Workflow Engine (Pipeline II) stress:
An interesting conversation with a couple of NERSC folks this afternoon. One reason for the poor SLURM response this past weekend was a massive 9,000-node MPI job that ran for several days. (Cori-KNL has a total of about 9,800 nodes, so this had a serious impact on the rest of us.) Other than that, they reaffirmed that the only strategies for better throughput they can offer include:
Neither of these two strategies is a silver bullet, and both will end up costing our allocation.
I also mentioned the lack of accuracy in the SLURM job dispatch estimate (in response to the 'sqs' command, for example), the burstiness of SLURM dispatch, the inability to execute a controlled ramp-up of production, and inability to control the rate at which jobs start. Maybe a fresh look will help solve some of these problems.
While the number of Pilot jobs is decreasing as they run out the 48-hour clock, SLURM has not been kind to us in replenishing this effort. There remain dozens of jobs held hostage in the queue, some for over one week. Perhaps Cori is preparing for another mega-job and not allowing our 48-hour jobs to backfill... The question now is whether to request a 'reservation', the request for which would itself take several days to process. From today's standpoint, something like 300-400 nodes for 2-3 days would make a huge dent in the remaining work. But one cannot predict whether, or how many of, the existing queued jobs may start to run before a reservation could be put in place.
JOBID ST USER NAME NODES REQUESTED USED SUBMIT QOS SCHEDULED_START FEATURES REASON
10601130 PD stanier ikink 2048 24:00:00 0:00 2018-02-28T21:07:31 regular_0 2018-03-02T14:00:00 knl&quad&cache Resources
10535604 PD u6338 fullKahuna* 5600 36:00:00 0:00 2018-02-26T13:33:57 regular_0 2018-03-02T14:00:01 knl&quad&cache Resources
10599218 R heitmann test_knowh* 6144 8:00:00 28:45 2018-02-28T19:46:13 regular_0 2018-03-01T13:39:41 knl&quad&cache None
10599371 PD heitmann test_knowh* 6144 24:00:00 0:00 2018-02-28T19:54:40 regular_0 2018-03-04T02:00:00 knl&quad&cache Resources
10599382 PD heitmann test_knowh* 6144 24:00:00 0:00 2018-02-28T19:55:00 regular_0 2018-03-05T02:00:00 knl&quad&cache Resources
10599385 PD heitmann test_knowh* 6144 24:00:00 0:00 2018-02-28T19:55:07 regular_0 avail_in_~0.1_hrs knl&quad&cache Priority
10599394 PD heitmann test_knowh* 6144 24:00:00 0:00 2018-02-28T19:55:18 regular_0 avail_in_~0.1_hrs knl&quad&cache Priority
10535600 PD u6338 allKahuna_* 9200 2:00:00 0:00 2018-02-26T13:33:53 regular_0 2018-03-02T11:17:09 knl&quad&cache Resources
There is a 9,200-node mega-job, but it runs for only 2 hours. Each of the five 6,144-node jobs, however, occupies 63% of Cori-KNL, and four of them run for 24 hours each, which may be our biggest competition for the coming days.
There are 550 complete visits, basically where we were yesterday. NERSC has ramped us down to a mere 40 nodes. There are 41 Pilots (880 nodes) waiting -- for up to 9 days -- to start running. So, while essentially idle, we wait...........
Run 1.1, although incomplete, is winding down. 713 visits are complete (36%).
Preparations for phoSim testing are underway in #140
Testing to resume production has gotten underway. With new phoSim background parameters and updated catalog generation, this new project is dubbed "Run 1.2p". A single visit with a non-production catalog generation is running here.
A NERSC 'reservation' has been requested (100 KNL nodes for 24-hours) to help jump-start this project. If granted, this reservation will begin sometime Thursday, 5 Apr 2018. Hopefully all production code will be in place by that time.
09:45 - The first Run 1.2p test run continues, and the first six (of 32) sensors for this visit have completed. A first quick look at the FITS files indicates reasonable images. (Chris W's worry about commented overrides being interpreted by phoSim did not come to pass; the FITS headers show that those commented background parameters were properly ignored.) There is a ~2x difference in background across the field of view in these first few sensors. Interested parties are invited to look at both 'electron' and 'amplifier/ADC' image files, which continue to populate this directory: /global/projecta/projectdirs/lsst/production/DC2/DC2-R1-2p-WFD-r/output/000000
11:00 - All 32 sensor-visits for the test visit have completed.
DC2 Run 1.2p began yesterday evening! Five single-node 'premium' class SLURM Pilot jobs were submitted to get the ball rolling.
A fortuitous decision in February to "hold" rather than "cancel" a batch of SLURM jobs from Run 1.1 has led to the discovery that, once released, these jobs start very quickly. There are 31 such "held" jobs, collectively representing 680 nodes, each job good for 48 hours. Eleven of these 20-node jobs have been released and are now running. A good way to start the production!
11:00 status:
- 16 SLURM Pilot jobs running on 225 hosts (14,400 cores)
- 115 sensor-visits complete
- 7350 jobs running
Excellent news! Glad we gave this a try rather than killing the jobs back then.
Thanks to those old SLURM jobs, production continues to run at a good pace. Fresh jobs submitted yesterday are slowly moving up in the queue. We have a 100-node 24-hour reservation coming up Monday morning at 10am. With all of these factors, production may smoothly continue even as the old jobs terminate.
13:00 status:
A productive and reasonably stable weekend; today we will begin again to ramp up the number of nodes. A 100-node, 24-hour reservation begins today at 10am.
08:00 status:
A good 24 hours of normal running, increasing the number of nodes to 760.
07:50 status:
For the WFD y-band workflow, the execution time distribution for the 622 successful sensor-visits looks like this:
For the WFD z-band, only 36 of 1009 attempts have completed successfully. The execution time distribution appears below:
15:00 status: Run 1.2p is losing Cori-KNL nodes; as jobs complete, new jobs are not starting quickly enough to replace them. Thus, only 470 nodes are currently active (although 2461 have been requested via SLURM).
The ramp-up of Run 1.2p has stalled and even lost momentum, because NERSC is not running jobs that have now been in queue for over five (5) days. At 09:45 I released the final set of jobs submitted in February (on the 22nd), so we will see whether those three 20-node jobs start running soon.
09:45 status:
As of this morning the pipeline has completely run dry: there are zero jobs running at present. Despite having >3000 nodes of batch jobs in the queues, some of which have been waiting for nearly a week, the NERSC scheduler is not providing computing resources, so production has ground to a full stop.
07:45 status:
09:30 update: The DC2 production has stopped due to full machine reservations associated with the yearly Gordon Bell challenge, for which the deadline is this coming Sunday. The current set of reservations started this morning at 09:00 and last for 27 hours. Hopefully DC2 jobs will again start to run shortly after noon tomorrow (Friday).
Noon update: The full machine reservations were cancelled at the last moment. Jobs have started to run again. We are currently up to 480 nodes ...oops, make that 430 (apparently 50 nodes crashed!)
Production has now been running for one week. After a 2-day ramp-down, an 8-hour outage and the false threat of a lengthier (27-hour 'reservation') outage to follow, production quickly ramped back up to just under 500 nodes where it has remained since yesterday noon.
07:40 status:
Note that ~4M NERSC hours have been spent on Run 1.2p thus far and it is nearly 25% complete.
Something bad happened just after midnight this morning that caused 600 nodes to crash (at least, that is what the clues point to). Further, whatever caused this continued to plague new jobs until 08:00 this morning. The end result is that >81,000 jobs were abnormally terminated. It is a technical challenge simply to roll back such a large number of jobs :( (A NERSC ticket has been submitted to inquire about this episode, but we may not hear back until Monday.)
13:40 status:
Recovery from yesterday's mishap and, perhaps surprisingly, continued ramp up.
13:50 status:
18:00 status:
A surprisingly good weekend -- after recovering from the Saturday morning meltdown.
07:45 status:
Another look at our NERSC allocation... Between 5 April and today, 9,957,846 billable NERSC hours have been consumed. In this time, 54,929 raytrace jobs have been successfully completed, representing about 34% of the total. Assuming that Run 1.2p is the only significant consumer of resources, one might conclude that completing this run might require roughly 30M NERSC hours. Why is this number so high?
Firstly, in addition to the 54,929 successful raytrace jobs, there were also 12,497 long-running jobs that timed out after 48 hours and, thus, represent wasted effort because they will all need to be started again from the beginning. [We are currently running 33 instances of raytrace per Cori-KNL node, thus a single timed-out instance of raytrace consumes 48*96/33 = 140 NERSC hours. 12,497 such jobs thus represent 1.7M NERSC hours, or about 17.5% of the total.]
Secondly, whenever a Pilot job times out (after 48 hours), any partially completed jobs crash and must be restarted (unless that job ran for the full 48 hours, in which case it is ignored at present). Given that Run 1.2p jobs are taking 20 to >48 hours to complete, there is a significant inefficiency, possibly as high as 40%.
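The per-instance cost quoted above, and the resulting estimate of wasted effort, follow directly from these numbers (a sketch of the arithmetic only):

```python
# Cost of a raytrace instance that occupies a full 48-hour Pilot and then times out.
PILOT_HOURS        = 48    # Pilot job wall-clock limit
CHARGE_FACTOR      = 96    # billable NERSC-hours per KNL node-hour
INSTANCES_PER_NODE = 33    # raytrace instances packed onto one Cori-KNL node

cost_per_timeout = PILOT_HOURS * CHARGE_FACTOR / INSTANCES_PER_NODE  # ~140 NERSC-hours

timed_out = 12_497          # long-running raytrace jobs that hit the 48-hour limit
consumed  = 9_957_846       # billable NERSC-hours spent on Run 1.2p so far

wasted = timed_out * cost_per_timeout
print(f"cost per timed-out instance: {cost_per_timeout:.0f} NERSC-hours")
print(f"wasted so far: {wasted/1e6:.1f}M NERSC-hours ({100*wasted/consumed:.1f}% of total)")
```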
[updated for Run 1.2p] This issue will be a log/diary of the operational progress of the protoDC2 phoSim image generation at NERSC. It is not intended to be a venue for discussing phoSim configuration (see, for example, #19, #33, #134, #140 and #163) or results. A few technical details about the workflow itself can be found here.
As data accumulate, you may find the image files in this directory tree (for the WFD field and r-filter):
/global/projecta/projectdirs/lsst/production/DC2/DC2-R1-2p-WFD-r/output
Each subdirectory corresponds to a single visit. The phoSim working directories (in $SCRATCH) are here (again, for the WFD field and r-filter): /global/cscratch1/sd/descpho/Pipeline-tasks/DC2-R1-2p-WFD-r
and are similarly organized in subdirectories, one per visit. Real-time monitoring of the 12 workflows:
Each field (WFD and uDDF) and band (u, g, r, i, z, y) has a fixed number of visits, per the following table.
Approx total sensor-visits = 158,766