LSSTDESC / DC2-production

Configuration, production, validation specifications and tools for the DC2 Data Set.

protoDC2 (Run 1.2p) phoSim operations log #65

Closed TomGlanzman closed 6 years ago

TomGlanzman commented 6 years ago

[updated for Run 1.2p] This issue will be a log/diary of the operational progress of the protoDC2 phoSim image generation at NERSC. This issue is not intended to be a venue for discussing phoSim configuration (see, for example, #19, #33, #134, #140 and #163) or results. A few technical details about the workflow itself can be found here.

As data accumulate, you may find the image files in this directory tree (for the WFD field and r-filter): /global/projecta/projectdirs/lsst/production/DC2/DC2-R1-2p-WFD-r/output. Each subdirectory corresponds to a single visit. The phoSim working directories (in $SCRATCH) are here (again, for the WFD field and r-filter): /global/cscratch1/sd/descpho/Pipeline-tasks/DC2-R1-2p-WFD-r and are similarly organized in subdirectories, one per visit.
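For example, a quick way to see how many visits have produced output so far, and how many image files the first visit contains (a simple sketch using the WFD r-filter path above):

```bash
# Count visit subdirectories with output, then image files in the first visit
OUT=/global/projecta/projectdirs/lsst/production/DC2/DC2-R1-2p-WFD-r/output
ls -d "$OUT"/0* 2>/dev/null | wc -l                     # number of visit subdirectories
ls "$OUT"/000000/lsst_e_*.fits.gz 2>/dev/null | wc -l   # image files for the first visit
```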

Real-time monitoring of the 12 workflows:

Each field (WFD and uDDF) and band (u, g, r, i, z, y) combination has a fixed number of visits, per the following table.

| Band | Survey | #Visits | Mean #sensors/visit |
|------|--------|---------|---------------------|
| u | WFD | 67 | 72 |
| g | WFD | 91 | 67 |
| r | WFD | 245 | 75 |
| i | WFD | 223 | 73 |
| z | WFD | 247 | 73 |
| y | WFD | 252 | 72 |
| u | uDDF | 192 | 88 |
| g | uDDF | 138 | 88 |
| r | uDDF | 138 | 88 |
| i | uDDF | 137 | 88 |
| z | uDDF | 136 | 88 |
| y | uDDF | 135 | 88 |
| - | TOTAL | 2001 | 79 |

Approx total sensor-visits = 158,766
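As a cross-check, the sensor-visit total follows directly from the table (the per-visit sensor counts are rounded means, so this only approximately reproduces the quoted 158,766):

```bash
# Sum (#visits x mean sensors/visit) over the 12 field/band rows above
awk 'BEGIN {
  split("67 91 245 223 247 252 192 138 138 137 136 135", v, " ")   # visits, in table order
  split("72 67 75 73 73 72 88 88 88 88 88 88",           s, " ")   # mean sensors/visit
  for (i = 1; i <= 12; i++) total += v[i]*s[i]
  printf "Approximate total sensor-visits: %d\n", total            # ~158.8k with rounded means
}'
```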

TomGlanzman commented 6 years ago

Production has started, although it would not be surprising if a problem is discovered that forces a halt and a restart from the beginning. A last-minute decision: the phoSim amplifier file output has been disabled due to problems with those files.

TomGlanzman commented 6 years ago

The NERSC queues are not behaving nicely. A single-node 24-hour job submitted on Monday morning is now scheduled to run tomorrow at the earliest. To jump-start the process, a 10-node 8-hour KNL Pilot was submitted this afternoon and, lo!, it has started. There are now >300 raytrace processes running.
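For context, a KNL "Pilot" submission is an ordinary SLURM batch job; a rough sketch of what the 10-node, 8-hour request might look like (the job name mirrors those seen later in the thread, and the wrapper script is a placeholder -- the real Pilots are launched by the workflow engine):

```bash
# Illustrative SLURM request resembling the 10-node, 8-hour KNL Pilot described above
sbatch -N 10 -t 8:00:00 \
       -C knl,quad,cache \
       -q regular \
       -J phoSimK-10 \
       pilot_launch.sh     # placeholder for the actual Pilot wrapper script
```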

A question arose earlier about how long it might take to complete the ~8000 visits in protoDC2. Given that changes were made in the phoSim command file just yesterday, we won't have good timing data until a sufficient number of sensor-visits have completed. Hopefully tomorrow...

We're off and running!

TomGlanzman commented 6 years ago

Thursday 21 Dec Update

Some success over the past 12 hours: 1865 sensor-visits completed (mostly on KNL) using the latest phoSim command file (ref #63). Based on these runtime performance statistics, a first estimate of total protoDC2 resource and time consumption may be made.

| Band | #Visits |
|------|---------|
| u | 534 |
| g | 776 |
| r | 1782 |
| i | 1795 |
| z | 1612 |
| y | 1581 |
| Total | 8080 |

Using average values:

This amount of processing could, under unrealistically ideal conditions, be performed in less than one week using ~1000 nodes. The trick will be keeping jobs running efficiently. The main challenges include:

Monitoring the workflow

The main r-filter workflow monitor is here. An experimental Pilot job monitor is here.

21:45 update - production is accelerating. All 1782 r-filter visits have been submitted. Current performance graphs for the raytrace step: plot-6

TomGlanzman commented 6 years ago

Saturday 23 Dec 2017 Update:

As of 07:30 PST, over 19,000 sensor-visits have been completed (representing ~3% of the anticipated 606k total sensor-visits in the current visit lists). The challenge has been keeping SLURM jobs running. Short jobs (8-10 hours) seem to start up within a few hours, but they suffer from serious inefficiency when they end -- taking many partially run raytrace jobs down with them. Long jobs (24 hours) spend many days in the queue before they begin running.

When a block of Cori-KNL nodes does begin to run, this is often accompanied by a set of failed raytrace jobs -- which fail when attempting to access one of the phoSim site or instrument files. These jobs fail almost immediately (hence, a small impact on efficiency) and can easily be rolled back. My guess is that the shock of starting hundreds of jobs simultaneously is putting a strain on the connection to the file system. An annoyance, but not (yet) serious. Side note: 50-node KNL jobs are the largest submitted so far. As experience accrues, larger jobs will be submitted.

The plan going forward will be to attempt keeping a mix of short-running and long-running jobs in the cori queues in the hopes of improving overall utilization. A plot of KNL usage vs. time is beginning to take form here: https://portal.nersc.gov/project/lsst/glanzman/graph3.html

13:15 UPDATE: Due to issues described here, all jobs have been cancelled or held pending resolution. Production will eventually be restarted from the beginning.

salmanhabib commented 6 years ago

@TomGlanzman Starting hundreds of jobs should not put a "strain on the connection to the file system." In principle, hundreds of jobs is nothing to worry about -- you should file a ticket with NERSC about this. Something is not working correctly.

TomGlanzman commented 6 years ago

Tuesday 16 Jan 2018 Update:

Updating workflow to reflect changes/fixes since the December run.

Changes:

  1. gcr-catalogs updated from github (master), checked out the pre-new-year version, and built the aux file:

     git clone https://github.com/LSSTDESC/gcr-catalogs.git
     cd gcr-catalogs
     git checkout 204c504bd785fc9127a01c3c5f9a24640b3e7583
     cd GCRCatSimInterface/data
     source /global/common/software/lsst/cori-haswell-gcc/stack/setup_w_2017_46_py3_gcc6.sh
     setup lsst_sims
     python get_sed_mags.py

  2. instanceCatalog generation option changed from '--descqa_cat_file proto-dc2_v2.1.1' to 'protoDC2'

  3. All previous output from the December 2017 trial run (Run 1.0a) has been temporarily moved aside in preparation for deletion. Please let me know if it is necessary to preserve these data from this early attempt.

The initial new data (notionally, Run 1.0b) will be generated from workflows DC2-phoSim-2-r version 1.000 and DC2-phoSim-2-i version 1.000. The first visits in each of these two bands are running now. Visits simulated are the same as those used in December (but only r-band exists for the December run), thus a comparison will be possible between Jan and Dec images.

The exact amount of data to be produced is still under discussion, although 3-4 visits in all six bands has been put forward.

TomGlanzman commented 6 years ago

Wednesday 17 Jan 2018 Update:

katrinheitmann commented 6 years ago

That is indeed unfortunate!

How about getting a reservation for this? I can send a quick email to Debbie and Peter to ask how quickly that can be set up (I got one within a week last year for a 6000 node reservation, so we should get something smaller much quicker).

Questions for you if this is viable: for how long would you need the reservation and for how many nodes? When would be a reasonable time for you to have the reservation (you would want to watch things carefully when they run so that the machine doesn't end up idling).

Please let me know if you think this is a good idea and we can start the process right now (well, almost right now).

Thanks, Katrin

On 1/17/18 12:32 PM, Tom Glanzman wrote:

Wednesday 17 Jan 2018 Update:

  • Run 1.0b jobs are running, but slowly. Individual sensor-visits are currently taking >240 min (clock time) in the RayTrace step (using 8 threads), so cannot be run in NERSC's "qos=interactive" service 😢. The average run time for this step (based on small statistics) is ~255 minutes (rotten luck!), but with tails extending to 350 min. Therefore, these steps must be done using the normal (non-interactive) batch queue, which means waiting many hours up to a *week* for jobs to start.

  • An easy way to monitor the overall progress is with this Pipeline status page <http://srs.slac.stanford.edu/Pipeline-II/exp/LSST-DESC/index.jsp?versionGroup=latestVersions&submit=Filter&d-4021922-s=1&d-4021922-o=2&taskFilter=DC2-phoSim&include=last30>; the number in the "green check" column represents the number of successfully completed visits for each filter.

  • For this run ("Run 1.0b"), five (5) visits for each of the six filters will be simulated, both to test the phoSim configuration and the downstream pipeline. The new data are populating these directories in NERSC:/global/projecta/projectdirs/lsst/production/DC2:

    DC2-phoSim-2-u/output
    DC2-phoSim-2-g/output
    DC2-phoSim-2-r/output
    DC2-phoSim-2-i/output
    DC2-phoSim-2-z/output
    DC2-phoSim-2-y/output

    For each band's 'output' directory, there is one sub-directory per visit. The visit sub-directory name, e.g., 000000, is an index representing its order in the visit catalog. The visitID can be obtained by looking into the visit directory files; e.g., DC2-phoSim-2-r/output/000000 contains:

    -rw-rw----+ 1 descpho lsst   886095 Jan 17 09:47 centroid_lsst_e_158370_f1_R01_S02_E000.txt
    -rw-rw----+ 1 descpho lsst   837341 Jan 17 09:51 centroid_lsst_e_158370_f1_R02_S22_E000.txt
    -rw-rw----+ 1 descpho lsst  1082843 Jan 17 10:10 centroid_lsst_e_158370_f1_R13_S01_E000.txt
    -rw-rw----+ 1 descpho lsst   826844 Jan 17 09:44 centroid_lsst_e_158370_f1_R13_S12_E000.txt
    -rw-rw----+ 1 descpho lsst   851356 Jan 17 09:35 centroid_lsst_e_158370_f1_R24_S02_E000.txt
    -rw-rw----+ 1 descpho lsst 25838119 Jan 17 09:47 lsst_e_158370_f1_R01_S02_E000.fits.gz
    -rw-rw----+ 1 descpho lsst 25201205 Jan 17 09:51 lsst_e_158370_f1_R02_S22_E000.fits.gz
    -rw-rw----+ 1 descpho lsst 25193942 Jan 17 10:10 lsst_e_158370_f1_R13_S01_E000.fits.gz
    -rw-rw----+ 1 descpho lsst 25125607 Jan 17 09:44 lsst_e_158370_f1_R13_S12_E000.fits.gz
    -rw-rw----+ 1 descpho lsst 25070875 Jan 17 09:35 lsst_e_158370_f1_R24_S02_E000.fits.gz

    The file lsst_e_158370_f1_R01_S02_E000.fits.gz is an image file for visit 158370, using filter 1 ('r'), for sensor R01_S02, and 'snap' 000 (only one snap per visit for this project). (A sketch for decoding these file names appears after this list.)

  • One may also compare the results of Run 1.0a (December) with these new data. The old data, r-band only, reside here: /global/projecta/projectdirs/lsst/production/DC2/old.Dec2017/DC2-phoSim-2-r/output, one visit per sub-directory, as above.
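For anyone scripting over these outputs, the file-name fields (visit, filter index, raft, sensor, snap) can be pulled apart with plain bash; a minimal sketch, with the field positions inferred from the example name above (treat that layout as an assumption):

```bash
#!/bin/bash
# Decode a phoSim eimage name of the form
#   lsst_e_<visitID>_f<filter>_R<rr>_S<ss>_E<snap>.fits.gz
f="lsst_e_158370_f1_R01_S02_E000.fits.gz"

base=${f%.fits.gz}                              # strip the extension
IFS='_' read -r _ _ visit filt raft sensor snap <<< "$base"

echo "visit  : $visit"                          # 158370
echo "filter : ${filt#f}"                       # 1 (phoSim filter index)
echo "sensor : ${raft}_${sensor}"               # R01_S02
echo "snap   : ${snap#E}"                       # 000
```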


TomGlanzman commented 6 years ago

A reservation is a possibility, although the lead time required for a reservation is comparable to the time spent waiting for a batch job to run. At the moment, there are (surprisingly) 46 KNL nodes running, which will handle Run 1.0b (5 visits x 6 filters).

katrinheitmann commented 6 years ago

Hi Tom,

OK then. Usually the machines are much less busy in January because lots of new projects start and people are not immediately ready to go for full-up runs, so this is not too surprising. But it's great that these nodes are available for testing already.


TomGlanzman commented 6 years ago

Thursday 18 Jan 2018 Update:

Much progress for Run 1.0b (five visits for each of six filters). As of 07:50 PST fully 83% of all sensor-visits have successfully completed. Run statistics are beginning to shape up and currently look like this:

| Task | Sensor-visits | Mean clock time (raytrace) |
|------|---------------|----------------------------|
| DC2-phoSim-2-u | 422/431 complete | 219 +/- 71 min |
| DC2-phoSim-2-g | 509/512 complete | 214 +/- 37 min |
| DC2-phoSim-2-r | 379/380 complete | 260 +/- 46 min |
| DC2-phoSim-2-i | 224/224 complete | 234 +/- 62 min |
| DC2-phoSim-2-z | 142/377 complete | 705 +/- 50 min |
| DC2-phoSim-2-y | 336/500 complete | 356 +/- 124 min |
| Total | 2012/2424 complete (83%) | |

TomGlanzman commented 6 years ago

Friday 19 Jan 2018 Update:

As of 16:25 PST, there are 31 sensor-visits still running. These stragglers are mostly in the z-band, with a couple in the y-band. These two bands are requiring significantly more CPU effort per visit than the other bands. The other four bands are complete (5 visits each).

TomGlanzman commented 6 years ago

Monday 22 Jan 2018 Update:

The final sensor visits completed early Sunday morning (yesterday), so Run 1.0b is complete.

However: Due to a speckling issue, a new version of phoSim has been released (v3.7.7). Heather has installed the new code at NERSC and the plan is to reprocess the very first r-band visit, DC2-phoSim-2-r stream 000000, visitID 151687. The old (v3.7.6) data has been moved aside into this directory: /global/projecta/projectdirs/lsst/production/DC2/DC2-phoSim-2-r/output/000000.v3.7.6 while the new (v3.7.7) data will be placed here: /global/projecta/projectdirs/lsst/production/DC2/DC2-phoSim-2-r/output/000000

Jobs are running now with an ETA to completion around 20:00 Pacific this evening.

Note: there may be yet another phoSim code release as the root cause of the speckling is understood and fixed.

TomGlanzman commented 6 years ago

Tuesday 23 Jan 2018 Update:

The test jobs (a single r-band visit) using phoSim v3.7.7 completed last evening but continue to show the speckling problem. The PhoSim team has now reproduced the problem and will advise when new code is available for testing.

In the meantime, preparations for Run 1.1 continue...


(LATER) The PhoSim team has released v3.7.8 and Heather has installed it. Rolling back the first visit in the r-band (visitID 151687). The previous test output has been moved aside into: /global/projecta/projectdirs/lsst/production/DC2/DC2-phoSim-2-r/output/000000.v3.7.7 while the new (v3.7.8) data will be placed here: /global/projecta/projectdirs/lsst/production/DC2/DC2-phoSim-2-r/output/000000

The job started at 10:37, so it will likely be a few hours before the first sensor-visit rolls off the assembly line.


NOTE: I discovered that the method used to determine the type of processor in use (Haswell vs. KNL) started to fail just after the new allocation-year changes went into effect on January 9th. Thus, all Run 1.0b raytrace jobs that ran on KNL used code optimized for Haswell. The only impact is on execution time. This problem has now been fixed.
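For reference, one robust way to tell the two Cori architectures apart at run time is to inspect the CPU flags rather than any allocation-year-dependent setting: the AVX-512ER instructions exist only on Knights Landing. A minimal sketch (not necessarily the method the workflow actually uses):

```bash
# Pick the processor-specific phoSim build by inspecting CPU capabilities
if grep -q -m1 avx512er /proc/cpuinfo; then
    arch="knl"            # AVX-512ER is specific to Knights Landing
else
    arch="haswell"
fi
echo "Running on a ${arch} node"
# export PHOSIM_ROOT=/path/to/phosim-${arch}   # hypothetical install layout
```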

johnrpeterson commented 6 years ago

Fixed, Tom. Please use phosim v3.7.8.

TomGlanzman commented 6 years ago

Wednesday 24 Jan 2018 Update:

A test of phoSim v3.7.8 indicates the speckling problem seems to have been solved; see https://github.com/LSSTDESC/DC2_Repo/issues/69#issuecomment-359990670 and #105. Thanks to the PhoSim team!

No production running today -- only various tests (#101) and development (#82) in preparation for Run 1.1.

TomGlanzman commented 6 years ago

Friday 16 Feb 2018 Update:

Production for Run 1.1 (phosim) is imminent. Initial test of revamped instanceCatalog generation, dynamic SEDs, updated visit lists, and other config changes and bug fixes has been completed for a single visit in the r-band (WFD). Data products for this test are here:

/global/projecta/projectdirs/lsst/production/DC2/DC2-phoSim-3_WFD-r/output/000001

Note that this test does NOT use the final configuration: a final tag of DC2_Repo and agreement on the phoSim "--fov" parameter are still needed -- so these data will be overwritten with production data in the near future.

Run 1.1 consists of the following visits.

| Band | Survey | #Visits |
|------|--------|---------|
| u | WFD | 67 |
| g | WFD | 91 |
| r | WFD | 245 |
| i | WFD | 223 |
| z | WFD | 247 |
| y | WFD | 252 |
| u | uDDF | 192 |
| g | uDDF | 138 |
| r | uDDF | 138 |
| i | uDDF | 137 |
| z | uDDF | 136 |
| y | uDDF | 135 |
| - | TOTAL | 2001 |

TomGlanzman commented 6 years ago

A new DC2 phoSim test visit has been completed, obsHistID=181866 with 63 sensors simulated. Data products are here: /global/projecta/projectdirs/lsst/production/DC2/DC2-phoSim-3_WFD-r/output/000002 . The biggest change since yesterday's test was the change in the --fov parameter from 0.5 to 2.1 (degrees) during instanceCatalog generation. Please have a look.

TomGlanzman commented 6 years ago

Monday 19 Feb 2018 Update:

The ducks are lined up, so we begin production for Run 1.1, starting with r-band, WFD.

TomGlanzman commented 6 years ago

Tuesday 20 Feb 2018 Update:

Production is ramping up. All 245 visits for r-band WFD have been submitted. Twenty-five visits for each of the remaining 11 configurations have also been submitted.

There have been a few failures: 1) the expected failures when a Pilot terminates (normally, after it times out); and 2) the familiar issue of phoSim failing to read one of its many files during initialization. This latter issue arises when many phoSim instances attempt to start (nearly) simultaneously. A simple rollback solves the problem.

At 20:50 today, 425 sensor-visits have completed and 1904 are running. I expect the number running to increase as new KNL Pilot jobs start to run.

Here is a link to view all 12 workflows on one page.

Addendum: A new failure is beginning to appear in the instanceCatalog generation. This has been reported here.

As the starting benchmark, our NERSC allocation now stands at:

19:51 2/20/2018  m1727 cpu balance = 94,572,858.2

TomGlanzman commented 6 years ago

Thanks @johannct, I actually provided tricklestream to Stephan some years ago so am quite familiar with its use. In this case, we do not yet know enough to use this tool effectively -- we are still in the process of discovering the limits of the instanceCatalog generation step. Also, given that DC2-phoSim consists of 12 independent pipeline tasks, trickleStream will need some work (and thought) to make it useful.

TomGlanzman commented 6 years ago

Wednesday 21 Feb 2018 Update:

Production continues to ramp up. At the moment (09:00) there are 27 fully completed visits and 4128 completed sensor-visits (which includes both fully and partially completed visits). 1561 sensor-visits are currently running on 46 KNL nodes (about 34 phoSim instances/node). Another 200 nodes have been requested but are not yet running.

A couple of operational issues arose in the last 12 hours:

#135 - solved and closed

#136 - open

10:25 - Create a new DB staging area in $SCRATCH and make copies of the agn and minion (OpSim) databases, then point the instanceCatalog generator at those copies. Let's see if that improves scalability... (a sketch of this staging step appears after these entries).

12:32 - update the agn database after @danielsf modifies the file to eliminate the creation of two temporary files (which were creating problems elsewhere).

16:10 - All remaining visits for the 12 visit categories have now been submitted. The program is: submit Pilots; wait; rollback; repeat. Currently, 48 visits (of 2001) have completed.
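A minimal sketch of the database staging mentioned in the 10:25 entry; the source paths, file names, and environment-variable hooks are illustrative assumptions, not the actual configuration:

```bash
# Stage read-only copies of the input databases onto $SCRATCH so that many
# concurrent instanceCatalog jobs do not all read the same project-space files.
DBSTAGE="$SCRATCH/DC2/DBstaging"
mkdir -p "$DBSTAGE"

# Hypothetical source locations -- substitute the real ones
cp -v /path/to/agn_db.db      "$DBSTAGE/"
cp -v /path/to/minion_1016.db "$DBSTAGE/"

# Point the instanceCatalog generator at the staged copies (hypothetical hooks)
export AGN_DB="$DBSTAGE/agn_db.db"
export OPSIM_DB="$DBSTAGE/minion_1016.db"
```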

TomGlanzman commented 6 years ago

Thursday 22 Feb 2018 Update:

As of 09:00 there are 66 fully completed visits (3% of Run 1.1's 2001 visits) with many more in progress. Cori utilization has been rather low the past 24 hours: a max of 46 nodes dropping down to a mere 8 for several hours this morning. About 1000 nodes have been requested, so we simply must wait for SLURM to let them run.

The good news is that both issues opened yesterday have been solved and closed. In addition to the expected job terminations due to Pilot time-outs, a new failure is starting to appear: segmentation faults in phoSim's atmosphere creator, which seem correlated with many simultaneous instances (in the 30-50 range). This problem appears to be transient -- jobs generally succeed upon rollback -- but it bears watching.

Handy links (note some are updated only every few minutes):

TomGlanzman commented 6 years ago

Friday 23 Feb 2018 Update:

As of 10:30 today, there are 76 fully completed visits, representing 4% of the total. Production has ground to a virtual halt as the small set of Pilot jobs that started running 2 days ago have all completed. Submitted Pilot jobs are simply not running! There are jobs submitted on the 19th that are still waiting for a chance to run. SLURM indicates that up to 170 nodes may become available in ~9.4 hours, but those estimates are largely unreliable. Perhaps there will be a surge of activity starting this evening. In the meantime, I am using various tricks to prepare as many waiting 'raytrace' steps as possible.

While waiting, I put together an experimental task monitoring page to give me a better global view of what the 12 tasks are up to. It is not very pretty but it's a start. (Hint: the good stuff is at the bottom.)

drphilmarshall commented 6 years ago

Thanks @TomGlanzman ! Sorry to hear we are not yet up to LSST data rates - but then I expect commissioning might be a bit like this too ;-)

While we are waiting for the conditions to improve, would you mind breaking down that 4% by band and survey please? Here's an extended version of the table you made further up the thread; is it easy for you to fill it in (or automagically over-write the table below)? You can edit this message no problem, or just paste in your own version in a subsequent comment. I'm interested to see what kind of DM processing we could be doing. Thanks!

| Band | Survey | Target # Visits | Completed # Visits | % complete |
|------|--------|-----------------|--------------------|------------|
| u | WFD | 67 | 0 | 0 |
| g | WFD | 91 | 2 | 2 |
| r | WFD | 245 | 67 | 27 |
| i | WFD | 223 | 1 | 0.4 |
| z | WFD | 247 | 4 | 1.6 |
| y | WFD | 252 | 2 | 0.8 |
| u | DDF | 192 | 0 | 0 |
| g | DDF | 138 | 0 | 0 |
| r | DDF | 138 | 0 | 0 |
| i | DDF | 137 | 0 | 0 |
| z | DDF | 136 | 0 | 0 |
| y | DDF | 135 | 0 | 0 |
| - | TOTAL | 2001 | 76 | ~4 |

(Note from Tom: the values in the "Completed # Visits" column may be easily obtained from this workflow web page.)

TomGlanzman commented 6 years ago

Saturday 24 Feb 2018 Update

Not much happening today -- Pilot jobs submitted last Monday still have not started. Consider:

JOBID     ST  USER     NAME         NODES REQUESTED USED  SUBMIT               QOS        SCHEDULED_START      FEATURES        REASON    
10393524  PD  descpho  phoSimK-20*  20    48:00:00  0:00  2018-02-19T19:31:24  regular_1  2018-02-26T19:40:00  knl&quad&cache  Resources
10414515  PD  descpho  phoSimK-20*  20    48:00:00  0:00  2018-02-20T19:46:28  regular_1  2018-02-26T19:40:00  knl&quad&cache  Resources
10414517  PD  descpho  phoSimK-20*  20    48:00:00  0:00  2018-02-20T19:46:29  regular_1  avail_in_~0.1_hrs    knl&quad&cache  Resources
10414518  PD  descpho  phoSimK-20*  20    48:00:00  0:00  2018-02-20T19:46:30  regular_1  avail_in_~0.1_hrs    knl&quad&cache  Resources
10414519  PD  descpho  phoSimK-20*  20    48:00:00  0:00  2018-02-20T19:46:31  regular_1  avail_in_~0.1_hrs    knl&quad&cache  Resources
10414520  PD  descpho  phoSimK-20*  20    48:00:00  0:00  2018-02-20T19:46:32  regular_1  avail_in_~0.1_hrs    knl&quad&cache  Resources
10414530  PD  descpho  phoSimK-50*  50    48:00:00  0:00  2018-02-20T19:48:43  regular_1  avail_in_~0.1_hrs    knl&quad&cache  Resources

Jobs submitted last Monday are currently scheduled to run next Monday!

In the meantime, I am continuing to run the catalog 'trim' step manually on a KNL interactive node (max 4-hour limit).
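For reference, a minimal sketch of grabbing a KNL interactive node for this kind of manual work (the qos and constraint names follow standard NERSC usage at the time; the trim invocation itself is a hypothetical placeholder):

```bash
# Request one interactive KNL node for the 4-hour maximum
salloc -N 1 -C knl,quad,cache -q interactive -t 4:00:00

# ...then, on the allocated node, run the catalog trim step by hand, e.g.:
# ./run_trim.sh <visit-directory>    # placeholder for the actual trim wrapper
```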

TomGlanzman commented 6 years ago

Sunday 25 Feb 2018 Update

More of the same (see yesterday's report). The priorities at NERSC seem completely given over to huge MPI jobs - at the expense of the type we need for DC2-phoSim production... :(

salmanhabib commented 6 years ago

We need to bug someone at NERSC; the queueing system was always something I had complained about. I will send an email to Richard Gerber.

TomGlanzman commented 6 years ago

Monday 26 Feb 2018 Update

The first half of the day was a repeat of yesterday -- no activity. Then, at 5 minutes 'til noon, the first large SLURM jobs began to run: at first two jobs (each with 20 nodes), followed by another 20-node job. With these 60 nodes, about 2000 single-sensor-visit (raytrace) jobs are running.

The task summary page has been updated and is now a bit easier to read. The tables indicate the overall progress of the DC2-phoSim Run 1.1.


By the end of the day, 170 KNL nodes were online with phoSim and each node has been reserved for 48 hours. Good news!

salmanhabib commented 6 years ago

I sent an email to Richard Gerber yesterday -- let's see what he says.

salmanhabib commented 6 years ago

Ok, Richard got back to me. He will get someone to talk to Tom; what they want is an estimate of the required throughput from us. You can provide that, right, Tom?

TomGlanzman commented 6 years ago

Tuesday 27 Feb 2018 Update

TomGlanzman commented 6 years ago

Wednesday 28 Feb 2018 Update

An interesting conversation with a couple of NERSC folks this afternoon. One reason for the poor SLURM response this past weekend was a massive 9,000-node MPI job that ran for several days. (Cori-KNL has a total of about 9,800 nodes, so this had a serious impact on the rest of us.) Other than that, they reaffirmed that the only strategies for better throughput they can offer are:

  1. Limit [Pilot] jobs to run fewer hours (DC2 jobs request the maximum of 48 hours to be efficient)
  2. Make a reservation (this requires one to specify an exact # nodes and run time, plus up to several days for nodes to drain to make room...if we do not see some job action by tomorrow mid-day, I may request a reservation, although this raises operational issues)

Neither of these two strategies is a silver bullet, and both will end up costing our allocation.

I also mentioned the lack of accuracy in the SLURM job dispatch estimates (in response to the 'sqs' command, for example), the burstiness of SLURM dispatch, the inability to execute a controlled ramp-up of production, and the inability to control the rate at which jobs start. Maybe a fresh look will help solve some of these problems.

TomGlanzman commented 6 years ago

Thursday 1 Mar 2018 Update

While the number of Pilot jobs is decreasing after running out the 48-hour clock, SLURM has not been kind to us in replenishing this effort. There remain dozens of jobs held hostage in the queue, some for over one week. Perhaps Cori is preparing for another mega-job and not allowing our 48-hour jobs to backfill... The question now is whether to request a 'reservation', which itself would take several days to process. From today's standpoint, something like 300-400 nodes for 2-3 days would make a huge dent in the remaining work. But one cannot predict whether or how many of the existing queued jobs may start to run before a reservation could be put in place.

JOBID             ST   USER         NAME         NODES REQUESTED     USED         SUBMIT                   QOS              SCHEDULED_START       FEATURES         REASON                               
10601130          PD   stanier      ikink        2048   24:00:00     0:00         2018-02-28T21:07:31   regular_0           2018-03-02T14:00:00   knl&quad&cache   Resources
10535604          PD   u6338        fullKahuna*  5600   36:00:00     0:00         2018-02-26T13:33:57   regular_0           2018-03-02T14:00:01   knl&quad&cache   Resources
10599218          R    heitmann     test_knowh*  6144   8:00:00      28:45        2018-02-28T19:46:13   regular_0           2018-03-01T13:39:41   knl&quad&cache   None 
10599371          PD   heitmann     test_knowh*  6144   24:00:00     0:00         2018-02-28T19:54:40   regular_0           2018-03-04T02:00:00   knl&quad&cache   Resources
10599382          PD   heitmann     test_knowh*  6144   24:00:00     0:00         2018-02-28T19:55:00   regular_0           2018-03-05T02:00:00   knl&quad&cache   Resources
10599385          PD   heitmann     test_knowh*  6144   24:00:00     0:00         2018-02-28T19:55:07   regular_0           avail_in_~0.1_hrs     knl&quad&cache   Priority
10599394          PD   heitmann     test_knowh*  6144   24:00:00     0:00         2018-02-28T19:55:18   regular_0           avail_in_~0.1_hrs     knl&quad&cache   Priority
10535600          PD   u6338        allKahuna_*  9200   2:00:00      0:00         2018-02-26T13:33:53   regular_0           2018-03-02T11:17:09   knl&quad&cache   Resources

There is a 9,200-node mega-job, but it runs for only 2 hours. The five 6,144-node jobs, however, represent 63% of Cori-KNL, and four of them run for 24 hours each - which may be our biggest competition for the coming days.

TomGlanzman commented 6 years ago

Friday 2 Mar 2018 Update

There are 550 complete visits, basically where we were yesterday. NERSC has ramped us down to a mere 40 nodes. There are 41 Pilots (880 nodes) waiting -- for up to 9 days -- to start running. So, while essentially idle, we wait...

TomGlanzman commented 6 years ago

Weekend 3-4 Mar 2018 Update

TomGlanzman commented 6 years ago

Monday 5 Mar 2018 Update

Run 1.1, although incomplete, is winding down. 713 visits are complete (36%).

Preparations for phoSim testing are underway in #140.

TomGlanzman commented 6 years ago

Monday 2 Apr 2018 Update

Testing in preparation for resuming production is underway. With new phoSim background parameters and updated catalog generation, this new project is dubbed "Run 1.2p". A single visit with a non-production catalog generation is running here.

A NERSC 'reservation' has been requested (100 KNL nodes for 24 hours) to help jump-start this project. If granted, this reservation will begin sometime Thursday, 5 Apr 2018. Hopefully all production code will be in place by then.

TomGlanzman commented 6 years ago

Tuesday 3 Apr 2018 Update

09:45 - The first Run 1.2p test run continues and the first six (of 32) sensors for this visit have completed. A first quick look at the FITS files indicates reasonable image files. (Chris W's worry about commented overrides being interpreted by phoSim did not come to pass; the FITS headers show that those commented background parameters were properly ignored.) There is a ~2x difference in background across the field of view in these first few sensors. Interested parties are invited to look at both 'electron' and 'amplifier/ADC' image files, which continue to populate this directory: /global/projecta/projectdirs/lsst/production/DC2/DC2-R1-2p-WFD-r/output/000000
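For anyone wanting to make the same kind of quick check, a sketch using the astropy command-line helpers (assuming an environment with astropy set up; the sensor name is illustrative, and one can grep the header output for whichever background-related keywords are of interest):

```bash
# List the HDUs and dump the header of one 'electron' image from the test visit
IMG=/global/projecta/projectdirs/lsst/production/DC2/DC2-R1-2p-WFD-r/output/000000/lsst_e_*_R01_S02_E000.fits.gz
fitsinfo   $IMG
fitsheader $IMG | less
```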

11:00 - All 32 sensor-visits for the test visit have completed.

TomGlanzman commented 6 years ago

Friday 6 Apr 2018 Update

DC2 Run 1.2p began yesterday evening! Five single-node 'premium' class SLURM Pilot jobs were submitted to get the ball rolling.

A fortuitous decision to "hold" rather than "cancel" a batch of SLURM jobs from Run 1.1p in February has led to the discovery that, once released, these jobs start very quickly. There are 31 such "held" jobs and they collectively represent 680 nodes, each for 48 hours. Eleven of these 20-node jobs have been released and are now running. A good way to start the production!
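For reference, the hold/release mechanics are plain SLURM commands; a minimal sketch (job IDs are illustrative):

```bash
# Put a pending Pilot job on hold instead of cancelling it
scontrol hold 10414530

# Release it later; the observation above is that released jobs start quickly
scontrol release 10414530

# Release a batch of held jobs in one go
for jid in 10414515 10414517 10414518; do scontrol release "$jid"; done
```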

11:00 status:

  • 16 SLURM Pilot jobs running on 225 hosts (14,400 cores)
  • 115 sensor-visits complete
  • 7350 jobs running

katrinheitmann commented 6 years ago

Excellent news! Glad we gave this a try rather than killing the jobs back then.


TomGlanzman commented 6 years ago

Saturday 7 Apr 2018 Update

Thanks to those old SLURM jobs, production continues to run at a good pace. Fresh jobs submitted yesterday are slowly moving up in the queue. We have a 100-node 24-hour reservation coming up Monday morning at 10am. With all of these factors, production may smoothly continue even as the old jobs terminate.

13:00 status:

TomGlanzman commented 6 years ago

Monday 9 Apr 2018 Update

A productive and reasonably stable weekend; today we will begin again to ramp up the number of nodes. A 100-node 24-hour reservation begins today at 10am.

08:00 status:

TomGlanzman commented 6 years ago

Tuesday 10 Apr 2018 Update

A good 24 hours of normal running, increasing the number of nodes to 760.

07:50 status:

For the WFD y-band workflow, the execution-time distribution for the 622 successful sensor-visits looks like this: (figure: wfd-y)

For the WFD z-band, only 36 of 1009 attempts completed successfully. The execution-time distribution appears below: (figure: wfd-z)

15:00 status: Run 1.2p is losing Cori-KNL nodes; as jobs complete, new jobs are not starting quickly enough to replace them. Thus, only 470 nodes are currently active (although 2461 have been requested via SLURM).

TomGlanzman commented 6 years ago

Wednesday 11 Apr 2018 Update

The ramp-up of Run 1.2p has stalled and even lost momentum, because NERSC is not running jobs that have now been in the queue for over five (5) days. At 09:45 I released the final set of jobs submitted in February (on the 22nd), so we will see whether those three 20-node jobs start running soon.

09:45 status:

TomGlanzman commented 6 years ago

Thursday 12 Apr 2018 Update

As of this morning the pipeline has completely run dry: there are zero jobs running at present. Despite having >3000 nodes of batch jobs in the queues, some of which have been waiting for nearly a week, the NERSC scheduler is not providing computing resources, so production has ground to a full stop.

07:45 status:

09:30 update: The DC2 production has stopped due to full-machine reservations associated with the yearly Gordon Bell challenge, for which the deadline is this coming Sunday. The current set of reservations started this morning at 09:00 and lasts for 27 hours. Hopefully DC2 jobs will again start to run shortly after noon tomorrow (Friday).

Noon update: The full machine reservations were cancelled at the last moment. Jobs have started to run again. We are currently up to 480 nodes ...oops, make that 430 (apparently 50 nodes crashed!)

TomGlanzman commented 6 years ago

Friday 13 Apr 2018 Update

Production has now been running for one week. After a 2-day ramp-down, an 8-hour outage and the false threat of a lengthier (27-hour 'reservation') outage to follow, production quickly ramped back up to just under 500 nodes where it has remained since yesterday noon.

07:40 status:

Note that ~4M NERSC hours have been spent on Run 1.2p thus far and it is nearly 25% complete.

TomGlanzman commented 6 years ago

Saturday 14 Apr 2018 Update

Something bad happened just after midnight this morning which caused 600 nodes to crash (at least, that is what the clues point to). Further, whatever caused this continued to plague new jobs all the way up to 08:00 this morning. The end result is that >81,000 jobs were abnormally terminated. It is a technical challenge simply to roll back such a large number of jobs :( (A NERSC ticket has been submitted to inquire about this episode, but we may not hear back until Monday.)

13:40 status:

TomGlanzman commented 6 years ago

Sunday 15 April 2018 Update

Recovery from yesterday's mishap and, perhaps surprisingly, continued ramp up.

13:50 status:

18:00 status:

TomGlanzman commented 6 years ago

Monday 16 Apr 2018 Update

A surprisingly good weekend -- after recovering from the Saturday morning meltdown.

07:45 status:


Another look at our NERSC allocation... Between 5 April and today, 9,957,846 billable NERSC hours have been consumed. In this time, 54,929 raytrace jobs have been successfully completed, representing about 34% of the total. Assuming that Run 1.2p is the only significant consumer of resources, one might conclude that completing this run might require roughly 30M NERSC hours. Why is this number so high?

Firstly, in addition to the 54,929 successful raytrace jobs, there were also 12,497 long-running jobs that timed out after 48 hours and, thus, represent wasted effort because they will all need to be started again from the beginning. [We are currently running 33 instances of raytrace per Cori-KNL node, thus a single timed-out instance of raytrace consumes 48*96/33 = 140 NERSC hours. 12,497 such jobs thus represent 1.7M NERSC hours, or about 17.5% of the total.]

Secondly, whenever a Pilot job times out (after 48 hours), any partially completed jobs crash and must be restarted (unless that job ran for the full 48 hours, in which case it is ignored at present). Given that Run 1.2p jobs are taking 20 to >48 hours to complete, there is a significant inefficiency, possibly as high as 40%.
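A compact restatement of the accounting above (taking the factor of 96 to be the Cori-KNL charge factor per node-hour, which is my assumption):

```bash
# Back-of-envelope Run 1.2p accounting, using the numbers quoted in the text
awk 'BEGIN {
  hours_used = 9957846                 # billable NERSC hours, 5-16 April
  frac_done  = 0.34                    # fraction of raytrace jobs completed
  printf "Projected total for Run 1.2p : ~%.0fM NERSC hours\n", hours_used/frac_done/1e6

  per_timeout = 48*96/33               # NERSC hours per timed-out raytrace instance
  timed_out   = 12497
  wasted      = per_timeout*timed_out
  printf "Per timed-out raytrace       : ~%.0f NERSC hours\n", per_timeout
  printf "Wasted on time-outs          : ~%.1fM NERSC hours (%.1f%% of hours used)\n",
         wasted/1e6, 100*wasted/hours_used
}'
```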