LSSTDESC / DC2-production

Configuration, production, validation specifications and tools for the DC2 Data Set.

protoDC2 (Run 1.2p) phoSim operations log #65

Closed TomGlanzman closed 6 years ago

TomGlanzman commented 6 years ago

[updated for Run 1.2p] This issue will be a log/diary of the operational progress of the protoDC2 phoSim image generation at NERSC. This issue is not intended to be a venue for discussing phoSim configuration (see, for example, #19, #33, #134, #140 and #163) or results. A few technical details about the workflow itself can be found here.

As data accumulate, you may find the image files in this directory tree (for the WFD field and r-filter): /global/projecta/projectdirs/lsst/production/DC2/DC2-R1-2p-WFD-r/output. Each subdirectory corresponds to a single visit. The phoSim working directories (in $SCRATCH) are here (again, for the WFD field and r-filter): /global/cscratch1/sd/descpho/Pipeline-tasks/DC2-R1-2p-WFD-r and are similarly organized in subdirectories, one per visit.
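For example, a quick way to see how many visits have produced output so far, and how many image files the first visit contains (a simple sketch using the WFD r-filter path above):

```bash
# Count visit subdirectories with output, then image files in the first visit
OUT=/global/projecta/projectdirs/lsst/production/DC2/DC2-R1-2p-WFD-r/output
ls -d "$OUT"/0* 2>/dev/null | wc -l                     # number of visit subdirectories
ls "$OUT"/000000/lsst_e_*.fits.gz 2>/dev/null | wc -l   # image files for the first visit
```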

Real-time monitoring of the 12 workflows:

Each field (WFD and uDDF) and band (u, g, r, i, z, y) combination has a fixed number of visits, per the following table.

| Band | Survey | #Visits | Mean #sensors/visit |
|------|--------|---------|---------------------|
| u | WFD | 67 | 72 |
| g | WFD | 91 | 67 |
| r | WFD | 245 | 75 |
| i | WFD | 223 | 73 |
| z | WFD | 247 | 73 |
| y | WFD | 252 | 72 |
| u | uDDF | 192 | 88 |
| g | uDDF | 138 | 88 |
| r | uDDF | 138 | 88 |
| i | uDDF | 137 | 88 |
| z | uDDF | 136 | 88 |
| y | uDDF | 135 | 88 |
| - | TOTAL | 2001 | 79 |

Approx total sensor-visits = 158,766
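As a cross-check, the sensor-visit total follows directly from the table (the per-visit sensor counts are rounded means, so this only approximately reproduces the quoted 158,766):

```bash
# Sum (#visits x mean sensors/visit) over the 12 field/band rows above
awk 'BEGIN {
  split("67 91 245 223 247 252 192 138 138 137 136 135", v, " ")   # visits, in table order
  split("72 67 75 73 73 72 88 88 88 88 88 88",           s, " ")   # mean sensors/visit
  for (i = 1; i <= 12; i++) total += v[i]*s[i]
  printf "Approximate total sensor-visits: %d\n", total            # ~158.8k with rounded means
}'
```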

TomGlanzman commented 6 years ago

Production has started, although it would not be surprising if a problem is discovered that forces a halt and a restart from the beginning. A last-minute decision: the phoSim amplifier file output has been disabled due to problems with those files.

TomGlanzman commented 6 years ago

The NERSC queues are not behaving nicely. A single-node 24-hour job submitted on Monday morning is now scheduled to run tomorrow at the earliest. To jump-start the process, a 10-node 8-hour KNL Pilot was submitted this afternoon and, lo!, it has started. There are now >300 raytrace processes running.
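For context, a KNL "Pilot" submission is an ordinary SLURM batch job; a rough sketch of what the 10-node, 8-hour request might look like (the job name mirrors those seen later in the thread, and the wrapper script is a placeholder -- the real Pilots are launched by the workflow engine):

```bash
# Illustrative SLURM request resembling the 10-node, 8-hour KNL Pilot described above
sbatch -N 10 -t 8:00:00 \
       -C knl,quad,cache \
       -q regular \
       -J phoSimK-10 \
       pilot_launch.sh     # placeholder for the actual Pilot wrapper script
```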

A question arose earlier about how long it might take to complete the ~8000 visits in protoDC2. Given that changes were made in the phoSim command file just yesterday, we won't have good timing data until a sufficient number of sensor-visits have completed. Hopefully tomorrow...

We're off and running!

TomGlanzman commented 6 years ago

Thursday 21 Dec Update

Some success over the past 12 hours: 1865 sensor-visits completed (mostly on KNL) using the latest phoSim command file (ref #63). Based on these runtime performance statistics, a first estimate of total protoDC2 resource and time consumption may be made.

| Band | #Visits |
|------|---------|
| u | 534 |
| g | 776 |
| r | 1782 |
| i | 1795 |
| z | 1612 |
| y | 1581 |
| Total | 8080 |

Using average values:

This amount of processing could, under unrealistically ideal conditions, be performed in less than one week using ~1000 nodes. The trick will be keeping jobs running efficiently. The main challenges include:

Monitoring the workflow

The main r-filter workflow monitor is here. An experimental Pilot job monitor is here.

21:45 update - production is accelerating. All 1782 r-filter visits have been submitted. Current performance graphs for the raytrace step: plot-6

TomGlanzman commented 6 years ago

Saturday 23 Dec 2017 Update:

As of 07:30 PST, over 19,000 sensor-visits have been completed (representing ~3% of the anticipated 606k total sensor-visits in the current visit lists). The challenge has been keeping SLURM jobs running. Short jobs (8-10 hours) seem to start up within a few hours, but they suffer from serious inefficiency when they end -- taking many partially run raytrace jobs down with them. Long jobs (24 hours) spend many days in the queue before they begin running.

When a block of Cori-KNL nodes does begin to run, this is often accompanied by a set of failed raytrace jobs -- which fail when attempting to access one of the phoSim site or instrument files. These jobs fail almost immediately (hence, a small impact on efficiency) and can easily be rolled back. My guess is that the shock of starting hundreds of jobs simultaneously is putting a strain on the connection to the file system. An annoyance, but not (yet) serious. Side note: 50-node KNL jobs are the largest submitted so far. As experience accrues, larger jobs will be submitted.

The plan going forward will be to attempt keeping a mix of short-running and long-running jobs in the cori queues in the hopes of improving overall utilization. A plot of KNL usage vs. time is beginning to take form here: https://portal.nersc.gov/project/lsst/glanzman/graph3.html

13:15 UPDATE: Due to issues described here, all jobs have been cancelled or held pending resolution. Production will eventually be restarted from the beginning.

salmanhabib commented 6 years ago

@TomGlanzman Starting hundreds of jobs should not put a "strain on the connection to the file system." In principle, hundreds of jobs is nothing to worry about -- you should file a ticket with NERSC about this. Something is not working correctly.

TomGlanzman commented 6 years ago

Tuesday 16 Jan 2018 Update:

Updating workflow to reflect changes/fixes since the December run.

Changes:

  1. gcr-catalogs updated from github (master), checked out the pre-new-year version, and built the aux file:

     git clone https://github.com/LSSTDESC/gcr-catalogs.git
     cd gcr-catalogs
     git checkout 204c504bd785fc9127a01c3c5f9a24640b3e7583
     cd GCRCatSimInterface/data
     source /global/common/software/lsst/cori-haswell-gcc/stack/setup_w_2017_46_py3_gcc6.sh
     setup lsst_sims
     python get_sed_mags.py

  2. instanceCatalog generation option changed from '--descqa_cat_file proto-dc2_v2.1.1' to 'protoDC2'

  3. All previous output from the December 2017 trial run (Run 1.0a) has been temporarily moved aside in preparation for deletion. Please let me know if it is necessary to preserve these data from this early attempt.

The initial new data (notionally, Run 1.0b) will be generated from workflows DC2-phoSim-2-r version 1.000 and DC2-phoSim-2-i version 1.000. The first visits in each of these two bands are running now. Visits simulated are the same as those used in December (but only r-band exists for the December run), thus a comparison will be possible between Jan and Dec images.

The exact amount of data to be produced is still under discussion, although 3-4 visits in all six bands has been put forward.

TomGlanzman commented 6 years ago

Wednesday 17 Jan 2018 Update:

katrinheitmann commented 6 years ago

That is indeed unfortunate!

How about getting a reservation for this? I can send a quick email to Debbie and Peter to ask how quickly that can be set up (I got one within a week last year for a 6000 node reservation, so we should get something smaller much quicker).

Questions for you if this is viable: for how long would you need the reservation and for how many nodes? When would be a reasonable time for you to have the reservation (you would want to watch things carefully when they run so that the machine doesn't end up idling).

Please let me know if you think this is a good idea and we can start the process right now (well, almost right now).

Thanks, Katrin

On 1/17/18 12:32 PM, Tom Glanzman wrote:

Wednesday 17 Jan 2018 Update:

  • Run 1.0b jobs are running, but slowly. Individual sensor-visits are currently taking >240 min (clock time) in the RayTrace step (using 8 threads), so cannot be run in NERSC's "qos=interactive" service 😢. The average run time for this step (based on small statistics) is ~255 minutes (rotten luck!), but with tails extending to 350 min. Therefore, these steps must be done using the normal (non-interactive) batch queue, which means waiting many hours up to a *week* for jobs to start.

  • An easy way to monitor the overall progress is with this Pipeline status page <http://srs.slac.stanford.edu/Pipeline-II/exp/LSST-DESC/index.jsp?versionGroup=latestVersions&submit=Filter&d-4021922-s=1&d-4021922-o=2&taskFilter=DC2-phoSim&include=last30>; the number in the "green check" column represents the number of successfully completed visits for each filter.

  • For this run ("Run 1.0b"), five (5) visits for each of the six filters will be simulated, both to test the phoSim configuration and the downstream pipeline. The new data are populating these directories in NERSC:/global/projecta/projectdirs/lsst/production/DC2:

    DC2-phoSim-2-u/output
    DC2-phoSim-2-g/output
    DC2-phoSim-2-r/output
    DC2-phoSim-2-i/output
    DC2-phoSim-2-z/output
    DC2-phoSim-2-y/output

    For each band's 'output' directory, there is one sub-directory per visit. The visit sub-directory name, e.g., 000000, is an index representing its order in the visit catalog. The visitID can be obtained by looking into the visit directory files; e.g., DC2-phoSim-2-r/output/000000 contains:

    -rw-rw----+ 1 descpho lsst   886095 Jan 17 09:47 centroid_lsst_e_158370_f1_R01_S02_E000.txt
    -rw-rw----+ 1 descpho lsst   837341 Jan 17 09:51 centroid_lsst_e_158370_f1_R02_S22_E000.txt
    -rw-rw----+ 1 descpho lsst  1082843 Jan 17 10:10 centroid_lsst_e_158370_f1_R13_S01_E000.txt
    -rw-rw----+ 1 descpho lsst   826844 Jan 17 09:44 centroid_lsst_e_158370_f1_R13_S12_E000.txt
    -rw-rw----+ 1 descpho lsst   851356 Jan 17 09:35 centroid_lsst_e_158370_f1_R24_S02_E000.txt
    -rw-rw----+ 1 descpho lsst 25838119 Jan 17 09:47 lsst_e_158370_f1_R01_S02_E000.fits.gz
    -rw-rw----+ 1 descpho lsst 25201205 Jan 17 09:51 lsst_e_158370_f1_R02_S22_E000.fits.gz
    -rw-rw----+ 1 descpho lsst 25193942 Jan 17 10:10 lsst_e_158370_f1_R13_S01_E000.fits.gz
    -rw-rw----+ 1 descpho lsst 25125607 Jan 17 09:44 lsst_e_158370_f1_R13_S12_E000.fits.gz
    -rw-rw----+ 1 descpho lsst 25070875 Jan 17 09:35 lsst_e_158370_f1_R24_S02_E000.fits.gz

    The file lsst_e_158370_f1_R01_S02_E000.fits.gz is an image file for visit 158370, using filter 1 ('r'), for sensor R01_S02, and 'snap' 000 (only one snap per visit for this project). (A sketch for decoding these file names appears after this list.)

  • One may also compare the results of Run 1.0a (December) with these new data. The old data, r-band only, reside here: /global/projecta/projectdirs/lsst/production/DC2/old.Dec2017/DC2-phoSim-2-r/output, one visit per sub-directory, as above.
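For anyone scripting over these outputs, the file-name fields (visit, filter index, raft, sensor, snap) can be pulled apart with plain bash; a minimal sketch, with the field positions inferred from the example name above (treat that layout as an assumption):

```bash
#!/bin/bash
# Decode a phoSim eimage name of the form
#   lsst_e_<visitID>_f<filter>_R<rr>_S<ss>_E<snap>.fits.gz
f="lsst_e_158370_f1_R01_S02_E000.fits.gz"

base=${f%.fits.gz}                              # strip the extension
IFS='_' read -r _ _ visit filt raft sensor snap <<< "$base"

echo "visit  : $visit"                          # 158370
echo "filter : ${filt#f}"                       # 1 (phoSim filter index)
echo "sensor : ${raft}_${sensor}"               # R01_S02
echo "snap   : ${snap#E}"                       # 000
```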


TomGlanzman commented 6 years ago

A reservation is a possibility, although the lead time required for a reservation is comparable to the time spent waiting for a batch job to run. At the moment, there are (surprisingly) 46 KNL nodes running, which will handle Run 1.0b (5 visits x 6 filters).

katrinheitmann commented 6 years ago

Hi Tom,

OK then. Usually the machines are much less busy in January because lots of new projects start and people are not immediately ready to go for full-up runs, so this is not too surprising. But it's great that these nodes are available for testing already.


TomGlanzman commented 6 years ago

Thursday 18 Jan 2018 Update:

Much progress for Run 1.0b (five visits for each of six filters). As of 07:50 PST fully 83% of all sensor-visits have successfully completed. Run statistics are beginning to shape up and currently look like this:

| Task | Sensor-visits | Mean clock time (raytrace) |
|------|---------------|----------------------------|
| DC2-phoSim-2-u | 422/431 complete | 219 +/- 71 min |
| DC2-phoSim-2-g | 509/512 complete | 214 +/- 37 min |
| DC2-phoSim-2-r | 379/380 complete | 260 +/- 46 min |
| DC2-phoSim-2-i | 224/224 complete | 234 +/- 62 min |
| DC2-phoSim-2-z | 142/377 complete | 705 +/- 50 min |
| DC2-phoSim-2-y | 336/500 complete | 356 +/- 124 min |
| Total | 2012/2424 complete (83%) | |

TomGlanzman commented 6 years ago

Friday 19 Jan 2018 Update:

As of 16:25 PST, there are 31 sensor-visits still running. These stragglers are mostly in the z-band, with a couple in the y-band. These two bands are requiring significantly more CPU effort per visit than the other bands. The other four bands are complete (5 visits each).

TomGlanzman commented 6 years ago

Monday 22 Jan 2018 Update:

The final sensor visits completed early Sunday morning (yesterday), so Run 1.0b is complete.

However: Due to a speckling issue, a new version of phoSim has been released (v3.7.7). Heather has installed the new code at NERSC and the plan is to reprocess the very first r-band visit, DC2-phoSim-2-r stream 000000, visitID 151687. The old (v3.7.6) data has been moved aside into this directory: /global/projecta/projectdirs/lsst/production/DC2/DC2-phoSim-2-r/output/000000.v3.7.6 while the new (v3.7.7) data will be placed here: /global/projecta/projectdirs/lsst/production/DC2/DC2-phoSim-2-r/output/000000

Jobs are running now with an ETA to completion around 20:00 Pacific this evening.

Note: there may be yet another phoSim code release as the root cause of the speckling is understood and fixed.

TomGlanzman commented 6 years ago

Tuesday 23 Jan 2018 Update:

The test jobs (a single r-band visit) using phoSim v3.7.7 completed last evening but continue to show the speckling problem. The PhoSim team has now reproduced the problem and will advise when new code is available for testing.

In the meantime, preparations for Run 1.1 continue...


(LATER) The PhoSim team has released v3.7.8 and Heather has installed it. Rolling back the first visit in the r-band (visitID 151687). The previous test output has been moved aside into: /global/projecta/projectdirs/lsst/production/DC2/DC2-phoSim-2-r/output/000000.v3.7.7 while the new (v3.7.8) data will be placed here: /global/projecta/projectdirs/lsst/production/DC2/DC2-phoSim-2-r/output/000000

The job started at 10:37, so it will likely be a few hours before the first sensor-visit rolls off the assembly line.


NOTE: I discovered that the method used to determine the type of processor in use (Haswell vs. KNL) started to fail just after the new allocation-year changes went into effect on January 9th. Thus, all Run 1.0b raytrace jobs that ran on KNL used code optimized for Haswell. The only impact is on execution time. This problem has now been fixed.
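For reference, one robust way to tell the two Cori architectures apart at run time is to inspect the CPU flags rather than any allocation-year-dependent setting: the AVX-512ER instructions exist only on Knights Landing. A minimal sketch (not necessarily the method the workflow actually uses):

```bash
# Pick the processor-specific phoSim build by inspecting CPU capabilities
if grep -q -m1 avx512er /proc/cpuinfo; then
    arch="knl"            # AVX-512ER is specific to Knights Landing
else
    arch="haswell"
fi
echo "Running on a ${arch} node"
# export PHOSIM_ROOT=/path/to/phosim-${arch}   # hypothetical install layout
```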

johnrpeterson commented 6 years ago

Fixed, Tom. Please use phosim v3.7.8.

TomGlanzman commented 6 years ago

Wednesday 24 Jan 2018 Update:

A test of phoSim v3.7.8 indicates the speckling problem seems to have been solved; see https://github.com/LSSTDESC/DC2_Repo/issues/69#issuecomment-359990670 and #105. Thanks to the PhoSim team!

No production running today -- only various tests (#101) and development (#82) in preparation for Run 1.1.

TomGlanzman commented 6 years ago

Friday 16 Feb 2018 Update:

Production for Run 1.1 (phosim) is imminent. Initial test of revamped instanceCatalog generation, dynamic SEDs, updated visit lists, and other config changes and bug fixes has been completed for a single visit in the r-band (WFD). Data products for this test are here:

/global/projecta/projectdirs/lsst/production/DC2/DC2-phoSim-3_WFD-r/output/000001

Note that this test does NOT use the final configuration: a final tag of DC2_Repo and agreement on the phoSim "--fov" parameter are still needed -- so these data will be overwritten with production data in the near future.

Run 1.1 consists of the following visits.

| Band | Survey | #Visits |
|------|--------|---------|
| u | WFD | 67 |
| g | WFD | 91 |
| r | WFD | 245 |
| i | WFD | 223 |
| z | WFD | 247 |
| y | WFD | 252 |
| u | uDDF | 192 |
| g | uDDF | 138 |
| r | uDDF | 138 |
| i | uDDF | 137 |
| z | uDDF | 136 |
| y | uDDF | 135 |
| - | TOTAL | 2001 |

TomGlanzman commented 6 years ago

A new DC2 phoSim test visit has been completed, obsHistID=181866 with 63 sensors simulated. Data products are here: /global/projecta/projectdirs/lsst/production/DC2/DC2-phoSim-3_WFD-r/output/000002 . The biggest change since yesterday's test was the change in the --fov parameter from 0.5 to 2.1 (degrees) during instanceCatalog generation. Please have a look.

TomGlanzman commented 6 years ago

Monday 19 Feb 2018 Update:

The ducks are lined up, so we begin production for Run 1.1, starting with r-band, WFD.

TomGlanzman commented 6 years ago

Tuesday 20 Feb 2018 Update:

Production is ramping up. All 245 visits for r-band WFD have been submitted. Twenty-five visits for each of the remaining 11 configurations have also been submitted.

There have been a few failures: 1) the expected failures when a Pilot terminates (normally, after it times out); and 2) the familiar issue of phoSim failing to read one of its many files during initialization. This latter issue arises when many phoSim instances attempt to start (nearly) simultaneously. A simple rollback solves the problem.

At 20:50 today, 425 sensor-visits have completed and 1904 are running. I expect the number running to increase as new KNL Pilot jobs start to run.

Here is a link to view all 12 workflows on one page.

Addendum: A new failure is beginning to appear in the instanceCatalog generation. This has been reported here.

As the starting benchmark, our NERSC allocation now stands at:

19:51 2/20/2018  m1727 cpu balance = 94,572,858.2

TomGlanzman commented 6 years ago

Thanks @johannct, I actually provided tricklestream to Stephan some years ago so am quite familiar with its use. In this case, we do not yet know enough to use this tool effectively -- we are still in the process of discovering the limits of the instanceCatalog generation step. Also, given that DC2-phoSim consists of 12 independent pipeline tasks, trickleStream will need some work (and thought) to make it useful.

TomGlanzman commented 6 years ago

Wednesday 21 Feb 2018 Update:

Production continues to ramp up. At the moment (09:00) there are 27 fully completed visits and 4128 completed sensor-visits (which includes both fully and partially completed visits). 1561 sensor-visits are currently running on 46 KNL nodes (about 34 phoSim instances/node). Another 200 nodes have been requested but are not yet running.

A couple of operational issues arose in the last 12 hours:

#135 - solved and closed

#136 - open

10:25 - Create a new DB staging area in $SCRATCH and make copies of the agn and minion (OpSim) databases, then point the instanceCatalog generator at those copies. Let's see if that improves scalability... (a sketch of this staging step appears after these entries).

12:32 - update the agn database after @danielsf modifies the file to eliminate the creation of two temporary files (which were creating problems elsewhere).

16:10 - All remaining visits for the 12 visit categories have now been submitted. The program is: submit Pilots; wait; rollback; repeat. Currently, 48 visits (of 2001) have completed.
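A minimal sketch of the database staging mentioned in the 10:25 entry; the source paths, file names, and environment-variable hooks are illustrative assumptions, not the actual configuration:

```bash
# Stage read-only copies of the input databases onto $SCRATCH so that many
# concurrent instanceCatalog jobs do not all read the same project-space files.
DBSTAGE="$SCRATCH/DC2/DBstaging"
mkdir -p "$DBSTAGE"

# Hypothetical source locations -- substitute the real ones
cp -v /path/to/agn_db.db      "$DBSTAGE/"
cp -v /path/to/minion_1016.db "$DBSTAGE/"

# Point the instanceCatalog generator at the staged copies (hypothetical hooks)
export AGN_DB="$DBSTAGE/agn_db.db"
export OPSIM_DB="$DBSTAGE/minion_1016.db"
```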

TomGlanzman commented 6 years ago

Thursday 22 Feb 2018 Update:

As of 09:00 there are 66 fully completed visits (3% of Run 1.1's 2001 visits) with many more in progress. Cori utilization has been rather low the past 24 hours: a max of 46 nodes dropping down to a mere 8 for several hours this morning. About 1000 nodes have been requested, so we simply must wait for SLURM to let them run.

The good news is that both issues opened yesterday have been solved and closed. In addition to the expected job terminations due to Pilot time-outs, a new failure is starting to appear: segmentation faults in phoSim's atmosphere creator, which seem correlated with many simultaneous instances (in the 30-50 range). This problem appears to be transient -- jobs generally succeed upon rollback -- but it bears watching.

Handy links (note some are updated only every few minutes):

TomGlanzman commented 6 years ago

Friday 23 Feb 2018 Update:

As of 10:30 today, there are 76 fully completed visits, representing 4% of the total. Production has ground to a virtual halt as the small set of Pilot jobs that started running 2 days ago have all completed. Submitted Pilot jobs are simply not running! There are jobs submitted on the 19th that are still waiting for a chance to run. SLURM indicates that up to 170 nodes may become available in ~9.4 hours, but those estimates are largely unreliable. Perhaps there will be a surge of activity starting this evening. In the meantime, I am using various tricks to prepare as many waiting 'raytrace' steps as possible.

While waiting, I put together an experimental task monitoring page to give me a better global view of what the 12 tasks are up to. It is not very pretty but it's a start. (Hint: the good stuff is at the bottom.)

drphilmarshall commented 6 years ago

Thanks @TomGlanzman ! Sorry to hear we are not yet up to LSST data rates - but then I expect commissioning might be a bit like this too ;-)

While we are waiting for the conditions to improve, would you mind breaking down that 4% by band and survey please? Here's an extended version of the table you made further up the thread; is it easy for you to fill it in (or automagically over-write the table below)? You can edit this message no problem, or just paste in your own version in a subsequent comment. I'm interested to see what kind of DM processing we could be doing. Thanks!

| Band | Survey | Target # Visits | Completed # Visits | % complete |
|------|--------|-----------------|--------------------|------------|
| u | WFD | 67 | 0 | 0 |
| g | WFD | 91 | 2 | 2 |
| r | WFD | 245 | 67 | 27 |
| i | WFD | 223 | 1 | 0.4 |
| z | WFD | 247 | 4 | 1.6 |
| y | WFD | 252 | 2 | 0.8 |
| u | DDF | 192 | 0 | 0 |
| g | DDF | 138 | 0 | 0 |
| r | DDF | 138 | 0 | 0 |
| i | DDF | 137 | 0 | 0 |
| z | DDF | 136 | 0 | 0 |
| y | DDF | 135 | 0 | 0 |
| - | TOTAL | 2001 | 76 | ~4 |

(Note from Tom: the values in the "Completed # Visits" column may be easily obtained from this workflow web page.)

TomGlanzman commented 6 years ago

Saturday 24 Feb 2018 Update

Not much happening today -- Pilot jobs submitted last Monday still have not started. Consider:

JOBID     ST  USER     NAME         NODES REQUESTED USED  SUBMIT               QOS        SCHEDULED_START      FEATURES        REASON    
10393524  PD  descpho  phoSimK-20*  20    48:00:00  0:00  2018-02-19T19:31:24  regular_1  2018-02-26T19:40:00  knl&quad&cache  Resources
10414515  PD  descpho  phoSimK-20*  20    48:00:00  0:00  2018-02-20T19:46:28  regular_1  2018-02-26T19:40:00  knl&quad&cache  Resources
10414517  PD  descpho  phoSimK-20*  20    48:00:00  0:00  2018-02-20T19:46:29  regular_1  avail_in_~0.1_hrs    knl&quad&cache  Resources
10414518  PD  descpho  phoSimK-20*  20    48:00:00  0:00  2018-02-20T19:46:30  regular_1  avail_in_~0.1_hrs    knl&quad&cache  Resources
10414519  PD  descpho  phoSimK-20*  20    48:00:00  0:00  2018-02-20T19:46:31  regular_1  avail_in_~0.1_hrs    knl&quad&cache  Resources
10414520  PD  descpho  phoSimK-20*  20    48:00:00  0:00  2018-02-20T19:46:32  regular_1  avail_in_~0.1_hrs    knl&quad&cache  Resources
10414530  PD  descpho  phoSimK-50*  50    48:00:00  0:00  2018-02-20T19:48:43  regular_1  avail_in_~0.1_hrs    knl&quad&cache  Resources

Jobs submitted last Monday are currently scheduled to run next Monday!

In the meantime, I am continuing to run the catalog 'trim' step manually on a KNL interactive node (max 4-hour limit).
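For reference, a minimal sketch of grabbing a KNL interactive node for this kind of manual work (the qos and constraint names follow standard NERSC usage at the time; the trim invocation itself is a hypothetical placeholder):

```bash
# Request one interactive KNL node for the 4-hour maximum
salloc -N 1 -C knl,quad,cache -q interactive -t 4:00:00

# ...then, on the allocated node, run the catalog trim step by hand, e.g.:
# ./run_trim.sh <visit-directory>    # placeholder for the actual trim wrapper
```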

TomGlanzman commented 6 years ago

Sunday 25 Feb 2018 Update

More of the same (see yesterday's report). The priorities at NERSC seem completely given over to huge MPI jobs - at the expense of the type we need for DC2-phoSim production... :(

salmanhabib commented 6 years ago

We need to bug someone at NERSC; the queueing system was always something I had complained about. I will send an email to Richard Gerber.

TomGlanzman commented 6 years ago

Monday 26 Feb 2018 Update

The first half of the day was a repeat of yesterday -- no activity. Then, at 5 minutes 'til noon, the first large SLURM jobs began to run: at first two jobs (each with 20 nodes), followed by another 20-node job. With these 60 nodes, about 2000 single-sensor-visit (raytrace) jobs are running.

The task summary page has been updated and is now a bit easier to read. The tables indicate the overall progress of the DC2-phoSim Run 1.1.


By the end of the day, 170 KNL nodes were online with phoSim and each node has been reserved for 48 hours. Good news!

salmanhabib commented 6 years ago

I sent an email to Richard Gerber yesterday -- let's see what he says.

salmanhabib commented 6 years ago

Ok, Richard got back to me. He will get someone to talk to Tom; what they want is an estimate of the required throughput from us. You can provide that, right, Tom?

TomGlanzman commented 6 years ago

Tuesday 27 Feb 2018 Update

TomGlanzman commented 6 years ago

Wednesday 28 Feb 2018 Update

An interesting conversation with a couple of NERSC folks this afternoon. One reason for the poor SLURM response this past weekend was a massive 9,000-node MPI job that ran for several days. (Cori-KNL has a total of about 9,800 nodes, so this had a serious impact on the rest of us.) Other than that, they reaffirmed that the only strategies for better throughput they can offer are:

  1. Limit [Pilot] jobs to run fewer hours (DC2 jobs request the maximum of 48 hours to be efficient)
  2. Make a reservation (this requires one to specify an exact # nodes and run time, plus up to several days for nodes to drain to make room...if we do not see some job action by tomorrow mid-day, I may request a reservation, although this raises operational issues)

Neither of these two strategies is a silver bullet, and both will end up costing our allocation.

I also mentioned the lack of accuracy in the SLURM job dispatch estimates (in response to the 'sqs' command, for example), the burstiness of SLURM dispatch, the inability to execute a controlled ramp-up of production, and the inability to control the rate at which jobs start. Maybe a fresh look will help solve some of these problems.

TomGlanzman commented 6 years ago

Thursday 1 Mar 2018 Update

While the number of Pilot jobs is decreasing after running out the 48-hour clock, SLURM has not been kind to us in replenishing this effort. There remain dozens of jobs held hostage in the queue, some for over one week. Perhaps Cori is preparing for another mega-job and not allowing our 48-hour jobs to backfill... The question now is whether to request a 'reservation', which itself would take several days to process. From today's standpoint, something like 300-400 nodes for 2-3 days would make a huge dent in the remaining work. But one cannot predict whether or how many of the existing queued jobs may start to run before a reservation could be put in place.

JOBID             ST   USER         NAME         NODES REQUESTED     USED         SUBMIT                   QOS              SCHEDULED_START       FEATURES         REASON                               
10601130          PD   stanier      ikink        2048   24:00:00     0:00         2018-02-28T21:07:31   regular_0           2018-03-02T14:00:00   knl&quad&cache   Resources
10535604          PD   u6338        fullKahuna*  5600   36:00:00     0:00         2018-02-26T13:33:57   regular_0           2018-03-02T14:00:01   knl&quad&cache   Resources
10599218          R    heitmann     test_knowh*  6144   8:00:00      28:45        2018-02-28T19:46:13   regular_0           2018-03-01T13:39:41   knl&quad&cache   None 
10599371          PD   heitmann     test_knowh*  6144   24:00:00     0:00         2018-02-28T19:54:40   regular_0           2018-03-04T02:00:00   knl&quad&cache   Resources
10599382          PD   heitmann     test_knowh*  6144   24:00:00     0:00         2018-02-28T19:55:00   regular_0           2018-03-05T02:00:00   knl&quad&cache   Resources
10599385          PD   heitmann     test_knowh*  6144   24:00:00     0:00         2018-02-28T19:55:07   regular_0           avail_in_~0.1_hrs     knl&quad&cache   Priority
10599394          PD   heitmann     test_knowh*  6144   24:00:00     0:00         2018-02-28T19:55:18   regular_0           avail_in_~0.1_hrs     knl&quad&cache   Priority
10535600          PD   u6338        allKahuna_*  9200   2:00:00      0:00         2018-02-26T13:33:53   regular_0           2018-03-02T11:17:09   knl&quad&cache   Resources

There is a 9,200-node mega-job, but it runs for only 2 hours. The five 6,144-node jobs, however, represent 63% of Cori-KNL, and four of them run for 24 hours each - which may be our biggest competition for the coming days.

TomGlanzman commented 6 years ago

Friday 2 Mar 2018 Update

There are 550 complete visits, basically where we were yesterday. NERSC has ramped us down to a mere 40 nodes. There are 41 Pilots (880 nodes) waiting -- for up to 9 days -- to start running. So, while essentially idle, we wait...

TomGlanzman commented 6 years ago

Weekend 3-4 Mar 2018 Update

TomGlanzman commented 6 years ago

Monday 5 Mar 2018 Update

Run 1.1, although incomplete, is winding down. 713 visits are complete (36%).

Preparations for phoSim testing are underway in #140.

TomGlanzman commented 6 years ago

Monday 2 Apr 2018 Update

Testing in preparation for resuming production is underway. With new phoSim background parameters and updated catalog generation, this new project is dubbed "Run 1.2p". A single visit with a non-production catalog generation is running here.

A NERSC 'reservation' has been requested (100 KNL nodes for 24 hours) to help jump-start this project. If granted, this reservation will begin sometime Thursday, 5 Apr 2018. Hopefully all production code will be in place by then.

TomGlanzman commented 6 years ago

Tuesday 3 Apr 2018 Update

09:45 - The first Run 1.2p test run continues and the first six (of 32) sensors for this visit have completed. A first quick look at the FITS files indicates reasonable image files. (Chris W's worry about commented overrides being interpreted by phoSim did not come to pass; the FITS headers show that those commented background parameters were properly ignored.) There is a ~2x difference in background across the field of view in these first few sensors. Interested parties are invited to look at both 'electron' and 'amplifier/ADC' image files, which continue to populate this directory: /global/projecta/projectdirs/lsst/production/DC2/DC2-R1-2p-WFD-r/output/000000
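For anyone wanting to make the same kind of quick check, a sketch using the astropy command-line helpers (assuming an environment with astropy set up; the sensor name is illustrative, and one can grep the header output for whichever background-related keywords are of interest):

```bash
# List the HDUs and dump the header of one 'electron' image from the test visit
IMG=/global/projecta/projectdirs/lsst/production/DC2/DC2-R1-2p-WFD-r/output/000000/lsst_e_*_R01_S02_E000.fits.gz
fitsinfo   $IMG
fitsheader $IMG | less
```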

11:00 - All 32 sensor-visits for the test visit have completed.

TomGlanzman commented 6 years ago

Friday 6 Apr 2018 Update

DC2 Run 1.2p began yesterday evening! Five single-node 'premium' class SLURM Pilot jobs were submitted to get the ball rolling.

A fortuitous decision to "hold" rather than "cancel" a batch of SLURM jobs from Run 1.1p in February has led to the discovery that, once released, these jobs start very quickly. There are 31 such "held" jobs and they collectively represent 680 nodes, each for 48 hours. Eleven of these 20-node jobs have been released and are now running. A good way to start the production!
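For reference, the hold/release mechanics are plain SLURM commands; a minimal sketch (job IDs are illustrative):

```bash
# Put a pending Pilot job on hold instead of cancelling it
scontrol hold 10414530

# Release it later; the observation above is that released jobs start quickly
scontrol release 10414530

# Release a batch of held jobs in one go
for jid in 10414515 10414517 10414518; do scontrol release "$jid"; done
```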

11:00 status:

  • 16 SLURM Pilot jobs running on 225 hosts (14,400 cores)
  • 115 sensor-visits complete
  • 7350 jobs running

katrinheitmann commented 6 years ago

Excellent news! Glad we gave this a try rather than killing the jobs back then.


TomGlanzman commented 6 years ago

Saturday 7 Apr 2018 Update

Thanks to those old SLURM jobs, production continues to run at a good pace. Fresh jobs submitted yesterday are slowly moving up in the queue. We have a 100-node 24-hour reservation coming up Monday morning at 10am. With all of these factors, production may smoothly continue even as the old jobs terminate.

13:00 status:

TomGlanzman commented 6 years ago

Monday 9 Apr 2018 Update

A productive and reasonably stable weekend; today we will begin again to ramp up the number of nodes. A 100-node 24-hour reservation begins today at 10am.

08:00 status:

TomGlanzman commented 6 years ago

Tuesday 10 Apr 2018 Update

A good 24 hours of normal running, increasing the number of nodes to 760.

07:50 status:

For the WFD y-band workflow, the execution-time distribution for the 622 successful sensor-visits looks like this: (figure: wfd-y)

For the WFD z-band, only 36 of 1009 attempts completed successfully. The execution-time distribution appears below: (figure: wfd-z)

15:00 status: Run 1.2p is losing Cori-KNL nodes; as jobs complete, new jobs are not starting quickly enough to replace them. Thus, only 470 nodes are currently active (although 2461 have been requested via SLURM).

TomGlanzman commented 6 years ago

Wednesday 11 Apr 2018 Update

The ramp-up of Run 1.2p has stalled and even lost momentum, because NERSC is not running jobs that have now been in the queue for over five (5) days. At 09:45 I released the final set of jobs submitted in February (on the 22nd), so we will see whether those three 20-node jobs start running soon.

09:45 status:

TomGlanzman commented 6 years ago

Thursday 12 Apr 2018 Update

As of this morning the pipeline has completely run dry: there are zero jobs running at present. Despite having >3000 nodes of batch jobs in the queues, some of which have been waiting for nearly a week, the NERSC scheduler is not providing computing resources, so production has ground to a full stop.

07:45 status:

09:30 update: The DC2 production has stopped due to full-machine reservations associated with the yearly Gordon Bell challenge, for which the deadline is this coming Sunday. The current set of reservations started this morning at 09:00 and lasts for 27 hours. Hopefully DC2 jobs will again start to run shortly after noon tomorrow (Friday).

Noon update: The full machine reservations were cancelled at the last moment. Jobs have started to run again. We are currently up to 480 nodes ...oops, make that 430 (apparently 50 nodes crashed!)

TomGlanzman commented 6 years ago

Friday 13 Apr 2018 Update

Production has now been running for one week. After a 2-day ramp-down, an 8-hour outage and the false threat of a lengthier (27-hour 'reservation') outage to follow, production quickly ramped back up to just under 500 nodes where it has remained since yesterday noon.

07:40 status:

Note that ~4M NERSC hours have been spent on Run 1.2p thus far and it is nearly 25% complete.

TomGlanzman commented 6 years ago

Saturday 14 Apr 2018 Update

Something bad happened just after midnight this morning which caused 600 nodes to crash (at least, that is what the clues point to). Further, whatever caused this continued to plague new jobs all the way up to 08:00 this morning. The end result is that >81,000 jobs were abnormally terminated. It is a technical challenge simply to roll back such a large number of jobs :( (A NERSC ticket has been submitted to inquire about this episode, but we may not hear back until Monday.)

13:40 status:

TomGlanzman commented 6 years ago

Sunday 15 April 2018 Update

Recovery from yesterday's mishap and, perhaps surprisingly, continued ramp up.

13:50 status:

18:00 status:

TomGlanzman commented 6 years ago

Monday 16 Apr 2018 Update

A surprisingly good weekend -- after recovering from the Saturday morning meltdown.

07:45 status:


Another look at our NERSC allocation... Between 5 April and today, 9,957,846 billable NERSC hours have been consumed. In this time, 54,929 raytrace jobs have been successfully completed, representing about 34% of the total. Assuming that Run 1.2p is the only significant consumer of resources, one might conclude that completing this run might require roughly 30M NERSC hours. Why is this number so high?

Firstly, in addition to the 54,929 successful raytrace jobs, there were also 12,497 long-running jobs that timed out after 48 hours and, thus, represent wasted effort because they will all need to be started again from the beginning. [We are currently running 33 instances of raytrace per Cori-KNL node, thus a single timed-out instance of raytrace consumes 48*96/33 = 140 NERSC hours. 12,497 such jobs thus represent 1.7M NERSC hours, or about 17.5% of the total.]

Secondly, whenever a Pilot job times out (after 48 hours), any partially completed jobs crash and must be restarted (unless that job ran for the full 48 hours, in which case it is ignored at present). Given that Run 1.2p jobs are taking 20 to >48 hours to complete, there is a significant inefficiency, possibly as high as 40%.
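A compact restatement of the accounting above (taking the factor of 96 to be the Cori-KNL charge factor per node-hour, which is my assumption):

```bash
# Back-of-envelope Run 1.2p accounting, using the numbers quoted in the text
awk 'BEGIN {
  hours_used = 9957846                 # billable NERSC hours, 5-16 April
  frac_done  = 0.34                    # fraction of raytrace jobs completed
  printf "Projected total for Run 1.2p : ~%.0fM NERSC hours\n", hours_used/frac_done/1e6

  per_timeout = 48*96/33               # NERSC hours per timed-out raytrace instance
  timed_out   = 12497
  wasted      = per_timeout*timed_out
  printf "Per timed-out raytrace       : ~%.0f NERSC hours\n", per_timeout
  printf "Wasted on time-outs          : ~%.1fM NERSC hours (%.1f%% of hours used)\n",
         wasted/1e6, 100*wasted/hours_used
}'
```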