Run 3 Long Jobs with multi-threaded PhoSim 3.6

drphilmarshall commented 7 years ago

@LSSTDESC/twinkles PhoSim v3.6 is out! The release email from the PhoSim team is pasted in below for our records. Congratulations, and a big thank you, to @johnrpeterson et al :-)

@TomGlanzman what timetable do you suggest we follow for running the remaining Twinkles 1 Run 3 "long jobs" with PhoSim 3.6, with the same commands and configuration as you used in Run 3.1, 3.2, and 3.3? I guess we'll need to check that the results from v3.6 match those from v3.5 in some of the short visits...

Dear DESC members-

I am pleased to announce that we have completed and validated the twelfth major release of the Photon Simulator-- PhoSim v3.6. The release combines a number of usability, software engineering, fidelity, and validation improvements and extends our ab initio photon/electron physics simulation tool of optical survey telescopes capable of running efficiently both in single user mode as well as on high throughput computing platforms and grids.

There are a number of new improvements in this version listed below, but one of the most exciting features is the enabling of multi-threading capabilities. PhoSim was always parallelizable at the individual chip level, but now the simulation of individual chips can be multi-threaded. This means that individual astrophysical sources will be simulated in parallel on multiple threads. If you have roughly N cores to run roughly N threads, then you will find that it comes close to finishing N times faster! This also amazingly requires the same amount of memory as 1 thread.

This has two main uses. On large scale computing, we will be using this for data challenges. The total number of CPU core hours is conserved, but the whole calculation becomes much more efficient. We can take advantage of modern hardware architectures that would prefer multiple threads executing, reducing the total CPU time per job and not growing the memory per core. The NERSC & SLAC teams regard this capability as essential to execute the DC2 & DC3 data challenges.

The second use is on individual laptops or desktops. In the single user mode, most simulations are usually a detailed simulation of 1 (or a few) chip(s). With a realistic catalog which covers at least one whole chip, this can often take 1/2 hour (w/o background) and a few hours (w/ background). However, with multi-threading this can go close to N times faster! To utilize multi-threading you just have to add "-t N", where N is the number of threads, to your usual PhoSim command line. The optimal number for N is a fun project in itself, but in general it is optimal to roughly use the number of cores (real or hyperthreaded) you actually have on your laptop/desktop (e.g. with a mac you can type "sysctl -n hw.ncpu"). My hope is that a little bit of multi-threading will change the way everyone interacts with PhoSim-- instead of launching PhoSim at the end of the day and looking at the images the next day, you can play around with it in real-time much more.
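
As a rough illustration of the advice above, the thread count can be taken from the number of logical cores and passed through "-t N". This is only a sketch: the "-t" option is the one described in the email, while the executable path `./phosim` and the catalog name `my_catalog.txt` are placeholders that will differ per installation.

```python
import os
import shlex


def phosim_command(catalog, threads=None, phosim="./phosim"):
    """Build a PhoSim argument list that requests `threads` threads.

    Only the "-t N" option comes from the release email; the executable
    path and the catalog file name are placeholders (assumptions).
    """
    if threads is None:
        # os.cpu_count() counts logical cores (hyperthreads included) and
        # may return None on unusual platforms, so fall back to 1.
        threads = os.cpu_count() or 1
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return [phosim, catalog, "-t", str(threads)]


if __name__ == "__main__":
    argv = phosim_command("my_catalog.txt")
    # Print the command instead of running it, so the sketch is safe to try.
    print(shlex.join(argv))
```

Printing the command rather than executing it keeps the sketch independent of any particular PhoSim installation; the list can be handed to `subprocess.run` once the paths are real.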

In addition to this capability, a number of other items have been addressed-- a more accurate turbulence outer scale distribution leading to better differential astrometry and PSF ellipticity patterns (https://lsst.rcac.purdue.edu/doc/astrometry.pdf), more accurate zodiacal and airglow emission, more accurate sensor physics including tree rings and brighter-fatter physics, fixes for contamination, tools for quick catalog creation and ray visualization (in phosim/tools), and validation of the 33,250 coefficients of the optical path difference maps of every controlled degree of freedom of the camera telescope system (tilt/shift/bending mode). We are also continuing to expand the functionality for simulating systems other than LSST. The documentation on the website has been redone in the past few weeks as well.

This particular version was a result of a lot of work by the PhoSim Team, especially Colin Burke, Jun Cheng, En-Hsin Peng, Glenn Sembroski, Matt Wiesner and a lot of support from the LSST System Engineering team, especially George Angeli, Chuck Claver, Bo Xin, Sandrine Thomas.

Some specific details follow:

1) You can get all documentation and download the release version here:

https://bitbucket.org/phosim/phosim_release/

2) Other ways to get or produce PhoSim data:

A) You can get calibration images (flats, darks, biases) for this version and large-scale science images here (which take a while to generate on your own). We haven't completed the v3.6 runs, but they will appear there as well. Go to globus.org and set up an account. Then use the endpoint: purduelsst#phosim and then go to /depot/lsst/phosim/data/

B) If you don't want to install phosim, you can get access to phosim via a simple web interface that is attached to 50,000+ cores on open science grid/diagrid here: www.diagrid.org (then create an account and select the phosim tool)

C) In addition to running phosim on your laptop/desktop, if you have a cluster or grid (i.e. OSG) with condor installed, run phosim as usual but simply use the "-g condor" option. This enables massive parallel simulations with minimal effort on your part.

D) Please also contact me, if you need a particular simulation run with PhoSim and do not want to learn how to use PhoSim. Many people have the expertise to run off a variety of simulations for you.

3) Specific new items in the v3.6 release:

Multi-threading capability

Validation of sensitivity matrix

Fully separated design from perturbations with perturbation mode command

Implementation of default LSST perturbations in file & move commands

General architectural simplification, software standards

Generic parabolic telescope (perfect telescope) ISC files

Simplification to minimal ISC file set

DES telescope ISC files

Tool for ray visualization

Tool for SED matching

Tool for quick catalog creation

Removal of unnecessary setup

Relative opacity generalization

Optical design update

WCS generic

Random number simplification and other helper improvements

Fix to zodiacal light and airglow

Generalized SED band

Adjustment to outerscale distribution

Tool for ZEMAX for body commands

Brighter-fatter anisotropy

Complex tree rings

Correlation coefficient validation

More functionality for complex focal planes

Fix for contamination absorption

4) Feedback and future:

As always, please file tickets for bugs at the site and/or email me any feedback directly for more complicated questions. v3.7 is mostly completed and will be done soon (early 2017), v3.8 is being worked on, and we will issue at least one patch for v3.6.1 to fix some outstanding overdue issues but may do more depending on your feedback.

Best Regards,

John

TomGlanzman commented 7 years ago

Yes, it would seem a validation is in order. There was nothing in the announcement regarding the implementation of multi-threading, so it will be interesting to see how the total workload is divided between threads (per photon? per source? something else?) and how the total execution time changes for some of our lengthiest visits. Heather has already offered to install the new code at SLAC & NERSC so we should be able to start working with v3.6 soon.

The work on checkpointing is coming along and I would like to get that project to a stage that it could be integrated into a workflow quickly. Note that dmtcp advertises full support for multi-threaded applications so this work will hopefully still apply with phoSim v3.6.

drphilmarshall commented 7 years ago

Excellent! I got the impression that the parallelization was per source, but it'd be good to check this in the v3.6 documentation. I had a quick look but there's no PIN about multithreading on the wiki, and the walkthrough has not been updated yet. Which source files should we be reading to understand how things work, John?

johnrpeterson commented 7 years ago

It's multithreaded on a per-source basis (the photon level doesn’t work as there is too much thread divergence and inefficiency). this is fine because, for example, the background is made up of thousands of sources.

all you have to do is have “-t N” where N is the number of threads. i will update the wiki documentation about this in a bit.

brianv0 commented 7 years ago

Hi John,

Have you tested it with 48 or more threads? Cori Phase II at NERSC supports up to 272 hardware threads (4 per core), so it'd be interesting to see if we can leverage that.

heather999 commented 7 years ago

phosim v3.6 is now available on Cori: /global/common/cori/contrib/lsst/phosim/v3.6 To use, you will want to "source /global/common/cori/contrib/lsst/phosim/setupPhosim.sh" to adjust the modules loaded on Cori and then carry on to run phosim as you typically would

johnrpeterson commented 7 years ago

yeah, i think en-hsin did tests with either 24 or 48 threads on a cluster here at Purdue. personally, i usually just do 4 or 8 on my laptop. at some point there will be diminishing returns from the non-threaded setup, but that might be around 48 anyways, is my guess.

john
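
The diminishing returns mentioned here are what Amdahl's law predicts whenever part of a job (such as a non-threaded setup phase) stays serial. The sketch below tabulates the ideal speedup for a few thread counts. The 2% serial fraction is purely illustrative and is not a measured PhoSim number.

```python
def amdahl_speedup(n_threads, serial_fraction):
    """Ideal speedup on n_threads when serial_fraction of the work is not threaded."""
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_threads)


if __name__ == "__main__":
    serial_fraction = 0.02  # hypothetical value, chosen only for illustration
    for n in (1, 4, 8, 24, 48, 272):
        speedup = amdahl_speedup(n, serial_fraction)
        print(f"{n:4d} threads: speedup {speedup:5.1f}x, "
              f"per-thread efficiency {speedup / n:4.0%}")
```

With any non-zero serial fraction the speedup is bounded by 1/serial_fraction, so the useful thread count depends on how long the setup takes relative to the raytrace.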

cwwalter commented 7 years ago

> It's multithreaded on a per-source basis (the photon level doesn’t work as there is too much thread divergence and inefficiency). this is fine because, for example, the background is made up of thousands of sources.

Hi John,

If it is per source, do you mean they are done one-by-one and then (in principle) added later at nearby positions? Let's say there was a galaxy and a star: do you do anything to make sure that BF still works? If you added all the light from the galaxy and then the star afterwards, the effect would be lost, right? Just trying to understand exactly what you mean.

-Chris

johnrpeterson commented 7 years ago

chris-

it's even better than that. so say you have two bright sources that are overlapping like you are imagining and then you do 2 threads. what will happen is that it will be simulating photons from both at the same time on two different cores, but whenever an electron is collected it will update the collected electron image while it's going. the other thread will then get to see the e-field from those new electrons during its simulation. so it really shouldn’t make any difference whatsoever even in the case of brighter-fatter.

we have redone all the thousands of integration tests with 4 threads instead of the usual 1 and i haven’t noticed any changes in results, so we should all be ok. (in fact, given the speed ups, we probably will always run multi-threaded validation runs from now on). but if anyone notices anything strange please let me know.

john
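
The sharing pattern described in this comment (one thread per source, all threads depositing into a single electron image that the others can read) can be sketched in a few lines. This is a toy model under stated assumptions, not PhoSim's actual code: the class `ElectronImage`, the function names, the 16x16 grid, and the random one-pixel scatter are all invented for illustration, and CPython threads will not give a real speedup here because of the global interpreter lock.

```python
import random
import threading
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT = 16, 16  # toy sensor size (hypothetical)


class ElectronImage:
    """Shared grid of collected electrons, guarded by a single lock."""

    def __init__(self, width=WIDTH, height=HEIGHT):
        self.pixels = [[0] * width for _ in range(height)]
        self._lock = threading.Lock()

    def collect(self, x, y):
        """Deposit one electron; return the charge that was already in the pixel."""
        with self._lock:
            already_there = self.pixels[y][x]  # includes other threads' electrons
            self.pixels[y][x] = already_there + 1
        return already_there

    def total(self):
        with self._lock:
            return sum(sum(row) for row in self.pixels)


def simulate_source(image, source, seed):
    """Trace every photon of one source; each call runs on one worker thread."""
    x0, y0, n_photons = source
    rng = random.Random(seed)  # per-source generator, never shared between threads
    peak_charge_seen = 0
    for _ in range(n_photons):
        x = min(max(x0 + rng.choice((-1, 0, 1)), 0), WIDTH - 1)
        y = min(max(y0 + rng.choice((-1, 0, 1)), 0), HEIGHT - 1)
        # A physical model would use this charge to deflect later electrons;
        # the toy only records the largest value it observed.
        peak_charge_seen = max(peak_charge_seen, image.collect(x, y))
    return peak_charge_seen


def run(sources, n_threads):
    """Simulate all sources on n_threads workers sharing one image."""
    image = ElectronImage()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(simulate_source, image, source, seed)
                   for seed, source in enumerate(sources)]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return image


if __name__ == "__main__":
    overlapping = [(5, 5, 2000), (6, 5, 3000)]  # (x, y, photons), made-up values
    print(run(overlapping, n_threads=2).total())
```

In this toy the landing position never depends on the charge already collected, so the final image is identical for any thread count. Once the charge feeds back into the photon path, the order in which threads interleave can change individual pixels, which is one reason to compare threaded and unthreaded outputs statistically rather than byte for byte.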

cwwalter commented 7 years ago

> it's even better than that. so say you have two bright sources that are overlapping like you are imagining and then you do 2 threads. what will happen is that it will be simulating photons from both at the same time on two different cores, but whenever an electron is collected it will update the collected electron image while it's going. the other thread will then get to see the e-field from those new electrons during its simulation. so it really shouldn’t make any difference whatsoever even in the case of brighter-fatter.

That's great. Thanks.

TomGlanzman commented 7 years ago

Hi John,

I am interested in learning a bit about the control flow of the new phoSim. Is there a document or flow chart that might provide an overview?

Tom

johnrpeterson commented 7 years ago

Tom, i don’t have a document, but basically the multithreading happens only in the core raytrace calculation and doesn’t have anything to do with the overall phosim workflow.

you will want to still run phosim with the condor option and then use Glenn’s script to convert it to NERSC or SLAC job submission commands. the only difference is you use the “-t N” option for the phosim invocation to send the signal to the jobs to be threaded. Glenn was just testing to see if it still works with his script, but if not we will let you know and update with the new version.

john

TomGlanzman commented 7 years ago

Some Twinkles validation of the new phoSim version 3.6.0 has been produced. Using exactly the same configuration as for the main Twinkles workflow task (TW-phoSim-r3), I have created two new workflow tasks: TW-phoSim-r3-MT (using phoSim v3.6.0 and running with four (4) threads); and, TW-phoSim-r3-noMT (using phoSim v3.6.0 with no multi-threading). A total of ten (10) visits were processed with TW-phoSim-r3-MT and five (5) visits with TW-phoSim-r3-noMT. Links to the workflow:

TW-phoSim-r3 TW-phoSim-r3-MT TW-phoSim-r3-noMT

While the multi-threaded phoSims were running, I was able to confirm the ongoing creation of up to four extra execution threads using a combination of tools (ps and top via lsrun, and farmrtmweb). I have not attempted to test different numbers of threads.

Timing results:
The multi-threaded phoSim appears to live up to its marketing! To summarize the ten+ten runs:

Average wall clock time ratio (v3.5.3/v3.6.0) = 4.1

Average CPU time ratio (v3.5.3/v3.6.0) = 1.4

Average job efficiency v3.5.3 (CPU/wall-clock) = 88%

Average job efficiency v3.6.0 (CPU/wall-clock) = 66%

Note that there are situations when running large productions in which seemingly random jobs will exhibit unusual CPU and/or wall-clock times. This can be due to various reasons, such as transient I/O bottlenecks to a needed storage server; competing jobs on the batch host hogging critical resources; or other transient outages.

Part of the reason the v3.6.0 job efficiency took a hit is that during the phoSim execution, threads are continually being created and killed. Sometimes, not all four execution threads are fully utilized for short periods of time. Part of this is likely due to the overhead of thread management, and part may be due to phoSim design. This loss of efficiency is offset by the reduction in total CPU time -- which I find slightly mysterious. In any event, the net savings in wall-clock time is significant. Congratulations to the phoSim team!
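
The four timing numbers can be cross-checked against each other. The sketch below assumes that the v3.6.0 efficiency is normalized by the four threads, i.e. CPU time / (wall-clock time x threads); the comment does not say so explicitly, but a raw CPU/wall-clock ratio below 100% would be hard to reconcile with a 4.1x wall-clock gain. Under that assumption the two ratios and the v3.5.3 efficiency imply roughly 64%, close to the reported 66% (a ratio of averages need not equal an average of ratios). The function names are illustrative.

```python
def per_core_efficiency(cpu_time, wall_time, n_threads=1):
    """CPU time as a fraction of the wall-clock time available on n_threads cores."""
    if wall_time <= 0 or n_threads < 1:
        raise ValueError("wall_time must be positive and n_threads at least 1")
    return cpu_time / (wall_time * n_threads)


def implied_threaded_efficiency(old_efficiency, wall_ratio, cpu_ratio, n_threads):
    """Efficiency of the threaded run implied by the old run and the time ratios.

    wall_ratio and cpu_ratio are old/new, as quoted in the timing summary:
    new_cpu = old_cpu / cpu_ratio and new_wall = old_wall / wall_ratio.
    """
    if cpu_ratio <= 0 or n_threads < 1:
        raise ValueError("cpu_ratio must be positive and n_threads at least 1")
    return old_efficiency * wall_ratio / (cpu_ratio * n_threads)


if __name__ == "__main__":
    implied = implied_threaded_efficiency(0.88, wall_ratio=4.1, cpu_ratio=1.4,
                                          n_threads=4)
    print(f"implied v3.6.0 per-core efficiency: {implied:.0%} (reported: 66%)")
```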

(Some raw timing data comparing these 20 runs appear in this Google sheet)

Data Product Comparison:
Each of these production and test runs produces only two output files: centroid (text) and image (FITS). A three-way comparison (using 'diff' for the text and 'fdiff' for the FITS files) indicated that none of the file combinations were identical in any sense of the term. The v3.6.0 FITS files appear to have had some changes to the headers, but there are also significant differences in the body. The centroid files are quite different -- even with a few extra lines appearing.
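
Because 'diff' flags any byte-level change, a tolerance-based comparison may say more about how different the centroid files really are. The sketch below is a minimal example under stated assumptions: it treats a centroid file as a whitespace-separated table with one header line and a source identifier in the first column, which should be checked against the actual files. The function names and the default tolerance are illustrative only.

```python
import math


def read_table(path, header_lines=1):
    """Read a whitespace-separated text table into {first_column: [floats]}.

    The layout (one header line, identifier first, numeric columns after)
    is an assumption about the centroid files, not a documented format.
    """
    rows = {}
    with open(path) as handle:
        for line_number, line in enumerate(handle):
            fields = line.split()
            if line_number < header_lines or not fields:
                continue
            rows[fields[0]] = [float(value) for value in fields[1:]]
    return rows


def compare_tables(left, right, rel_tol=1e-3):
    """Return (ids only in left, ids only in right, shared ids whose values differ)."""
    only_left = sorted(left.keys() - right.keys())
    only_right = sorted(right.keys() - left.keys())
    differing = sorted(
        key for key in left.keys() & right.keys()
        if len(left[key]) != len(right[key])
        or not all(math.isclose(a, b, rel_tol=rel_tol)
                   for a, b in zip(left[key], right[key]))
    )
    return only_left, only_right, differing
```

Calling `compare_tables(read_table(a), read_table(b))` on two centroid files separates the extra lines (identifiers present in only one file) from sources that merely shifted. If the runs used different random seeds, photon counts should scatter at about the Poisson level, so the tolerance has to be chosen with that in mind.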

Could these differences be attributed simply to random number seeds? Or other changes/features in the v3.6.0 release? Others with an interest in the difference details are invited to take a look for themselves. The files are at SLAC, e.g., for visit "000000":

TW-phoSim-r3: /nfs/farm/g/desc/u1/Pipeline-tasks/TW-phoSim-r3/phosim_output/000000/R22_S11/output

TW-phoSim-r3-MT: /nfs/farm/g/desc/u1/Pipeline-tasks/TW-phoSim-r3-MT/phosim_output/000000/R22_S11/output

TW-phoSim-r3-noMT: /nfs/farm/g/desc/u1/Pipeline-tasks/TW-phoSim-r3-noMT/phosim_output/000000/R22_S11/output

Note that the visit index, "000000", may be replaced by "000001" through "000009" for TW-phoSim-r3-MT, and through "000004" for TW-phoSim-r3-noMT.
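
For anyone scripting a comparison, the directory pattern above can be generated rather than typed. The sketch below builds the output path for a task and visit from the paths listed in this comment. It assumes that visits "000000" through "000004" also exist under the main TW-phoSim-r3 task, which the comment implies but does not state. The helper names are hypothetical.

```python
from pathlib import PurePosixPath

ROOT = PurePosixPath("/nfs/farm/g/desc/u1/Pipeline-tasks")

# Number of validation visits per task, as stated in the comment above.
VALIDATION_VISITS = {"TW-phoSim-r3-MT": 10, "TW-phoSim-r3-noMT": 5}


def output_dir(task, visit, sensor="R22_S11"):
    """Return the output directory for one task and zero-padded visit index."""
    if visit < 0:
        raise ValueError("visit index must be non-negative")
    return ROOT / task / "phosim_output" / f"{visit:06d}" / sensor / "output"


def three_way_visits():
    """Yield (visit, {task: directory}) for visits available in all three tasks.

    Assumes the main TW-phoSim-r3 task contains the same low visit indices.
    """
    tasks = ["TW-phoSim-r3", *VALIDATION_VISITS]
    for visit in range(min(VALIDATION_VISITS.values())):
        yield visit, {task: output_dir(task, visit) for task in tasks}


if __name__ == "__main__":
    for visit, directories in three_way_visits():
        for task, directory in directories.items():
            print(f"{visit:06d} {task}: {directory}")
```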

There was one configuration hiccup associated with v3.6.0. A new dependency on the phoSim installation's data/sky directory suddenly appeared and required the 'sky' directory to be placed adjacent to the (staged) copy of the SEDs. Perhaps John and co. could comment on whether this is a bug or a feature?

Please feel free to add comments to this issue thread.

LSSTDESC / Twinkles

Run 3 Long Jobs with multi-threaded PhoSim 3.6 #420