LSSTDESC / DC2-production

Configuration, production, validation specifications and tools for the DC2 Data Set.
BSD 3-Clause "New" or "Revised" License

PhoSim v3.7.14: performance and validation #163

Closed: TomGlanzman closed this issue 5 years ago

TomGlanzman commented 6 years ago

This issue discusses performance and validation of a new PhoSim release, v3.7.14. The motivation for this work is to recover the remaining ~10,000 long-running sensor-visits in Run 1.2p. As background, the following is excerpted from Issue #65, the Run 1.2p operations log:


Recent work by Adrian Pope and Tom Uram at Argonne motivated the release of phoSim v3.7.14, which has improved thread scaling. Recall that DC2 production -- so far -- has been limited to 8 threads in the raytrace step due to internal code inefficiencies when using more threads [ref work last Fall by Glenn and Tom]. As a performance and validation check, a single DC2 Run 1.2p visit has been run with the latest phoSim code. This note summarizes the tests.

The executive summary is that the code changes do, indeed, allow for efficient running of many threads, thereby enabling us to attempt recovering the ~10,000 remaining sensor-visits in DC2 Run 1.2p. There are some differences between the 8-thread (v3.7.9) and 54-thread (v3.7.14) images. Although these differences do not seem qualitatively significant, it would be good if others could take a more quantitative look. [Ref]


Discussion of performance and validation is welcome here.

TomGlanzman commented 6 years ago

This discussion was started in issue #65. To pick up the thread on performance, @adrianpope says:

> I'm not exactly sure how pthreads work with the compute node kernel, slurm, and affinity settings, but it is possible that pthreads are not "pinned" to a particular set of cores/HW-threads, and they may be allowed to migrate around the hardware resources. In this case, if there are fewer than 5 phosim instances running on a node, the pthreads from the remaining instances might be allowed to "spread out" and run a bit more quickly (though possibly less efficiently overall). I don't know as much about the runtime rules for pthreads on XC40/KNL as I do about OpenMP threads, so I'm trying to chat with Intel folks to learn more about how to control and measure this.

Indeed, pthreads are not, by default, pinned to specific cores or HW threads. I did some work with 'taskset' last year after watching a node full of (multiple instances) x (multiple threads) using 'top'. My observation is that threads seem to move freely about the node's cores, independent of how many instances are running. By locking an instance of, say, 8 threads to a specific set of cores, there is a definite performance improvement.
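
For reference, here is a minimal sketch of that kind of pinning: several raytrace instances are launched, each restricted to its own block of logical cores with 'taskset'. The command line, instance count, and core layout are placeholders, not the actual Run 1.2p setup.

```python
import subprocess

# Illustrative only: "./raytrace work_dir_i" stands in for the real raytrace
# command line; the instance count and threads-per-instance are assumptions.
N_INSTANCES = 5   # raytrace instances per node
THREADS = 8       # threads per instance

procs = []
for i in range(N_INSTANCES):
    first = i * THREADS
    last = first + THREADS - 1
    # taskset restricts the process (and all threads it spawns) to the given
    # range of logical cores, so its pthreads cannot migrate across the node.
    cmd = ["taskset", "-c", f"{first}-{last}", "./raytrace", f"work_dir_{i}"]
    procs.append(subprocess.Popen(cmd))

for p in procs:
    p.wait()
```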


> I'm also not exactly sure how the workflow works, but I had assumed that it tried to keep the same number of phosim instances active on a compute node until hitting the wall clock limit, e.g. when one phosim instance on a node finishes, the workflow finds the next sensor visit from a task queue and launches it on the compute node, so very little time is spent with fewer than the expected number of phosim instances active.

Yes, when there are more jobs to process than available processing billets, the workflow will manage the resources and submit new jobs as old ones complete. If there is a homogeneous set of jobs in the workflow, e.g., exclusively 8-thread raytrace jobs, then the workflow will, in effect, attempt to keep the same number of jobs running on each node at all times. However, for an inhomogeneous workload, e.g., a mix of 8-thread raytrace plus single-thread 'trim' or bookkeeping steps, the workflow simply attempts to saturate the resources of the node (memory and cores) any way it can.
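
To illustrate the idea only (this is not the actual workflow engine code, just a toy sketch of saturating a node by cores and memory, with hypothetical job shapes and node capacities):

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    cores: int      # e.g. 8 for a raytrace job, 1 for a trim/bookkeeping step
    mem_gb: float

# Node capacities below are illustrative placeholders, not the real node specs.
NODE_CORES, NODE_MEM_GB = 64, 96

def fill_node(queue, running):
    """Greedily start queued jobs while the node still has spare cores and memory."""
    used_cores = sum(j.cores for j in running)
    used_mem = sum(j.mem_gb for j in running)
    for job in list(queue):
        if used_cores + job.cores <= NODE_CORES and used_mem + job.mem_gb <= NODE_MEM_GB:
            queue.remove(job)
            running.append(job)
            used_cores += job.cores
            used_mem += job.mem_gb
    return running
```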

For my v3.7.14 tests, there were 26 sensor-visits to process, so five (5) nodes were requested in an attempt to complete the test ASAP. No more than five instances of phoSim's raytrace ran on a single node at any given time, but as instances completed there were, in general, no new ones to fill the newly empty billets. The entire processing period lasted about 12 hours.

cwwalter commented 6 years ago

Hi All,

Do you have a sense of how much of the improvement was due to the explicit code changes you made and how much came from using the Intel compiler? I'm wondering how much using the Intel compiler in other contexts would help.

adrianpope commented 6 years ago

I think the entirety of the improvement in the thread-scaling of raytrace is due to source code changes. The Intel compiler seems to speed up the single-threaded start-up region in raytrace by a noticeable amount on KNL, but the performance differences between the GNU and Intel compilers in the threaded region seem to be pretty small for v3.7.14 on KNL.

I don't think it's easy to predict which compiler will do what for an arbitrary code on different architectures, so my advice would be to compile with both and run some semi-realistic performance comparisons if possible.
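
For example, a minimal sketch of such a comparison could simply time the same application, built once with each compiler, across a range of thread counts. The binary names and the '-t' thread flag below are assumptions for illustration, not PhoSim's actual interface:

```python
import subprocess
import time

# Hypothetical paths to the same code built with two compilers.
BINARIES = {"gcc": "./raytrace_gcc", "intel": "./raytrace_intel"}
THREAD_COUNTS = [1, 2, 4, 8, 16, 32, 54]

for label, exe in BINARIES.items():
    for n in THREAD_COUNTS:
        t0 = time.time()
        # '-t N' is a placeholder for however the real application sets its
        # thread count.
        subprocess.run([exe, "-t", str(n)], check=True)
        print(f"{label:6s} {n:3d} threads: {time.time() - t0:8.1f} s")
```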

Part of my bias toward Intel compilers in this particular context is that I'm trying to profile the code with Intel VTune, and this is somewhat easier to do with Intel compilers, though it is certainly also possible with GNU compilers.

TomGlanzman commented 6 years ago

@cwwalter here is a plot from last September (using a beta version of phoSim on a Haswell node) that shows the curious behavior of gcc vs. Intel for various numbers of threads:

[plot: raytrace performance, gcc vs. Intel, for various numbers of threads]

And if 'taskset' is added in (processor affinity), the change becomes more interesting:

[plot: the same comparison with 'taskset' processor affinity]

Of course, these plots will have changed with 3.7.14.

heather999 commented 6 years ago

Hi, Tom asked if I could run the same visit, simulated with both v3.7.9 and v3.7.14, through what we sometimes call the mini-DRP, which involves ingesting the images and running processEimage followed by makeFpSummary. I've done that using the same reference catalog (/global/projecta/projectdirs/lsst/groups/SSim/DC2/reference_catalogs/dc2_reference_catalog_dc2v3_fov4.txt) and the w_2018_14 version of the stack plus the LSSTDESC copy of obs_lsstSim.

The resulting files are all available for v3.7.9 (/global/cscratch1/sd/desc/DC2/w_2018_14/test-phosim/output-v3.7.9) and v3.7.14 (/global/cscratch1/sd/desc/DC2/w_2018_14/test-phosim/output-v3.7.14). Log files can be found in /global/cscratch1/sd/desc/DC2/w_2018_14.

It may be useful for experts to take a look at the processEimage and makeFpSummary outputs. A quick visual check can be made from the png files produced by makeFpSummary:

[images: v12449-fy focal plane summaries for v3.7.9 and v3.7.14]

fjaviersanchez commented 6 years ago

Some plots comparing the outputs from 3.7.9 and 3.7.14:

The background in 3.7.14 is 6% lower than in 3.7.9. In the plot below I am showing 3.7.9_pixel_value/3.7.14_pixel_value - 1:

[plot: per-pixel fractional difference, 3.7.9 vs. 3.7.14]
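
For anyone who wants to reproduce this check, here is a minimal sketch of the per-pixel comparison with astropy. The file names are illustrative; any sensor-visit present in both outputs would do:

```python
import numpy as np
from astropy.io import fits

# Illustrative paths: the same sensor-visit simulated with each PhoSim version.
img_379 = fits.getdata("v3.7.9/lsst_e_12449_f5_R30_S20_E000.fits.gz").astype(float)
img_3714 = fits.getdata("v3.7.14/lsst_e_12449_f5_R30_S20_E000.fits.gz").astype(float)

# Per-pixel fractional difference, 3.7.9_pixel_value / 3.7.14_pixel_value - 1,
# masking pixels where the denominator is zero.
mask = img_3714 > 0
frac_diff = np.full(img_379.shape, np.nan)
frac_diff[mask] = img_379[mask] / img_3714[mask] - 1.0

print("median fractional difference:", np.nanmedian(frac_diff))
```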

The overall distribution of magnitude looks a little bit fainter in 3.7.14, but I think that might be because of the difference in the background level:

[plot: magnitude distributions in 3.7.9 and 3.7.14]

Then I did a spatial match of the objects' centroids, and the astrometry looks compatible:

[plots: astrometric offsets of the matched centroids]

Finally, I checked the measured magnitude difference (mag_3.7.9 - mag_3.7.14) as a function of the magnitude measured in 3.7.9, and this is what I get:

[plot: magnitude difference vs. magnitude measured in 3.7.9]

The error bars are the standard deviation of the difference divided by the square root of the number of entries in that bin; the big points are the mean in each bin, and the small points are the values for each of the matched objects. Below is the histogram of the difference:

[histogram: magnitude difference between matched objects]
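
A minimal sketch of the binned statistic described above, assuming arrays of matched magnitudes from the two runs (the array names are placeholders):

```python
import numpy as np

def binned_mag_diff(mag_379, mag_3714, nbins=20):
    """Mean and error of (mag_3.7.9 - mag_3.7.14) in bins of the 3.7.9 magnitude.

    The error in each bin is the standard deviation of the difference divided
    by the square root of the number of entries in that bin.
    """
    dmag = mag_379 - mag_3714
    edges = np.linspace(mag_379.min(), mag_379.max(), nbins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    means, errors = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sel = (mag_379 >= lo) & (mag_379 < hi)
        n = sel.sum()
        means.append(dmag[sel].mean() if n else np.nan)
        errors.append(dmag[sel].std() / np.sqrt(n) if n else np.nan)
    return centers, np.array(means), np.array(errors)
```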

sethdigel commented 6 years ago

Here are ratio images (3.7.14/3.7.9) of the y-band sensor images. I selected 16 of the 17 sensors for which Tom's images are available for both versions of PhoSim. The images are displayed in a 4x4 grid below; this is just for convenience, not to represent how they are situated with respect to each other. I did not filter the images, just binned them down by ~32x. The scaling range is 0.94 to 0.965.

[image: 4x4 grid of binned ratio maps, 3.7.14/3.7.9]
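
For reference, a minimal sketch of this kind of binned ratio map with numpy/astropy (the file names and block size are illustrative):

```python
import numpy as np
from astropy.io import fits

def binned_ratio(path_3714, path_379, block=32):
    """Ratio image 3.7.14/3.7.9, averaged over block x block pixel tiles."""
    a = fits.getdata(path_3714).astype(float)
    b = fits.getdata(path_379).astype(float)
    ratio = np.full(a.shape, np.nan)
    np.divide(a, b, out=ratio, where=b > 0)   # avoid division by zero
    # Trim so the image divides evenly into blocks, then average each tile.
    ny = (ratio.shape[0] // block) * block
    nx = (ratio.shape[1] // block) * block
    tiles = ratio[:ny, :nx].reshape(ny // block, block, nx // block, block)
    return np.nanmean(tiles, axis=(1, 3))

binned = binned_ratio("v3.7.14/lsst_e_12449_f5_R30_S20_E000.fits.gz",
                      "v3.7.9/lsst_e_12449_f5_R30_S20_E000.fits.gz")
print("average ratio:", np.nanmean(binned))
```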

I don't think that the ratio maps necessarily represent a problem. That said, I could not say why the spatial structures in the ratios look the way that they do.

Here is a list of the file names and the average ratios 3.7.14/3.7.9; they are displayed from left to right, top to bottom in the image above. The 3.7.14 images are typically a few percent less bright than the 3.7.9 images. This may or may not be consistent with Javier's finding.

output/000006/lsst_e_12449_f5_R30_S20_E000.fits.gz 0.945123
output/000006/lsst_e_12449_f5_R30_S21_E000.fits.gz 0.945798
output/000006/lsst_e_12449_f5_R30_S22_E000.fits.gz 0.956791
output/000006/lsst_e_12449_f5_R31_S20_E000.fits.gz 0.947581
output/000006/lsst_e_12449_f5_R41_S01_E000.fits.gz 0.945655
output/000006/lsst_e_12449_f5_R41_S02_E000.fits.gz 0.958073
output/000006/lsst_e_12449_f5_R41_S11_E000.fits.gz 0.947830
output/000006/lsst_e_12449_f5_R41_S12_E000.fits.gz 0.948938
output/000006/lsst_e_12449_f5_R41_S20_E000.fits.gz 0.947632
output/000006/lsst_e_12449_f5_R42_S00_E000.fits.gz 0.944200
output/000006/lsst_e_12449_f5_R42_S01_E000.fits.gz 0.964168
output/000006/lsst_e_12449_f5_R42_S10_E000.fits.gz 0.946602
output/000006/lsst_e_12449_f5_R42_S20_E000.fits.gz 0.949912
output/000006/lsst_e_12449_f5_R42_S21_E000.fits.gz 0.953382
output/000006/lsst_e_12449_f5_R43_S10_E000.fits.gz 0.949302
output/000006/lsst_e_12449_f5_R43_S21_E000.fits.gz 0.944431

fjaviersanchez commented 6 years ago

@sethdigel are the images with the diagonal structure the ones corresponding to the sensors close to the edge of the focal plane? And is the order of your 4x4 grid (left to right, top to bottom) the same as the list of paths you include in your post? I'll re-check the ratio.

cwwalter commented 6 years ago

@fjaviersanchez Are you comparing full focal planes? Due to the vignetting, if you were comparing something mostly on the outside of the focal plane to something mostly inside, I think you would see an overall difference. Or is this an apples-to-apples, sensor-by-sensor comparison?

fjaviersanchez commented 6 years ago

So for the first histogram I was using the ratio between R30_S20 in 3.7.9 and 3.7.14. I don't know how I read 25%; I was still jet-lagged, I guess... I am finding 6%, consistent with @sethdigel's findings.

For the magnitude histogram I take the full datasets, so it's not completely apples to apples because there are more visits processed for one dataset than the other, but the overall trend looks like 3.7.14 allows slightly fainter objects to be detected (expected, because the background is lower). For the rest I match the two source catalogs and only use objects that have been matched to something closer than 0.5 arcseconds, so it should be very close to apples-to-apples.
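
For reference, a minimal sketch of that positional match using astropy (the RA/Dec arrays are placeholders for the coordinate columns of the two source catalogs):

```python
import numpy as np
import astropy.units as u
from astropy.coordinates import SkyCoord

def match_catalogs(ra1, dec1, ra2, dec2, max_sep_arcsec=0.5):
    """Match catalog-1 sources to their nearest catalog-2 counterparts within max_sep."""
    c1 = SkyCoord(ra=ra1 * u.deg, dec=dec1 * u.deg)
    c2 = SkyCoord(ra=ra2 * u.deg, dec=dec2 * u.deg)
    idx, sep2d, _ = c1.match_to_catalog_sky(c2)
    good = sep2d < max_sep_arcsec * u.arcsec
    # Indices into catalog 1 and their matched counterparts in catalog 2.
    return np.flatnonzero(good), idx[good]
```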

johnrpeterson commented 6 years ago

So, just to be clear, the changes from v3.7.9 to v3.7.14 affected the random number generation. This has more of an effect than you might expect, because everything, including global things like the sky background level, would be different (not just the details for individual photons). So you should expect different patterns and levels for almost anything in PhoSim. The only thing to worry about is new, unexpected results that didn't occur before, and as far as I can tell everything is OK.

sethdigel commented 6 years ago

> are the images with the diagonal structure the ones corresponding to the sensors close to the edge of the focal plane? And is the order of your 4x4 grid (left to right, top to bottom) the same as the list of paths you include in your post? I'll re-check the ratio.

Yes, the order in the 4x4 grid left-right/top-bottom is the same as in the list with the ratios. No, they do not seem to be the edge-most sensors. I was too lazy to lay them out like the focal plane images above. I may be able to try it tonight if it is worthwhile. I was mostly interested to see the shapes of the ratios.

fjaviersanchez commented 6 years ago

@sethdigel, in my opinion it is not necessary to lay them out in their correct positions on the focal plane. Any other thoughts on this, @cwwalter?