I think it is an optimization bug: because the source is near the edge of the chip and is a large extended source, the buffer isn't large enough and the optimizer is getting confused. I'll try to get a new patch out in ~24 hours. I'm pretty sure I know the problem.
OK thanks. Can you clarify one thing in more detail? Relative to the discussion above on BF I thought you said that the background photons were optimized but not the photons from the astrophysical objects. So, what is getting optimized here and is it the same optimization as is being applied to the background?
It's just photons off the chip that are optimized. It's making a mistake with that.
@sethdigel I have a question about the DC2 picture that you showed above. I see that the background is brighter (a step in the right direction) but what is the median value for the background? (We expect ~796 counts for sky-brightness 21.2 in r-band)
@fjaviersanchez, the median in this phoSim e-image is 451 electrons/pixel.
15 or 30 seconds?
Here are a couple of focal plane images assembled from phoSim visits that Tom simulated. I selected obsHistID = 40337 and 201828 because they have the largest number of sensors simulated (182 and 173, respectively). I assembled these in a kind of quick and dirty way - scaling down the individual sensor images by a factor of 8 in x and y, then placing them according to their Raft and Sensor numbers. The scaling of the images is linear and the range is selected to bring out the simulated airglow (at least I think that's what it is).
Visit 40337:
Visit 201828:
At this level I don't see any obvious problems. For visit 201828 one of the sensor images seems to have a dimmer background level than its neighbors. The distribution of electrons per pixel has strange periodic spikes. These seem to be associated with the bleed trails of the brighter stars, and I'd guess that it is not a problem.
Distribution for visit 40337:
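For anyone curious, the quick-and-dirty assembly is roughly equivalent to the following (a sketch, not the actual script; the filename pattern and raft/sensor axis conventions are guesses, and raft gaps are ignored):

```python
import re
import glob
import numpy as np
from astropy.io import fits

def bin8(image):
    """Average-bin an image down by a factor of 8 in each direction."""
    ny, nx = (s // 8 * 8 for s in image.shape)
    return image[:ny, :nx].reshape(ny // 8, 8, nx // 8, 8).mean(axis=(1, 3))

# 15 sensors across the focal plane x ~510 binned pixels each -> 7650 pixels.
plane = np.zeros((7650, 7650))
for fn in glob.glob("lsst_e_40337_f2_R*_S*_E000.fits.gz"):
    ry, rx = (int(c) for c in re.search(r"R(\d)(\d)", fn).groups())  # axis order is a guess
    sy, sx = (int(c) for c in re.search(r"S(\d)(\d)", fn).groups())
    small = bin8(fits.getdata(fn))
    y0, x0 = (ry * 3 + sy) * 510, (rx * 3 + sx) * 510
    plane[y0:y0 + small.shape[0], x0:x0 + small.shape[1]] = small
fits.writeto("assembled_40337.fits", plane, overwrite=True)
```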
> 15 or 30 seconds?
In case that was for me, each of Tom's phoSim simulations is a 30 s exposure.
@cwwalter @sethdigel Assuming that 21.57 in V-band corresponds to ~21.25 in r-band for a flat SED (I also checked with a MODTRAN atmosphere at airmass 1.2 and the equivalence is almost the same, but I might have a bug in my code), this means that we are getting half of what we expected, right? (Unless that's a 15 s exposure.)
It's either that, or the numbers here https://github.com/lsst-pst/syseng_throughputs/blob/master/plots/table2 and the predictions that we got for DC1 are both wrong (since they agreed with each other). Can we check the r-band sky brightness reported by OpSim for that visit, please?
Also, visit 40337, if it's a 30 s exposure, has around half of the expected sky level; visit 201828, however, looks about right.
BTW those images are so mesmerizing.
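To make the expectation explicit, here is the back-of-the-envelope check I'm doing (my own sketch; the zeropoint is an illustrative value back-solved from the ~796 e-/pixel figure, not an official number from the throughput tables):

```python
# Expected sky electrons/pixel for a sky surface brightness m_sky
# [mag/arcsec^2]. zp is the instrumental zeropoint (the AB mag giving
# 1 e-/s through the full system); 28.26 is illustrative only.
def sky_electrons_per_pixel(m_sky, zp=28.26, t_exp=30.0, pix_scale=0.2):
    e_per_s_per_arcsec2 = 10.0 ** (-0.4 * (m_sky - zp))
    return e_per_s_per_arcsec2 * pix_scale ** 2 * t_exp

print(sky_electrons_per_pixel(21.2))   # ~800, i.e. the ~796 quoted above
print(sky_electrons_per_pixel(21.25))  # what we'd expect for r ~ 21.25
```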
phoSim v3.7.3 is done and fixes the issue with bright extended sources reported a dozen comments earlier in this thread.
Excellent - thanks a lot, John!
That pixel value distribution looks really funny. I'm a little worried about it. If it is only pixels that are going to be masked, it may not be a big deal, but it is still strange.
@sethdigel or @TomGlanzman do we know what is causing the high spatial frequency power in the sky? Are there clouds? If it was just sky, I'd expect it to be smoother.
> @sethdigel or @TomGlanzman do we know what is causing the high spatial frequency power in the sky? Are there clouds? If it was just sky, I'd expect it to be smoother.
I was going to ask about this too. I agree it looks mesmerizing :) ! But @belaa did a study for the SSim group using the ESO sky model (which includes sky glow, correct @yoachim?) and I believe she found there were no variations over the scale of a single sensor like we see here.
Just a reminder that these first test runs intentionally used DC1-like configurations with the exception of adding in "quickbackground". For reference, here is the 'command file' with the physics overrides:
```
# commands.txt - phoSim commands/physics-overrides
# 11/13/2017 -- clone from DC1->DC2, add in 'quickbackground' config
# Enable centroid file
centroidfile 1
# Disable sensor effects
cleardefects
fringing 0
# Disable dirt
contaminationmode 0
# Set the nominal dark sky brightness
zenith_v 21.8
# Quick background
backalpha 0.1
backbeta 4.0
backgamma 1000.0
backdelta 1.0
activebuffer 600
# Number of sources handled by a single thread
sourceperthread 100
```
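For reference, phoSim picks up this file via its `-c` option, e.g. `phosim phosim_cat_40336.txt -c commands.txt`.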
And a representative instanceCatalog (without the astrophysical objects):
```
$ more phosim_cat_40336.txt
rightascension 89.1788674
declination -30.1024577
mjd 59634.0910186
altitude 67.2677826
azimuth 263.6360686
filter 2
rotskypos 5.2752557
camconfig 1
dist2moon 122.9044891
moonalt -19.2875356
moondec -21.8515217
moonphase 47.7539710
moonra 246.1223288
nsnap 1
obshistid 40336
rottelpos 102.3316946
seed 40336
seeing 0.5462840
sunalt -33.7896703
vistime 30.0000000
includeobj star_cat_40336.txt.gz
includeobj gal_cat_40336.txt.gz
includeobj agn_cat_40336.txt.gz
```
@sethdigel is it reasonable for me to ask you to make those same images with the DC1 runs of those same configurations?
Regarding performance of the recent DC1/DC2 tests, below is the distribution of execution times (elapsed time in minutes). The first peak, around 190 min, is largely due to the first three visits (which were the first three visits in DC1), while the later part of the distribution arises from the final eight (8) visits -- which were selected for their known long execution times.
It is interesting to note that despite bug fixes and quickbackground there is still a relatively wide distribution of execution times. The good news is that many jobs complete within ~3 hours, and even the very longest outlier jobs complete in <14 hours, well within the maximum 24-hour queue available at NERSC/KNL.
@SimonKrughoff The config for these runs did not (AFAIK) disable clouds (see above config file) so, yes, there should be clouds.
> @SimonKrughoff The config for these runs did not (AFAIK) disable clouds (see above config file) so, yes, there should be clouds.
Thanks. That's what it looked like to me too.
@jchiang87 it looks like a pile-up and then a sharp truncation at around 500 min. Is that real?
@SimonKrughoff - Yes, I'll do that today. For 40337 an assembled image for an early DC1 attempt (when stars and galaxies were accidentally omitted) is posted here (in issue SSim_DC1#25 if the link does not work). The pattern of the glow looks the same by eye. At the time we called it air glow.
@TomGlanzman Thanks for the CPU time plots. The final 8 visits were not selected for long execution times in DC1. The criteria I used are listed in the table at #19.
@fjaviersanchez I'll look up the OpSim metadata unless someone beats me to it.
The assembled FITS images for visits 40337 and 201828 (plus 194113, which I did not post) are in /global/homes/d/digel/DC2 at NERSC. I have tried to open the directory and files for read access in case anyone wants to have a look. They are crudely assembled, and binned down to 64x fewer pixels (so they are 7650x7650). They have absolutely no coordinate information at all in the headers - I don't represent them as being good for anything but looking at.
@sethdigel Thanks for that clarification. Visit #4 should have known long-running (non-DC1) sensor-visits. The remaining seven (7) should, if I am reading the table correctly, contain at least some sensor-visits with relatively long execution times. Do you happen to have a list of the longest sensor-visit times handy? If not, I can dig them out but it will take a bit of time.
> It is interesting to note that despite bug fixes and quickbackground there is still a relatively wide distribution of execution times. The good news is that many jobs complete within ~3 hours, and even the very longest outlier jobs complete in <14 hours, well within the maximum 24-hour queue available at NERSC/KNL.
For the DC1 visits from @sethdigel's list can we get the DC1 and 3.7/quickbackground run times for comparison? This is one of the reasons we did the test.
@fjaviersanchez Here is the OpSim metadata that I have handy for the DC1 visits that Tom ran in the DC2-phoSim-1 task. (1668469 was a Twinkles visit; I don't have the moonBright value handy for that.). The Stream numbers below are useful for looking up the output from that task. If you need something else I could get it out of the OpSim database.
Stream | obsHistID | moonAlt | moonBright | moonPhase | vSkyBright | airmass |
---|---|---|---|---|---|---|
000000 | 40336 | -19.3 | 0.0 | 47.8 | 21.52 | 1.081 |
000001 | 40337 | -19.2 | 0.0 | 47.7 | 21.53 | 1.063 |
000002 | 40338 | -19.1 | 0.0 | 47.7 | 21.55 | 1.048 |
000003 | 1668469 | -0.6 | | 30.6 | 21.25 | 1.125 |
000004 | 270676 | 19.4 | 180.1 | 76.7 | 20.28 | 1.062 |
000005 | 194113 | 25.1 | 93.3 | 45.1 | 20.69 | 1.119 |
000006 | 220091 | -49.7 | 0.0 | 0.1 | 21.26 | 1.479 |
000007 | 220090 | -49.7 | 0.0 | 0.1 | 21.29 | 1.437 |
000008 | 233988 | -7.9 | 4.2 | 24.1 | 21.51 | 1.022 |
000009 | 201828 | -31.5 | 0.0 | 31.6 | 21.39 | 1.259 |
000010 | 300306 | 4.1 | 37.6 | 38.2 | 21.11 | 1.098 |
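For anyone who wants to pull these themselves, the lookup is a one-liner against the OpSim sqlite database (a sketch; the database filename and column names are assumptions about the OpSim schema, so adjust for the actual version):

```python
import sqlite3

conn = sqlite3.connect("minion_1016_sqlite.db")  # DC1 OpSim db; path is an assumption
row = conn.execute(
    "SELECT obsHistID, moonAlt, moonPhase, filtSkyBrightness, airmass "
    "FROM Summary WHERE obsHistID = ?", (40337,)).fetchone()
print(row)
```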
@sethdigel According to your table above, 40337 has the moon below the horizon, right? If those structures were clouds, shouldn't they only be brighter than the sky if the moon was up?
I do not think that they are clouds, although looking back at the SSim_DC1 issue 25, I guess that the question about whether it is airglow was left up in the air, so to speak.
If they aren't clouds, then the size of the spatial features probably needs to be investigated, right?
Yes, I suppose so. I am not an expert. Looking at the power spectrum and simulated airglow image in figure 11.2 of the PhoSim Reference Guide I'd say that the angular scales of the features are consistent with what was intended.
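A crude quantitative check would be an azimuthally averaged power spectrum of a background-dominated e-image, along these lines (my own sketch with an illustrative filename, not the analysis behind the Reference Guide figure):

```python
import numpy as np
from astropy.io import fits

image = fits.getdata("lsst_e_40337_f2_R22_S11_E000.fits.gz").astype(float)
fluct = image - np.median(image)
power2d = np.abs(np.fft.fftshift(np.fft.fft2(fluct))) ** 2

# Azimuthal average: bin the 2D power by integer radius in frequency space.
ny, nx = power2d.shape
y, x = np.indices((ny, nx))
r = np.hypot(x - nx // 2, y - ny // 2).astype(int)
counts = np.bincount(r.ravel())
profile = np.bincount(r.ravel(), weights=power2d.ravel()) / np.maximum(counts, 1)
# profile[k] is the mean power at spatial frequency ~k / (nx * 0.2 arcsec).
```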
Here is a comparison of the DC2-era and DC1-era (DC1-phoSim-3) versions of the 40337 visit. The images each show a very narrow range of electrons/pixel, with linear scaling. The DC2 version (on the left) is 400-520. The DC1 version is 200-320. The shift of the range by 200 electrons corresponds to the difference between the median values of the two images (475 for DC2 and 272 for DC1). The airglow (or whatever it is) seems quite a bit brighter in the DC2 image. The spatial structure looks about the same.
The DC1 file is assemble_40337_ge10_DC1.fits in the directory listed above.
Clouds only contribute extinction in phosim, as far as I know. The structure in the airglow is higher frequency than I'd expect for bluer bands.
Here is a comparison of the DC2 (left) and DC1 versions of the 201828 visit. The DC2 image has a median of 678 electrons/pixel and the DC1 image has median 115 electrons/pixel. The scaling is linear, from 610 to 740 for the DC2 image and from 50 to 180 for the DC1 image. Here too the glow is much brighter in the DC2 version. The non-uniform glow, if present in the DC1 image, is much fainter. As for all of the DC1 images (and the DC2-phoSim-1 counterparts), these are r band.
The DC1 file is assemble_201828_ge10_DC1.fits in the directory listed above.
Here is a comparison of the wall clock times for the DC2-phoSim-1 sensor visits and their DC1-phoSim-3 counterparts. Specifically, the comparison is for the wall clock times of the RunRaytrace step of these tasks, which is the step that takes the great majority of the CPU time. The comparison includes all of the streams that Tom ran, except for obsHistID 1668469 (which is not part of DC1 and had only 3 sensor visits in DC2-phoSim-1). So the comparison is based on 953 sensor visits. I matched them up by finding the mapping between Stream numbers and obsHistIDs, and taking advantage of the encoding scheme that Tom uses to include a raft ID and sensor ID in the Stream names. I am deriving the wall clock times from the starting and ending times reported in the Pipeline status page (e.g., this page for the DC2-phoSim-1 sensor visits).
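Schematically the matching is just a join (a sketch only, assuming the status-page tables have been exported to CSVs with these hypothetical file and column names):

```python
# Hypothetical CSV exports of the Pipeline status pages, one row per
# RunRaytrace step, with columns: obsHistID, raft, sensor, wall_hrs.
import pandas as pd

dc1 = pd.read_csv("dc1_phosim3_raytrace.csv")
dc2 = pd.read_csv("dc2_phosim1_raytrace.csv")

# Match DC1 and DC2 sensor visits on (obsHistID, raft, sensor), which is
# what the Stream-name encoding provides, and form the wall-clock ratio.
merged = dc2.merge(dc1, on=["obsHistID", "raft", "sensor"],
                   suffixes=("_dc2", "_dc1"))
merged["ratio"] = merged["wall_hrs_dc1"] / merged["wall_hrs_dc2"]
```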
The results are not what I had expected based on what I thought I understood about the speed-up. I have not tried to compare the CPU times, because I figured wall clock was a better indicator of throughput, but if it would be useful I could do it.
The plot below shows the distribution of wall clock times for the sensor visits included in the comparison. Blue dashed is DC2, red solid is DC1. The DC2 histogram matches the one that Tom posted above; the missing tail is the 3 sensor visits for 1668469.
The distribution of wall clock times for the DC1 sensor visits is clearly much broader, and in a number of cases the wall clock times in DC1 were less.
The plot below compares the wall clock times for matching sensor visits.
Here I have crudely color coded the sensor visits that correspond to the same obsHistID. For these, the observing conditions are of course the same, so the range in relative wall clock times is presumably due to the source content. I cannot say that I worked very hard to make sure that the colors for the 10 obsHistIDs included are distinguishable. The dashed line has unit slope. The maximum DC2 wall clock time is much less than the maximum DC1 wall clock time.
The plot below shows the ratio of wall clock times (DC1/DC2) vs. DC2 wall clock.
The greatest gains in throughput are for the sensor visits at the shorter wall clock times. The overall average ratio (total DC1 wall clock)/(total DC2 wall clock) = 2.09.
The table below summarizes the relative wall clock times for the individual Streams.
Stream | obsHistID | Avg. DC1/DC2 Wall | std. dev. | Avg. DC2 Wall (hrs) | # sensor visits | # with ratio < 1 |
---|---|---|---|---|---|---|
0 | 40336 | 2.69 | 1.89 | 2.64 | 19 | 4 |
1 | 40337 | 2.50 | 1.34 | 2.84 | 182 | 7 |
2 | 40338 | 2.63 | 1.21 | 3.49 | 102 | 2 |
4 | 270676 | 2.23 | 0.36 | 7.52 | 21 | 0 |
5 | 194113 | 1.51 | 0.48 | 5.21 | 125 | 13 |
6 | 220091 | 1.82 | 1.30 | 3.52 | 117 | 46 |
7 | 220090 | 1.47 | 1.22 | 3.65 | 69 | 34 |
8 | 233988 | 1.90 | 0.84 | 3.39 | 35 | 0 |
9 | 201828 | 1.95 | 1.25 | 3.56 | 173 | 50 |
10 | 300306 | 2.85 | 1.43 | 2.98 | 110 | 0 |
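For reference, the per-visit entries in the table can be produced from the hypothetical `merged` DataFrame in the sketch a few comments up, roughly:

```python
# Per-obsHistID summary analogous to the table above (continues the
# `merged` DataFrame from the earlier matching sketch).
summary = merged.groupby("obsHistID").agg(
    avg_ratio=("ratio", "mean"),
    std_dev=("ratio", "std"),
    avg_dc2_wall_hrs=("wall_hrs_dc2", "mean"),
    n_sensor_visits=("ratio", "size"),
    n_ratio_lt_1=("ratio", lambda r: (r < 1).sum()),
)
print(summary)
```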
A table further up in this issue lists the OpSim metadata for these visits. 270676 was selected to have the maximum moonBright (combination of moonAlt and moonPhase). 201828 had the median CPU time for all the DC1 visits, and 300306 was selected to represent an average DC1 visit (in terms of CPU time).
@sethdigel A quick question about your table -- what does the column "Avg. DC2 Wall" represent? Is it the number of minutes per sensor visit? The main question is whether the average time per visit you are seeing is the same as what Tom saw for a DC1 configuration (about 5 min/visit).
@sethdigel
> The DC2 image has a median of 678 electrons/pixel and the DC1 image has median 115 electrons/pixel.
I'm sure you already know but for others trying to make sense of the numbers: remember that in DC1 the phoSim background level was down by a factor of something like 8 (it varied in a complicated way that went with the altitude) because of a bug. That bug should be fixed now so we expect the numbers to be different. @fjaviersanchez's questions above were about validating if the new numbers make sense.
For the sky glow: we should understand both the magnitude level and if sensor level spatial variations make sense.
> The results are not what I had expected based on what I thought I understood about the speed-up. I have not tried to compare the CPU times, because I figured wall clock was a better indicator of throughput, but if it would be useful I could do it.
If it is not too much work, it would probably be good to add that to the table to make sure that the ratio of wall/CPU time isn't radically different now. Both are of course important. I think CPU time is the relevant one for how we are charged, correct?
@sethdigel @cwwalter The charging is proportional to wall-clock, but I am more interested right now in the single visit time estimate. Once @sethdigel answers my question, we will know if there's a real issue or not.
@salmanhabib "Avg. DC2 Wall" is the average wall clock time in hours for the indicated visits. So for Stream 0 above, the average wall clock time for the 19 sensor visits was 2.64 hours (158 minutes). I've added the units to the table - sorry for the omission.
In the histogram that Tom posted above for his recent runs (DC2-phoSim-1, which were repeats of DC1 visits) the overall mean wall clock time per sensor visit is listed as 219 minutes. This is consistent with what I found (from the same information). I think that the effective 5 minutes wall clock per sensor visit resulted from running many visits in parallel.
@cwwalter I'll work up some CPU time comparisons. I think it would be easy to make a table of the median electrons/pixel for the DC1 and DC2 visits; it sounds like that might be useful. I also have in mind to look at instance catalog contents for the sensor visits at the extremes of the DC1/DC2 wall clock time ratios, to see if I learn anything. That would take longer.
@sethdigel Ok, so let me state this in a way to remove confusion. On a single KNL node, the current method is to run 34 independent visits, in which case one gets the 5 minute/visit number (depending on the moon, etc.) that Tom quoted and which is consistent with all other tests. As long as this is the case, we are (more or less) fine with the current allocation request. Has this been taken into account in your table or not? For DC2 we will run the KNLs fully loaded, i.e., 34 independent jobs running on each node at a time.
Also, I am still puzzled by the numbers in the table. If I look at 40336, I get 8.4min/visit, but with 40337, I get 0.94min/visit. How different are these two instances?
@salmanhabib Yes, it would take running of the order of 34 independent sensor visits at the same time to get ~5 minutes wall clock per sensor visit throughput. Regarding the table, for a given obsHistID the 'Avg. DC2 Wall' entry is derived by summing the wall clock times for each of the sensor visits with that obsHistID, then dividing by the number of sensor visits. So on average, each individual sensor visit for 40336 took 2.64 hours wall clock time.
@sethdigel Sorry for harping on this point, but when you mean an individual visit, do you really mean time to do a single ray-trace job for a single chip? I don't understand how this number can be in the hours. Maybe we should take this offline.
Yes, I meant individual sensor visit. I've edited the comment to include 'sensor' before 'visit'.
So should I be dividing this time by 34 (the number of independent visits that can be run on a KNL) to make more sense of it? That would give me 4.7min per individual sensor visit for the 2.64 hours quoted for 40336.
It depends on what you mean by making more sense. The charts and tables have wall clock times for ray tracing individual sensor visits. Dividing by 34 would give you the effective wall clock time per sensor visit for a (fully loaded, I guess) KNL node. You know better than I do which is more relevant for evaluating resource requirements.
Hmmm -- ok, the numbers we used for allocating resources are still fine I think.
Here are plots and a table comparing CPU times for the DC2-phoSim-1 and corresponding DC1-phoSim-3 sensor visits. These are analogous to the plots and tables above, but with wall clock times replaced with CPU times. (One of the DC1 sensor visits, for Task 127.20.1 - obsHistID 233988 - had a recorded CPU time of -2 sec. It is one of the 55 out of ~176k DC1 sensor visits that has this spurious value. I've left it in the plots below.) The color coding is as for the plots above. These plots and table are just for reference; as Salman pointed out, NERSC charges are related to wall clock time.
In terms of CPU time, the DC1 sensor visit times have a broader distribution than the DC2 sensor visit times. In general the DC1 sensor visit CPU times are less than those for the DC2 sensor visits. The sensor visits that use the most CPU time in both the DC1 and DC2 runs are for obsHistID = 270676, for which the Moon was brightest.
In the table below Avg. DC2 CPU hours is per sensor visit.
Stream | obsHistID | Avg. DC1/DC2 CPU | std. dev. | Avg. DC2 CPU hours | # sensor visits | # with ratio < 1 |
---|---|---|---|---|---|---|
0 | 40336 | 0.78 | 0.28 | 20.23 | 19 | 16 |
1 | 40337 | 0.80 | 0.20 | 21.77 | 182 | 151 |
2 | 40338 | 0.80 | 0.18 | 26.49 | 102 | 87 |
4 | 270676 | 1.45 | 0.10 | 58.98 | 21 | 0 |
5 | 194113 | 0.81 | 0.10 | 40.29 | 125 | 122 |
6 | 220091 | 0.45 | 0.18 | 27.21 | 117 | 115 |
7 | 220090 | 0.38 | 0.17 | 28.19 | 69 | 69 |
8 | 233988 | 0.74 | 0.18 | 26.19 | 35 | 33 |
9 | 201828 | 0.47 | 0.17 | 27.53 | 173 | 171 |
10 | 300306 | 1.29 | 0.21 | 22.43 | 110 | 4 |
The table below compares the electrons/pixel in the DC1-phoSim-3 and DC2-phoSim-1 versions of the same visits. For each obsHistID, I evaluated the median electrons/pixel in each of the corresponding sensor visits in the DC1 and DC2 versions and tabulated the minimum, maximum, and average of these medians (a sketch of this bookkeeping follows the table). So for obsHistID 40336, of the 19 sensor visits, the median electrons/pixel ranged from 137 to 228 with an average of 206.6. I guess that I'm not surprised that there's a significant variation across the sensors in a given visit, as the airglow (or whatever it is) component of the background is not uniform.
Stream | obsHistID | min DC1 | max DC1 | avg DC1 | min DC2 | max DC2 | avg DC2 |
---|---|---|---|---|---|---|---|
0 | 40336 | 137 | 228 | 206.6 | 385 | 478 | 447.1 |
1 | 40337 | 162 | 296 | 271.9 | 283 | 528 | 479.1 |
2 | 40338 | 253 | 444 | 418.7 | 519 | 660 | 627.1 |
4 | 270676 | 3540 | 5855 | 5483.5 | 1512 | 1837 | 1742.4 |
5 | 194113 | 934 | 1668 | 1549.6 | 997 | 1289 | 1222.3 |
6 | 220091 | 72 | 123 | 117.0 | 663 | 847 | 797.5 |
7 | 220090 | 90 | 114 | 108.1 | 727 | 920 | 868.9 |
8 | 233988 | 363 | 457 | 436.7 | 497 | 619 | 592.0 |
9 | 201828 | 71 | 123 | 115.9 | 555 | 736 | 684.6 |
10 | 300306 | 487 | 816 | 782.7 | 392 | 498 | 473.9 |
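The bookkeeping mentioned above amounts to something like this (a sketch; the filename pattern is illustrative):

```python
# Median electrons/pixel for each sensor e-image of a visit, then the
# min/max/mean of those medians (here for obsHistID 40336).
import glob
import numpy as np
from astropy.io import fits

medians = [np.median(fits.getdata(f))
           for f in sorted(glob.glob("lsst_e_40336_f2_R??_S??_E000.fits.gz"))]
print(min(medians), max(medians), np.mean(medians))
```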
This might or might not still be relevant with the new release of phoSim, but here is another example of a fairly bright source that has isolated bright pixels. The differences from the earlier example are 1) the source is a star, not a large galaxy, and 2) it is near the center of the CCD.
The image below is scaled logarithmically. It is excised from lsst_e_220090_f2_R20_S02_E000.fits.gz. (The bleed trail goes only to the left, I guess because the center of the CCD has a charge stop; the right edge of the saturated region is at pixel 2000.) The background level around the star is ~1000 electrons/pixel. The individual bright pixels around the star have ~2600 electrons/pixel.
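For anyone who wants to look at the same region, the cutout and scaling are roughly as follows (the pixel bounds are guesses on my part, not the exact script used for the figure):

```python
# Log-scaled cutout around the saturated star in the e-image named above.
import numpy as np
import matplotlib.pyplot as plt
from astropy.io import fits

img = fits.getdata("lsst_e_220090_f2_R20_S02_E000.fits.gz")
cutout = img[1700:2300, 1500:2300]  # guessed region; saturated edge near x=2000
plt.imshow(np.log10(np.clip(cutout, 1, None)), origin="lower", cmap="gray")
plt.colorbar(label="log10(electrons/pixel)")
plt.show()
```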
This is the definition of the star in the instance catalog:
```
object 470457101316 91.2528067 -29.2427727 13.7538702 starSED/kurucz/km20_4250.fits_g00_4370.gz 0 0 0 0 0 0 point none CCM 0.102673931 3.1
```
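(For reference, as I read the phoSim instance catalog format, the fields here are: object ID, RA, Dec, magnorm = 13.75, the Kurucz SED file, redshift/shear/offset placeholders (all 0), the `point` spatial model, no internal extinction, and CCM Galactic extinction with A_v ≈ 0.103, R_v = 3.1. Treat this parsing as my understanding rather than authoritative documentation.)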
This does look like another bug. @johnrpeterson can you look into this? The truncated bleed trail also looks weird. I'd expect it to spread along the channel stop, but I guess the physics in this case is not that intuitive.
It looks like the optimizer is too aggressive for this case, rather close to the charge stop as well. This should be easy to fix.
As for the variation: it's a combination of clouds and airglow. I think the airglow tends to have the smaller-scale spatial variation.
Since we will not be able to generate the full protoDC2 data set before the Sprint Week, we would like to produce an initial subset that would still be useful for the Working Groups in the near term.
We would like to do the full 25 sq degrees, so downscoping for the initial stage would mean fewer bands and a shorter observation time frame.
Questions: