LSSTDESC / SSim_DC1

Configuration, production, validation specifications and tools for the DC1 Data Set.

DC1 phoSim production #25

Closed TomGlanzman closed 7 years ago

TomGlanzman commented 7 years ago

This issue is intended to be a continuous log of the DC1 phoSim production at NERSC.

To start things off, a summary update of this project was given on Monday (12 Dec 2016) in the DESC-CI meeting (https://confluence.slac.stanford.edu/x/SryMCw). The initial workflow is being developed to include the following features:

Would like to get the first test runs going in the next week or so...but possibly not until after the holidays.

Stay tuned!

johnrpeterson commented 7 years ago

soon, i think she was adding them together and looking at them carefully first. will be on both globus and NCSA when she is done.

john

On Feb 7, 2017, at 2:21 PM, SimonKrughoff wrote:

This is not an abstract argument, as En-Hsin already ran the flats this way, which is as large as a data challenge.

@johnrpetersonhttps://github.com/johnrpeterson when are these going to show up someplace? I don't see them either on globus or at NCSA.


SimonKrughoff commented 7 years ago

soon, i think she was adding them together and looking at them carefully first.

Just to be clear, on NCSA we will have the raws, right? I don't think we can use the aggregated ones.

johnrpeterson commented 7 years ago

yup


TomGlanzman commented 7 years ago

@johnrpeterson thanks for the advice. I do worry about the long-term stability of attempting to deconstruct phosim.py/cluster_submit.py. Out of curiosity, do you see something wrong in what we did? And can you be more explicit about how you suggest we use cluster_submit.py so that we can integrate it with the workflow engine?

Note that the current (beta?) version of cluster_submit.py we have available has only these options:

```
$ python cluster_submit.py --help
cluster_submit.py: v1.0
/global/u2/g/glanzman/phosim_cluster/cluster_submit.py: v1.0
Usage: cluster_submit.py dagManFile [ ...]

Options:
  -h, --help                         show this help message and exit
  -w WORKDIR, --work=WORKDIR         temporary work directory
  -o OUTPUTDIR, --output=OUTPUTDIR   output directory
```

Possibilities for using cluster_submit.py that come to mind:

1) Use cluster_submit.py as-is but somehow squash the submitted jobs before they run, then collect and modify the batch parameters in the .submit files, and finally submit them as we need to.

2) Modify cluster_submit.py to not submit batch jobs, and then proceed as 1) above

3) Modify cluster_submit.py to have additional options for specifying batch options, e.g., runtime limit, partition, memory, job dependencies (as that is built into the workflow engine), as well as not to submit batch jobs, then proceed as above.

4) Take the concepts of cluster_submit and roll our own (this is what we have already done).

There are probably other configurations, but having looked into cluster_submit.py it is not clear how we can use it as-is. The workflow engine is likely unfamiliar to you, so perhaps we could discuss this and decide upon a rational course of action?

TomGlanzman commented 7 years ago

@johnrpeterson one other question about using cluster_submit.py: how should we specify the list of sensors to be simulated? Humna's visit database has, for each visit, a variable-length list of sensors. Will combining "phosim.py ... -s <list of sensors>" with cluster_submit.py give us the desired result? I am a bit worried given your earlier admonition not to use -s in production due to certain inefficiencies.

And if using -s is not the answer, how do you recommend we meet this requirement?

(The impact of simulating only those sensors of interest is nearly a factor of 2, so is quite significant.)

sethdigel commented 7 years ago

Comparing phosim_40336.txt (the master InstanceCatalog) and star_cat_40336.txt (the catalog of stars): the stars are centered on the nominal PhoSim field of view.

And I see that the instance catalogs cover a region with radius 4 deg, so there's no chance that they could miss any particular CCD in the field of view. (cori is back online.)

johnrpeterson commented 7 years ago

Tom-

First, a disclaimer: these comments only apply to the massive data challenges. when you or anyone else is doing something individually or at smaller scale, please use phosim as creatively as you like, with whatever settings you want. i'm only concerned with effectively using the 1 million+ CPU hours.

Ok, yes, please do option 1, 2, or 3, but not 4. This is partly to make sure we make no mistakes and can debug errors and cluster inefficiencies, and partly because both cluster_submit and phosim.py may change in the future, so it will be more efficient for all of us.

Whether you go with #3 or #1/2 depends on how much workflow stuff you need there. If it's just a few minor command changes, then certainly work with Glenn to figure out what you have to add, and then i think #3 is better. If it's a lot of stuff, then just put the output of cluster_submit into a file, parse that file, and add your appropriate workflow-related commands; then you are doing #1/2 (i don't see a lot of difference between 1 and 2).
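For reference, a minimal sketch of the capture-and-parse route described above, with hypothetical file names and a purely illustrative rule for which captured lines get workflow bookkeeping added (cluster_submit.py itself only provides the options shown earlier in this thread):

```python
import subprocess
from pathlib import Path

# Paths below are placeholders; adjust for the actual installation and visit.
CLUSTER_SUBMIT = "/global/u2/g/glanzman/phosim_cluster/cluster_submit.py"
DAGMAN_FILE = "dag_40336.txt"                       # placeholder DAGMan-style input
CAPTURE_FILE = Path("cluster_submit_output_40336.txt")

# Run cluster_submit.py and capture its output instead of acting on it directly.
result = subprocess.run(
    ["python", CLUSTER_SUBMIT, DAGMAN_FILE, "-w", "work_40336", "-o", "output_40336"],
    capture_output=True, text=True, check=True,
)
CAPTURE_FILE.write_text(result.stdout)

# Post-process: tag each captured line that looks like a batch submission
# (the matching rule here is purely illustrative, not cluster_submit.py behavior).
augmented = []
for line in CAPTURE_FILE.read_text().splitlines():
    if "sbatch" in line or ".submit" in line:
        augmented.append("# workflow-engine bookkeeping would go here")
    augmented.append(line)
Path("augmented_submit_40336.txt").write_text("\n".join(augmented) + "\n")
```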

We can certainly talk next week as well.

john


johnrpeterson commented 7 years ago

Tom-

yes, so i am only against doing "-s looping" in large data challenges because, as i said last year, that is an unintended use of phosim.py which results in large memory, I/O, and cpu-parallelization inefficiency factors.

so the main problem is that the looping is done outside of phosim.py (not the -s option itself). it is fine, and you will get no reduction in optimal efficiency, if you use a single run of phosim.py and pass the whole string of chips in the -s option.

so in other words, here’s what it comes down to. if you only wanted to do say 3 chips:

BAD:

```
phosim catalog -s R22_S11
phosim catalog -s R22_S12
phosim catalog -s R22_S21
```

GOOD:

```
phosim catalog -s 'R22_S11|R22_S12|R22_S21'
```

so in other words, let phosim.py control the looping.

[again this comment only applies to massive data challenges; running this locally, it doesn’t matter]
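For illustration, a minimal sketch of the recommended single-invocation pattern, assuming the per-visit sensor list is already available as a Python list (e.g. pulled from Humna's visit database); the catalog name and phosim.py location are placeholders:

```python
import subprocess

# Hypothetical inputs: the per-visit sensor list and the instance catalog.
sensors = ["R22_S11", "R22_S12", "R22_S21"]
instance_catalog = "phosim_40336.txt"

# One phosim.py invocation with a pipe-separated sensor string, so the looping
# over chips stays inside phosim.py (the "GOOD" pattern above).
sensor_arg = "|".join(sensors)
subprocess.run(
    ["python", "phosim.py", instance_catalog, "-s", sensor_arg],
    check=True,
)
```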

john


sethdigel commented 7 years ago

Here's a crudely assembled focal plane image for the 10th visit (obsHistID 193157), for which the altitude of the Moon was +26.5 deg. The Moon is clearly having an effect on the sky brightness (and the phoSim ray-tracing steps took several times more CPU time than for visit 40337 above, when the Moon was below the horizon). One version of the image has histogram-equalization scaling and the other has linear scaling (which shows the vignetting better).

[Images: assemble_193157 (histogram equalization), assemble_193157_linear (linear scaling)]

The hole in the images is at position R23_S10. This position is in the list of chipNames for this visit in Humna's pickle file, so presumably it was simulated. I haven't parsed the organization of the Streams for the DC1-phoSim-2 run yet to find the log for the job that handled simulating the data for that position. As far as I can tell, none of the Raytrace jobs crashed. A number did terminate when the cori file system went offline.

sethdigel commented 7 years ago

Regarding the missing image for R23_S10 in the visit shown above, the corresponding Stream (10.11.3) is also missing from the RunRaytrace step here. It is just a gap in the sequence. It is not listed among the terminated streams either. In the corresponding earlier launchSensor-jy step for this visit, no substream for this sensor is created. So it is acting as if that sensor is not on the list of included sensors for this visit (although as noted above R23_S10 is on the list).

[Added later:] I see that the workflow step that generates the trimcat files makes them for every sensor in the focal plane, regardless of whether they are on Humna's list. So the drop-out of R23_S10 does not happen before the RunRaytrace step.

egawiser commented 7 years ago

@sethdigel that's helpful to know. I noticed in your earlier image from Feb. 6 (one has to scroll up slowly to see it) there's also a single central sensor missing, but it appears to be a different one. If this is not a coincidence, could there be some kind of "end of the loop" bug where the last sensor listed doesn't get handled right?

sethdigel commented 7 years ago

Good question. It is not the last sensor in the raft. The pipeline is going through them in order 00, 01...21, 22 when it makes the tasks. I was thinking of setting up a systematic comparison between the sensors in Humna's lists and the ones that actually got simulated for each visit to see if that reveals something.

sethdigel commented 7 years ago

@cwwalter I think that the problem is in how we have decided to "manually" run the tasks that are normally run by phosim.py. If it at all makes sense to run the workflow scripts on a local machine, I would recommend that as the way to debug things. Not knowing anything about what goes on under the hood of the workflow, I cannot say more.

I think that @danielsf is right about the issue. The RunRaytrace step in the pipeline executes the phoSim raytrace code. In phosim.py, raytrace takes input from a file called raytrace_*.pars, and only that file. (* stands for a combination of the observation ID, chip ID, and something called eid.) The raytrace_*.pars files generated in the pipeline (for example, the one at /global/cscratch1/sd/desc/Pipeline-tasks/DC1-phoSim-2/000000/work/raytrace_40336_R21_S22_E0000.pars) do not include any source definitions. In the phosim.py script, the raytrace_*.pars file is compiled from several other .pars files that define, e.g., the observation metadata, the atmosphere, etc., as well as a file called trimcatalog_*.pars (where here * stands for the observation ID and the chip ID). The pipeline is generating these trimcatalog_*.pars files but not incorporating their contents into the raytrace_*.pars. I think that is the reason why the generated images have only the background sources.
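For illustration, a minimal sketch of the kind of concatenation being described, with hypothetical file names and assuming the .pars files are plain text that can simply be appended:

```python
from pathlib import Path

# Hypothetical file names following the patterns quoted above.
raytrace_pars = Path("raytrace_40336_R21_S22_E0000.pars")
trimcatalog_pars = Path("trimcatalog_40336_R21_S22.pars")

# Append the per-chip source definitions to the raytrace parameter file, so
# that raytrace sees the astrophysical objects and not just the background.
with raytrace_pars.open("a") as out:
    out.write(trimcatalog_pars.read_text())
```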

johnrpeterson commented 7 years ago

…please switch over to cluster_submit.py. we can’t possibly debug this kind of hacking.

john


TomGlanzman commented 7 years ago

Folks, a quick update on phoSim running for DC1.

First, after discussion with @johnrpeterson, I have come to an understanding and agreed to move to using the cluster_submit.py script for future projects. Project-related customizations may still be needed for large-scale production, but I hope generic changes can be rolled back into the official version of this script. For the upcoming DC1 project, I will work both to get the current system running and to begin the migration to cluster_submit.py, giving the former priority as time is critical.

Next: The Case of the Missing Stellar Objects has been solved. This was my bug: I had not caught the proper way to concatenate the trimmed catalog into the parameter file used by raytrace. Indeed, not a single star, AGN, or galaxy was simulated in the first set of trial runs. This has been fixed and tests are underway. Note: this is exactly the sort of problem the phoSim team worries about when they hear about deconstructing their scripts, as it requires using undocumented 'internals' of their system -- which could change without notice in the next release. And that is also why I will begin the migration to using cluster_submit.py.

Next: The Case of the Missing Sensors has been solved. This was really quick and easy to solve once @sethdigel mentioned the problem. [For the curious, this was a bug in splitting a long python list. It affected only visits that required more than 100 sensors to be simulated, and caused the 100th sensor to be dropped. Once the list of sensors was extracted from Humna's database, it had to be communicated back to the workflow engine for delivery to downstream workflow steps. A limitation of the available workflow mechanism, which uses email to transfer such data, is that no single string of data may exceed about 990 bytes. As sensor IDs are of the form Rxx_Syy, an entire focal plane of 189 sensors would exceed this limit, so the list had to be split.] It is worth pointing out that this bug was in no way connected with phoSim or the way I have deconstructed phoSim into components; using cluster_submit.py would have exposed exactly the same problem.
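For illustration, a minimal sketch of a splitter that respects the ~990-byte limit without dropping any sensor (the function name and the dummy ID list are purely illustrative, not the actual workflow code):

```python
def chunk_sensor_list(sensors, max_bytes=990, sep=","):
    """Split a list of sensor IDs (e.g. 'R23_S10') into delimiter-joined
    strings of at most max_bytes bytes each, without dropping any sensor."""
    chunks, current = [], []
    for sensor in sensors:
        candidate = sep.join(current + [sensor])
        if current and len(candidate.encode()) > max_bytes:
            chunks.append(sep.join(current))   # close the full chunk...
            current = [sensor]                 # ...and start a new one with this sensor
        else:
            current.append(sensor)
    if current:
        chunks.append(sep.join(current))
    return chunks

# Quick check with dummy IDs (not the real focal-plane layout): nothing is lost.
dummy = ["R{}{}_S{}{}".format(i, j, k, m)
         for i in range(5) for j in range(5) for k in range(3) for m in range(3)]
parts = chunk_sensor_list(dummy)
assert sum(len(p.split(",")) for p in parts) == len(dummy)
assert all(len(p.encode()) <= 990 for p in parts)
```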

Finally, another problem which had been concealed by the previous problem has been fixed. This had to do with the particular collection of SEDs needed by the instanceCatalog. We had gotten into the habit of automatically installing the "Summer 2012" collection of SEDs. However, we need, instead, the collection of SEDs shipped with the Winter 2016 lsst_sims release. (Thanks to @danielsf for pointing this out.)

I have a single sensor test running at present and, if all looks good, I plan to try running a few new trial visits over the coming days. [I am on vacation in Oregon at the moment, so this may or may not happen before next Monday.]

sethdigel commented 7 years ago

Tom has the next five planned visits running. None of them is entirely complete yet, but many single-sensor images have finished, and they clearly have stars and galaxies in them. Here's the image for the R23_S12 sensor in the visit with obsHistID 194112 (log scaled and a bit truncated at the bright end):

[Image: 194112_f2_r23_s12_e000]

Note that ds9 complains about not finding a longitude axis specification that it recognizes ("A latitude axis ('DEC--TAN') was found without a corresponding longitude axis"). The CTYPE1 keyword is set to 'RAC--TAN'; it should be 'RA---TAN'. This is a bug that is fixed in PhoSim v3.6.1, as John announced on Friday. Running these images through DM will probably require manually fixing this keyword.
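For reference, a minimal sketch of patching the keyword with astropy.io.fits before feeding images to DM; the file name is an assumed example following the naming pattern used in these runs, and the sketch assumes an uncompressed working copy of the image:

```python
from astropy.io import fits

# Assumed example name; work on an uncompressed copy of the .fits.gz output.
path = "lsst_e_194112_f2_R23_S12_E000.fits"

# Rewrite the malformed WCS keyword written by phoSim v3.6.0 in place.
with fits.open(path, mode="update") as hdul:
    header = hdul[0].header
    if header.get("CTYPE1") == "RAC--TAN":
        header["CTYPE1"] = "RA---TAN"
        hdul.flush()
```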

TomGlanzman commented 7 years ago

The instanceCatalogs for DC1 currently have no magnitude cut. The Twinkles project had a magnitude cut of 11. Bright sources have a significant impact on processing time. Should there be a magnitude cut for DC1 and, if so, at what magnitude?

TomGlanzman commented 7 years ago

The DC1 workflow currently uses phoSim v3.6.0. Seth has identified an interesting reason to move to the recently released v3.6.1. Are there any reasons not to upgrade?

egawiser commented 7 years ago

We had agreed to set a minimum magnitude of 10, i.e., to reset any brighter magnitudes by hand to 10.0 and record the positions and original magnitudes of those objects for possible masking after data reduction. See Issue #22.

TomGlanzman commented 7 years ago

Thanks @egawiser, I had not read that issue and so its recommendation has not been implemented in the workflow. (Are there other requirements lurking in other issues that I should know about??)

To @danielsf, is this adjustment and bookkeeping something that your catalog generation script can/should do? Or is the bookkeeping really necessary: can one not simply query fatboy with the appropriate visitID data to get all m<10 objects?

danielsf commented 7 years ago

The sims code can do what Eric described. I will update the script.

I will add the script to this repo so that we can keep track of all of the changes we are making (I had previously emailed the script to Tom, foolishly assuming that I would nail it on the first try). I'll ping you, Tom, when it is here.
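For illustration, a minimal sketch of the kind of adjustment Eric describes (not the actual sims script), assuming instance catalog "object" lines carry the magnitude in the fifth whitespace-delimited field, as in the example entry quoted later in this thread:

```python
def clamp_bright_objects(in_path, out_path, log_path, mag_floor=10.0):
    """Reset magnitudes brighter than mag_floor to mag_floor, recording the
    original positions and magnitudes for possible masking later."""
    with open(in_path) as src, open(out_path, "w") as dst, open(log_path, "w") as log:
        for line in src:
            fields = line.split()
            if fields and fields[0] == "object":
                obj_id, ra, dec, mag = fields[1], fields[2], fields[3], float(fields[4])
                if mag < mag_floor:
                    log.write(f"{obj_id} {ra} {dec} {mag}\n")
                    fields[4] = f"{mag_floor:.8f}"
                    line = " ".join(fields) + "\n"
            dst.write(line)

# Hypothetical usage:
# clamp_bright_objects("phosim_194112.txt", "phosim_194112_clamped.txt",
#                      "bright_objects_194112.txt")
```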

sethdigel commented 7 years ago

All of the DC1 visits that Tom started have either finished or been terminated (run out of time). For the 1630 sensor-visits that finished, here is a plot of the reported CPU time for the Raytrace step vs. the effective number of threads for the execution. The latter is defined as the ratio of the reported CPU time to the wall clock time. Each job ran with 8 requested threads. (One job reported a CPU time of -2 s and has been omitted.)

[Plot: show_threads]

The dashed line represents the wall clock time limit, which at least for some of these jobs was 30 hours.

Some groupings and trends are clear. The simulations included various combinations of Moon altitude and (as we now realize) bright stars. With some matching against minion metadata and the instance catalogs, it should be possible to figure out which combinations correspond to which regions.

The 'Effective # of Threads' is a kind of proxy for efficiency - the higher the better.

sethdigel commented 7 years ago

Tom's recent set of runs were for visits with obsHistIDs 194112, 194113, 194114, 194125, and 195754. As noted above, a number of jobs timed out, either because of bright stars or the Moon. The most successful, in terms of the number of sensor images completed, is 194112, which has 73 sensor images completed out of a nominal 119 (non-corner-raft sensors in Humna's list for this visit). Here's a crudely assembled image from the 73 CCDs.

[Image: assemble_194112]

I have not checked yet, but the blockiness of the missing sensor images is consistent with bright stars near the centers of the empty regions. (A bright star gets assigned to the trimcat files of a magnitude-dependent number of CCDs, not just the CCD it 'belongs to'.)

The image for sensor R42_S21 has an interesting feature in it - scattered light from a bright star? Here's an image for that sensor.

[Image: 194112_2_r42_s21_e000]

johnrpeterson commented 7 years ago

wow, that’s a pretty nice ghost on the second image. unfortunately, you might have to put in a magnitude cut and then won’t see these any more.

john


TomGlanzman commented 7 years ago

There has been ongoing discussion this past week about phoSim execution times. One example, run interactively at NERSC (sensor R01_S02, obsHistID 40336, 8 threads, stream 0.0.2) using the current instanceCatalog, took 53 hours to complete.

During this run, Mustafa produced a couple of plots of CPU utilization for the first couple of hours (time bins are 3 seconds):

[Plots: phosimcpuutilizationnersc, phosimcpuhistonersc]

Talking with John Peterson at this week's meeting, he suspected bright stars were responsible for the long execution time. Looking into the instanceCatalog, I found one star with m=8.24 and another with m=10.45. I then ran a second experiment on the same sensor/visit, but with all stars brighter than magnitude 11 removed. The execution time dropped to slightly over 6 hours.

SimonKrughoff commented 7 years ago

I don't think that's a ghost. I know that some ghosts are really sharp, but that doesn't look right to me. I also would expect other structure from a ghost that bright. @sethdigel can you point me to that image on NERSC?

tonytyson commented 7 years ago

I agree


sethdigel commented 7 years ago

@SimonKrughoff /global/projecta/projectdirs/lsst/production/DC1/DC1-phoSim-2/output/000030/lsst_e_194112_f2_R42_S21_E000.fits.gz

SimonKrughoff commented 7 years ago

O.K. I am more certain that this is not a ghost. Here's what I did:

  1. Cut out a section of the image where the ghost is basically along the columns [0:4000][0:1500]
  2. Column by column, calculate the standard deviation and the median of the column
  3. Plot (standard deviation)/sqrt(median) as a function of column. If the noise is Poisson, as it should be for both the ghost and the background, this should be one everywhere.

This is that plot, and it shows that the noise is ~4.5 times higher in the ghost than outside it. The median data value is higher in that range, but not nearly enough to account for the increase in noise. The line is at a value of one.

[Image: noise ratio vs. column]

I don't know how to track this down.
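For reference, a minimal numpy/astropy sketch of the per-column check described in the numbered list above; the cutout indices and row/column orientation are assumptions, and the file name follows the path given earlier in this thread:

```python
import numpy as np
from astropy.io import fits

# The eimage named earlier in this thread (adjust the path as needed).
path = "lsst_e_194112_f2_R42_S21_E000.fits.gz"
with fits.open(path) as hdul:
    image = hdul[0].data.astype(float)

# Cut out the region containing the ghost (assumed [rows, columns] order).
cutout = image[0:4000, 0:1500]

# Column-by-column statistics: for Poisson noise in photon counts the ratio
# stddev / sqrt(median) should be ~1 both inside and outside the ghost.
median = np.median(cutout, axis=0)
stddev = np.std(cutout, axis=0)
ratio = stddev / np.sqrt(np.maximum(median, 1.0))
```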

sethdigel commented 7 years ago

@SimonKrughoff The brightest star in the trimcat file for this sensor visit (/global/cscratch1/sd/desc/Pipeline-tasks/DC1-phoSim-2/000030/work/trimcatalog_194112_R42_S21.pars) is magnitude 7.8 and well off the footprint of the CCD. In the assembled image above, it would be in the (missing) image that is one sensor to the left and one down from this one (R42_S10).

SimonKrughoff commented 7 years ago

It's interesting that there is a bright star so close, but I still don't think it would be that. Is it possible it's an artifact of the checkpointing?

Is it crazy to try to just re-simulate it to see if this goes away?

TomGlanzman commented 7 years ago

No checkpointing at this time...

sethdigel commented 7 years ago

Here's a ds9 image for that sensor, with a coordinate overlay and the positions of the brightest stars in the trim file. (I'm not sure why the RA axis is increasing to the right. My assertion above that the brightest star is off to the lower left should have been that the star is off to the lower right.) To get ds9 to understand the coordinate specifications I changed CTYPE1 from RAC--TAN to RA---TAN.

[Image: 194112_2_r42_s21_e000_reg]

The seven brightest stars are indicated; their magnitudes range from 11.9 down to 7.8, with larger circles for brighter stars. I had thought that maybe the brightest star would be at the center of curvature of the inner edge of the ghost, but it is not, and I can't say that I really understand how the ghost image might be related to the star.

sethdigel commented 7 years ago

Sorry for the spam. The centroid file for this run says that the bright feature is made up of photons from the magnitude 7.8 star (which is well off the CCD). The ID of the star is 993060380676. The centroid file for the run (which is located in the same directory as the FITS image at NERSC) indicates that the image has 120M photons from this star, and the centroid position is basically at the center of the ghost.

johnrpeterson commented 7 years ago

all-

it's an intersection of 2 (maybe 3?) ghosts. see the plot below and the structures in the lower corners. you should probably be able to see the fainter non-overlapped ghosts if you adjust the contrast.

regards,

john



egawiser commented 7 years ago

I'm still spooked by Simon's noise analysis from yesterday, implying that the noise in the ghost is ~4X higher than expected from Poisson fluctuations. Is there a way to re-do that using small square regions instead of columns to check if it's really the case?

johnrpeterson commented 7 years ago

Eric-

i don’t think that’s possible given that ghosts are done one photon at a time, but please forward that info if that’s true.

note that i think the plan was to rerun with a magnitude cut of 11 in the interest of computing efficiency, which is going to remove all visible ghosts.

john


dkirkby commented 7 years ago

@egawiser Perhaps it's not increased (sky) noise but rather a ghost image of the star at very low surface brightness, which would have visible shot noise similar to the sky.

johnrpeterson commented 7 years ago

[Image: ghosts]

here is the image if you couldn't see it before

SimonKrughoff commented 7 years ago

@dkirkby the noise in the region with the ghost is significantly higher than the increase in flux from the scattered light would suggest.

@johnrpeterson is it possible this is a bug in the way photons are packaged up for bright stars?

egawiser commented 7 years ago

Simon, I think this is a really important thing to check. But I'm worried that your stddev along columns might be overestimating the true noise level in the ghost due to enough pixels in each column being outside the ghost. Hence my suggestion to check median value vs. stddev for small squares fully within the ghost rather than just columns. (I assume that values are in photons rather than ADU, since the latter would skew the expected ratio of stddev/sqrt(median_value) away from 1.)

SimonKrughoff commented 7 years ago

@egawiser I remade the plot taking 100x100 pixel boxes devoid of any obvious sources. I've overplotted them on the previous figure as red squares. I think this tells the same story as the previous plot.

[Image: noise ratio plot with the 100x100-pixel boxes overplotted as red squares]

sethdigel commented 7 years ago

Here's an image version of Simon's analysis. It shows the ratio of the standard deviation to the square root of the median value for a moving window of 51x51 pixels. The standard deviation (and median) were evaluated iteratively, removing >3 sigma outliers from the median after the first iteration, to (somewhat) suppress bright stars.

[Image: 194112_f2_r23_s12_e000_noise_ratio]

The ghost region stands out for its large variance. I think that Simon is probably right that this is related to the batching of photons from bright sources: if photons land in bunches of size g, the variance is g times the mean, so the stddev/sqrt(median) ratio is roughly sqrt(g). In this case the ratio suggests that the photons are in groups of ~15. If this is right I wouldn't say it is a bug, but it is a non-ideality.
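For reference, a rough sketch of a moving-window version of this statistic, assuming a 51x51 window evaluated on a coarse grid and a single 3-sigma clipping pass about the first-iteration median (parameter names are illustrative):

```python
import numpy as np

def noise_ratio_map(image, half=25, step=25, nsigma=3.0):
    """Map of clipped stddev / sqrt(median) over (2*half+1) x (2*half+1) windows."""
    ny, nx = image.shape
    ys = range(half, ny - half, step)
    xs = range(half, nx - half, step)
    out = np.full((len(ys), len(xs)), np.nan)
    for j, y in enumerate(ys):
        for i, x in enumerate(xs):
            win = image[y - half:y + half + 1, x - half:x + half + 1].ravel()
            med, sig = np.median(win), np.std(win)
            # One clipping pass about the first-iteration median to suppress stars.
            clipped = win[np.abs(win - med) < nsigma * sig]
            if clipped.size == 0:
                continue
            med = np.median(clipped)
            if med > 0:
                out[j, i] = np.std(clipped) / np.sqrt(med)
    return out
```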

sethdigel commented 7 years ago

For Hack Day I worked on compiling the metadata from the DC1 sensor visits (execution information from the workflow engine, observation metadata from the OpSim run, and brightest star from the trimcat files). Tom (with help from Brian) wrote a script to compile the information about the execution. The resulting csv file and some details are posted in Confluence here.

So far only 134 sensor visits have been completed (counting only the visits with stars and galaxies included). And these correspond to only four different visits (obsHistIDs 194112, 194113, 194114, 195754), so there's not a lot to study yet, but some correlations are clear.

Here's an updated version of a plot posted above, now color coded (red = Moon above the horizon) with symbol size indicating the magnitude of the brightest star (range 7.8-12.6).

[Plot: dc1_eff_cpu]

It shows how efficiently the 8 threads are being used in each run. The sensor visits that did not complete all crossed that dashed line. Some more details are in the Confluence posting. For DC1 this is sort of academic, because stars brighter than magnitude 10 will be limited to that magnitude.

johnrpeterson commented 7 years ago

On Feb 17, 2017, at 4:33 PM, SimonKrughoff wrote:

is it possible this is a bug in the way photons are packaged up for bright stars?

so the photons are only packaged up for the core of the bright star. that is perfectly fine because all of those will go into a bright bleed trail. so i don’t see how the ghost photons would be packaged.

one theory i just had is that these aren’t ghost photons after all. perhaps they are direct light that is just skimming past the edges of some of the mirrors/lenses along some obscure light path. that light maybe should have been baffled by phosim. then that might explain it.

to test this, could someone send me:

1) the 20 observing parameters in this catalog
2) the line in the catalog for the bright star
3) the name of the CCD where there is this feature

then we can dig in and look at the rays in detail.

i suppose this is all irrelevant for DC1, as this star is going to get removed, but it's good to look into this carefully for future runs with bright stars.

john

TomGlanzman commented 7 years ago

@johnrpeterson I can readily give you 2 out of 3:

1) observing parameters:

```
rightascension 95.1629994
declination -26.4636028
mjd 59840.3723246
altitude 59.0459284
azimuth 91.7742935
filter 2
rotskypos 18.1615025
camconfig 1
dist2moon 54.8439212
moonalt 25.0324115
moondec 27.8405922
moonphase 45.0855070
moonra 90.9648422
nsnap 1
obshistid 194112
rottelpos -87.1298638
seed 194112
seeing 0.5525290
sunalt -22.6750912
vistime 30.0000000
```

2) I've been redoing this visit with Scott's latest catalog generator (which reassigns stars with m>10 to m=10), so perhaps @sethdigel has that object handy? If not, let me know and I'll rerun everything without the magnitude cut.

3) R42_S21

For reference, this was obsHistID 194112 and workflow stream 30.19.7.

sethdigel commented 7 years ago

Here is the instance catalog entry for that star.

```
object 993060380676 97.3143044 -26.0114501 7.82037784 starSED/kurucz/km15_5000.fits_g00_5140.gz 0 0 0 0 0 0 point none CCM 0.199903184 3.1
```

johnrpeterson commented 7 years ago

Tom-

Why are you using a 30 second visit time for a single snap? This is twice as long as the nominal exposure.

john


cwwalter commented 7 years ago

We are doing one 30-second exposure instead of two 15-second snaps. Currently we aren't really set up (on the DM side) to deal with the two exposures yet.

TomGlanzman commented 7 years ago

Two news items.

First, workflow checkpointing is being tested/debugged at this time. Hopefully it will be available for DC1.

Second, another phoSim multithread timing test was performed on a randomly selected sensor visit. Only the raytrace step was monitored, running on an unloaded cori login node with up to 8 threads. Note that any stars brighter than magnitude 10 have been reset to m=10. Also note "moonalt 25.0324115" and "moonphase 45.0855070". The run required 16h 54m of elapsed time.

The following "Mustafa plots" show thread utilization as a function of time, and overall thread utilization for the job. Sampling time is every three seconds.

[Plot: 8-threads (thread utilization vs. time)]

There are extended periods of time where allocated threads are idle.

[Plot: 8-threads-overall]

The overall thread utilization is about 42% for this run.
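For reference, a minimal sketch of how such a thread-utilization trace could be sampled with psutil (illustrative only; not necessarily how the plots above were produced, and the PID is hypothetical):

```python
import time
import psutil

def sample_utilization(pid, n_threads=8, interval=3.0, duration=3600.0):
    """Sample the fraction of the allocated threads kept busy by a process,
    once every `interval` seconds, for up to `duration` seconds."""
    proc = psutil.Process(pid)
    proc.cpu_percent(None)                        # prime the counter
    samples = []
    t_end = time.time() + duration
    while time.time() < t_end:
        time.sleep(interval)
        try:
            busy = proc.cpu_percent(None) / 100.0  # ~ number of busy cores
        except psutil.NoSuchProcess:
            break
        samples.append(min(busy / n_threads, 1.0))
    return samples

# Hypothetical usage against a running raytrace process:
# util = sample_utilization(12345)
# print("overall thread utilization: {:.0f}%".format(100 * sum(util) / len(util)))
```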

Finally, here is the distribution of astrophysical object magnitudes for this run. (The histogram is binned by integer magnitude: bin 10 includes objects with 10 <= m < 11, and so on.) It confirms there are no objects with magnitude < 10, and only three objects with magnitude between 10 and 11, etc.

```
mag bin : count
 0-9    : 0
 10     : 3
 11     : 3
 12     : 7
 13     : 7
 14     : 14
 15     : 36
 16     : 55
 17     : 68
 18     : 89
 19     : 165
 20     : 295
 21     : 684
 22     : 1735
 23     : 3722
 24     : 6790
 25     : 11532
 26     : 15409
 27     : 12950
 28     : 5027
 29     : 2888
 30     : 1787
 31     : 576
 32     : 147
 33     : 38
 34     : 21
 35     : 3
 36     : 1
 37     : 1
 38     : 1
 39     : 0
```
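For illustration, a sketch of how such per-magnitude-bin counts could be tallied from an instance catalog, again assuming the magnitude sits in the fifth column of each "object" line:

```python
from collections import Counter

def magnitude_histogram(instance_catalog):
    """Count objects per integer magnitude bin (bin m holds m <= mag < m+1)."""
    counts = Counter()
    with open(instance_catalog) as src:
        for line in src:
            fields = line.split()
            if fields and fields[0] == "object":
                counts[int(float(fields[4]))] += 1
    return counts

# Hypothetical usage:
# for mag, n in sorted(magnitude_histogram("phosim_194112.txt").items()):
#     print("{:3d} : {}".format(mag, n))
```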

===================================

One conclusion is that even with multi-threading, phoSim runs continue to be quite long on the scale of batch queue limits (36-48h depending on number of nodes requested). All eight threads in this example were used for only 15% of the job's duration.

=================================

For reference, this test run was Visit 194112, sensor R42_S21, workflow stream 30.19.7

```
2017-02-21 17:12:02,048 INFO in runRaytrace.py line 13: Start
2017-02-22 10:06:09,772 INFO in runRaytrace.py line 97: All done
```

Elapsed time = 16.9 hours (16h 54m)

where the job ran: /global/cscratch1/sd/desc/Pipeline-tasks/DC1-phoSim-2/000030/work

command executed: /global/common/cori/contrib/lsst/phosim/v3.6/bin/raytrace < raytrace_194112_R42_S21_E000_0.pars

resultant log: /global/projecta/projectdirs/lsst/production/DC1/DC1-phoSim-2/logs/DC1-phoSim-2/0.105/trim/singleSensor/RunRaytrace/030/019/007/interactive.log