And to be very concrete, these are the science projects listed in the planning document that rely on the image simulations. We heard already from SN (@rbiswas4) that they are ok with Plan B. Strong lensing will not be affected by the WFD simulations. I will ask the conveners and DC2 project leads to add which of their projects will be affected by the different plans. I will do that on a few Slack channels. Sorry for the extra emails.
Galaxy clusters (CL):
Large-scale structure (LSS):
Weak lensing (WL):
Sensor Anomalies (SA):
Can someone please explain why we think we need so much more memory now? In particular, why can't we just increase nside still further? So far, we've only discussed using nside=16 and maybe 32. Even at nside=32, the pixels are 1.8 degrees across, so way bigger than a sensor.
It seems to me that if nside=16 leads to 10 GB memory requirements (according to Scott), then going to nside=64 still gives pixels larger than a sensor (almost a degree across), so we still only need at most 4 loaded at a time. The memory should then be down to around 1 GB or less, I suppose.
Is there some technical reason that this is not a tenable strategy?
I'd like this technical point to be addressed before we start weighing in on things that will affect the science we can do with DC2.
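For reference, here is the pixel-scale arithmetic behind those numbers; a quick sanity check in Python (healpy's `nside2resol` gives the same answer):

```python
import numpy as np

def healpix_pixel_scale_deg(nside):
    """Approximate healpix pixel width in degrees (sqrt of the pixel area)."""
    full_sky_deg2 = 4 * np.pi * (180 / np.pi) ** 2  # ~41253 deg^2
    npix = 12 * nside ** 2
    return np.sqrt(full_sky_deg2 / npix)

for nside in (16, 32, 64):
    print(nside, round(healpix_pixel_scale_deg(nside), 2))
# 16 3.66  -> pixels far bigger than a sensor
# 32 1.83  -> the "1.8 degrees across" quoted above
# 64 0.92  -> still almost a degree, so at most ~4 pixels cover a sensor
```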
In addition to Mike's question above on making the catalogs: @jchiang87 for the imSim side what is the main issue?
It sounds like with more depth in z and mag we have 10x the sources. Is the main problem that we have to store 10x the sources in memory, which slows things down and means using fewer cores, or just that we have to simulate that many more sources, which grows linearly in time?
For the first issue (memory) there should be solutions, correct? The "smash with log" way would be to leverage the checkpoint system and write the image in steps, adding to the previously produced FITS file each time. This makes the workflow harder.
A more elegant solution, which would probably take longer to implement, would be to use some sort of iterator strategy: as you go through the instance catalog generating sources, you keep adding to the list in memory as needed and removing the old ones, so that a fixed amount of memory is used (see the sketch below).
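A minimal sketch of that iterator idea (the phosim-style 'object' line format is an assumption for illustration; the point is just that a generator never holds the full catalog in memory):

```python
def iter_instance_catalog(path, keep):
    """Stream objects from an instance catalog one line at a time.

    Only objects passing the `keep` predicate are yielded, so the
    working set stays bounded no matter how large the catalog is.
    Assumes phosim-style lines: 'object <id> <ra> <dec> <magnorm> ...'
    """
    with open(path) as fh:
        for line in fh:
            tokens = line.split()
            if len(tokens) < 5 or tokens[0] != 'object':
                continue
            obj_id = tokens[1]
            ra, dec, magnorm = map(float, tokens[2:5])
            if keep(ra, dec):
                yield obj_id, ra, dec, magnorm

# e.g., only keep objects near a hypothetical sensor center (ra0, dec0):
# for obj in iter_instance_catalog('instcat.txt',
#                                  lambda ra, dec: abs(ra - ra0) < 0.2
#                                                  and abs(dec - dec0) < 0.2):
#     render(obj)
```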
Yes, the nside question is not the issue (we can make smaller pixels to make the instance catalog generation code happy), but Chris' questions are very relevant.
I just updated the first comment above regarding the likely memory hit per sensor with these larger catalogs. I haven't done a detailed check, but from monitoring top output, it seemed that the increase per sensor-visit was ~1.5-2GB, i.e., a factor of 2-3 increase, not 10. So memory-wise we are probably ok. The runtime increases I've seen were from running in low fidelity mode, where the sensor model (the most costly part of the simulation) was not enabled. In that mode, the runtime clearly scales linearly with the number of objects, not the number of photons, so we can't really use our previous scaling factors to estimate the extra runtime with the sensor model enabled. With the sensor model on, things will almost certainly scale with the number of photons, so it probably won't be so bad. I doubt it would be a factor of two in runtime.
The really dim objects will also have fewer photons. So I suppose the scaling will be dominated by the fixed per-object time for the SED calculations, etc., when we start adding lots of m > 27 objects.
Yes, that's right. The additional fixed cost for the full depth catalog is around 100 mins on Haswell. Applying our usual factor of 6 for KNL, that means ~10 hours, so not quite a factor of two. This doesn't yet include the per photon cost for the ~400k or so additional (fainter) objects. I have the per object photon counts in the log output. I'll post a number based on those data shortly.
I would like to propose that we divide Run 2.0i into Run 2.0i-shallow and Run 2.0i-deep. This is basically just fleshing out the plan suggested by @dkirkby and @salmanhabib earlier, and is a concrete version of Plan E.
Our allocation comes in a series of a few reservations. For many reasons, the sooner we can have a go with a full run on at least some of the sky area, the better. So for our first reservation this week, let's have a go at ~50 deg**2 using one of the two cuts quantified below. The results of this first run will constitute some or all of Run 2.0i-shallow. This will give us a concrete benchmark for choosing the depth of the catalog used in the remainder of Run 2.0i.
Below are two specific cuts I have experimented with for this purpose. First things first, here is a DESCQA plot showing the status of the luminosity function in the uncut mock. The plot label 24 < i < 27.5 shows the validation range of the test; fainter than this, the observational constraints rapidly degrade. For present discussion purposes, think of LF uncertainty as blending-rate uncertainty.
This next plot gives a visual of how the proposed cuts are made. It shows a simple sigmoid down-weighting such that all galaxies with r>~28 or 29 are thrown out; as we move to brighter magnitudes, we smoothly throw out fewer and fewer objects, until at i<~26.5 we keep all galaxies. This way, we still have plenty of contributions to the blending budget down to r<28-29, even though the blending rates in this regime may be smaller than in reality (by some amount that cannot be quantified with presently available data). The reduction factor in the size of the galaxy catalog is shown in the legend; a sketch of the down-sampling follows below.
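For concreteness, a minimal sketch of this kind of sigmoid down-sampling (the pivot and width here are illustrative placeholders, not the values actually used for the plots):

```python
import numpy as np

def sigmoid_downsample(mag_r, rng, mag_pivot=27.5, width=0.5):
    """Keep bright galaxies with probability ~1, smoothly thinning the
    sample toward zero beyond mag_pivot instead of a sharp cut."""
    p_keep = 1.0 / (1.0 + np.exp((mag_r - mag_pivot) / width))
    return rng.uniform(size=mag_r.size) < p_keep

rng = np.random.default_rng(42)
mags = rng.uniform(20.0, 30.0, size=1_000_000)  # toy magnitude distribution
mask = sigmoid_downsample(mags, rng)
print("catalog reduction factor:", mags.size / mask.sum())
```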
This next plot shows the stellar masses and redshifts of the galaxies that would be thrown out.
Here is the impact on the luminosity function:
Here is the impact on the dn/dmag validation test:
I'm confused. From what Jim said, it doesn't seem like we have any reason to think that A isn't completely feasible. I think A should be the baseline plan until we find out that there is definitely some problem with it.
If we need to (or if we are concerned about finishing A), we can start with either less than the full 10 years or less than the full area (i.e., plan C). Then if we run out of allocation before finishing, we will still have something coherent to work with and can fill in the remaining time or area with a second allocation.
But I'm not in favor of reducing the catalog depth without very serious consultation with the science teams. Everyone already signed off on the full depth. Springing this depth reduction on them at the last minute is pretty unfair and is unlikely to be received well. My take is that such depth reduction will adversely affect almost all the science projects for CL, LSS, and WL at least to some extent. (Indeed to what extent is explicitly part of many of these projects -- seeing how undetected objects affect various measurements.)
It is much harder to add back the removed depth later, as compared to adding back missing area or time. Therefore, IMO, plan E (or B or D) should only be done as a desperate measure if it's clear that we will be unable to do anything at all with the current allocation without trimming the input catalog. Fortunately, this appears not to be the case.
@rmjarvis:
I think A should be the baseline plan until we find out that there is definitely some problem with it.
I definitely agree that we should continue to wait for people to weigh in. The default plan is to prepare the data for Plan A, which is exactly what @patricialarsen is currently doing.
But I'm not in favor of reducing the catalog depth without very serious consultation with the science teams. Everyone already signed off on the full depth. Springing this depth reduction on them at the last minute is pretty unfair and is unlikely to be received well.
I have quite a different viewpoint here. The analysis WGs have expressed many priorities for this mock over the last year in the form of validation tests. The HSC dn/dmag test is the primary contribution we got for how to handle the faint end. There have not been any requests at all for the abundance of galaxies at these depths.
My take is that such depth reduction will adversely affect almost all the science projects for CL, LSS, and WL at least to some extent. (Indeed to what extent is explicitly part of many of these projects -- seeing how undetected objects affect various measurements.)
I have quite a different take here as well. My motivation for making the M* vs. redshift scatter plot was to show the physical galaxies that are excluded as a result of these cuts. The CL, LSS, and WL science targets have very little contact with galaxies of mass 10^6-10^8. I keep a pretty close eye on the list of proposed DC2 projects on Confluence, and I do not see any proposed analyses that would be impacted by how galaxies of this mass are modeled.
The ratio of the total number of photons for the 40k brightest objects to that for the remaining objects is ~170 for the test run I did. That means the extra time for drawing the photons for the added faint objects should be <1% of the Run1.2i time (i.e., <6 mins on KNL). So the added simulation time is essentially just the sum of the fixed costs per object, i.e., the ~10 hour estimate I noted above. I'll double check the per sensor-visit memory usage with a full fidelity run, so it will be a little while before I have reliable numbers.
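Spelling out that estimate with the numbers quoted in this thread (illustrative arithmetic only):

```python
# Photon-drawing time scales with photon count, so if the added faint
# objects carry ~1/170 of the photons, the extra draw time is small
# compared to the ~10 hr Run1.2i sensor-visit runtime on KNL:
run12i_minutes = 10 * 60       # ~10 hr on KNL
photon_ratio = 1.0 / 170.0     # faint objects' share of the photons
print(run12i_minutes * photon_ratio)  # ~3.5 min, consistent with "<6 mins"
```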
I can adjust the bundling algorithm depending on which of these options we choose to go with. But to give a rough estimate for node requirements, we can currently pack up to 10 visits per node, under the assumption that each imSim container costs 10 GB of memory and each thread needs 1 GB of memory.
So: let's assume 6 GB gets added to the imSim container (the instance catalog doubles, with no reduction from changes of precision and the like) and we go up to 2 GB of memory required per thread (as a worst-case scenario). We then go from safely packing 10 imSim containers onto a node down to 4. This probably costs us quite a bit in efficiency, though I don't have hard numbers to support that.
On the flip side, if the instance catalog loses even 3 GB of size (and we go back to our 1 GB per thread assumption), we can easily pack around 18 imSim containers on one node, which makes us more robust to jobs with a small number of remaining sensors.
In any case, the bundler takes both the threads per node and the instances per node as arguments, so that part of the workflow will operate just fine either way. But I think the efficiency argument is worth considering; a rough sketch of the packing arithmetic is below.
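A rough sketch of the packing arithmetic being described (the node memory and threads-per-container values are placeholders I chose for illustration, not measured numbers):

```python
def containers_per_node(node_gb, container_gb, threads_per_container, gb_per_thread):
    """How many imSim containers fit on one node for a given memory budget."""
    per_container = container_gb + threads_per_container * gb_per_thread
    return int(node_gb // per_container)

# Worst case above: 10 GB container + 6 GB growth, 2 GB per thread.
print(containers_per_node(node_gb=128, container_gb=16,
                          threads_per_container=8, gb_per_thread=2))  # -> 4
```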
It looks like the memory per sensor-visit will indeed be right around 2GB (with a ~30 min transient state before any objects are drawn, during which the memory increases by 0.4GB while the sky background is being rendered). The memory for each ImageSimulator instance will probably go down by 3GB (from ~10GB) with the new release using galsim v2.0. I'll verify that when I do the image comparisons against the new release.
In the case that the ImageSimulator instance goes down to 3GB (even with the memory per sensor-visit being up), we should not be particularly limited on the workflow side, at least.
Jim, just to make sure everyone is on the same page for the workflow strategy: will it "go down by 3GB" as in your note (from 10 to 7) or "down to 3GB" as in Antonio's note?
All: I just came online and found this and the previous discussion. I realize that several of the people in this thread (Jim and others) are still assessing the performance impact of the new catalog, so the estimates related to Plan A may still change. But let me add a few comments:
The WL, CL, LSS analyses generally depend on having a fairly realistic galaxy abundance below the depth of the samples they want to use for science, so as to include a realistic level of blending challenge (as was articulated among their original DC2 science goals). To give an example of why this is important for WL and CL: we already know that metacalibration, which is a candidate LSST shear algorithm that will be tested on DC2, can perform essentially at the level of precision we need when it is used to analyze isolated galaxies. This is known from some simple and not-so-simple simulations and from work on DES data. Exercising metacalibration in a simulation with realistic levels of blending is therefore the next step in getting it ready for LSST, by quantifying what blending does and testing schemes for mitigating blending effects on shear. And we already know from HSC and other surveys that many blending challenges arise from blend systems consisting of the galaxies used for science together with those that might not make it into the sample but are detected, or are just below the detection threshold.
However, the key question I would ask is whether this means 2 mags below the LSST detection limit or 2 mags below the limits of the shallower samples we plan to use for science? The difference between "2 mags fainter than the LSST gold sample" and "2 mags fainter than the LSST 10-year coadd detection threshold" is of course substantial. Comments, @egawiser @dkirkby @rmjarvis @esheldon @anjavdl @idellant @erozo ? If this is going to take some back-and-forth to understand, I am happy to set up a slack channel for that purpose (better than GitHub for this kind of thing). Making a substantial reduction in the input catalog requires a focused discussion of the impact on this science goal.
I would like to note that we previously had a discussion of possible descopes, admittedly in the context of Run 1.2p rather than Run 2.0 (the decision was later made not to descope, so the issue was closed, but we did gather input from all analysis WGs at the time re: fewer years vs. less area; see #145), and the preference was for a survey with fewer years rather than smaller area. In some cases this question was answered in a way that suggested it was true for all of DC2 in general, rather than just Run 1.2. I point this out because that thread contains feedback relevant to plan A (assuming it might not finish to Y10 depth) and plan C (suggesting a preference for fewer years rather than less area). Based on that feedback, I would suggest that if we cannot achieve convergence on the question of cutting the input catalog within the coming days, and if we definitely cannot go to Y10 with the current input catalog, then the default with the least science impact would be to simulate fewer years. This statement assumes we'd get to ~Y3-Y5, which as I understand it would be feasible in plan A. If it's shallower than that, then I would not rely on this being OK, and again more discussion is needed.
will it "go down by 3GB" as in your note (from 10 to 7) or "down to 3GB" as in Antonio's note?
The atmospheric screen memory appears to be reduced by 3GB in the new release. However, note that the x10 larger instance catalogs for cosmoDC2 will probably eradicate that 3GB gain and make things even worse, something I hadn't appreciated until just now. How much worse is a little unclear to me at the moment.
However, note that the x10 larger instance catalogs for cosmoDC2 will probably eradicate that 3GB gain and make things even worse,
I still don't understand why this is necessarily true. Can't we just use a larger nside? Katrin said this wasn't relevant, but didn't explain. Can someone please clear this up?
I think the nside issue is related to instance catalog generation. The memory problem is with actually using the catalogs in imSim, since the processes need to hold the object list in memory.
But if nside is larger, isn't there less to hold in memory? Or is it already only holding in memory the portion of the instance catalog that overlaps the sensor?
nside has nothing to do with the size of the instance catalogs. Those are generated for the entire FOV and then read into imSim. We could generate them for parts of the FOV, e.g., per sensor, but that's not currently how the code is structured: back when a full FOV of instance catalog objects was 10 times smaller, reading the whole FOV into memory wasn't a big deal. Now it probably is, so we'd need to change the code to accommodate this factor of 10. In any case, nside isn't relevant in the imSim context.
I see. I hadn't understood how the object catalog for a single sensor could possibly be a memory concern, so I assumed you must be reading in catalogs in these healpix pixels that people were talking about. But if it's the full FOV, I can see how that could be problematic.
It seems like it wouldn't be that hard to write separate instance catalogs for each sensor rather than for each full focal plane. Or alternatively to cull the input catalog immediately after reading in to only hold the portion that is relevant for the sensor being rendered. Are either of these possible to do on the time scale of Wednesday?
It seems like it wouldn't be that hard to write separate instance catalogs for each sensor rather than for each full focal plane. Or alternatively to cull the input catalog immediately after reading in to only hold the portion that is relevant for the sensor being rendered. Are either of these possible to do on the time scale of Wednesday?
I would think so, though fixing this particular issue isn't likely to be the hardest thing to do before Wednesday.
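A hedged sketch of the second option, culling a freshly read full-FOV object list down to a conservative circle around one sensor (the 0.3 deg radius is a made-up conservative value; a real cut would use the camera geometry):

```python
import numpy as np

def cull_to_sensor(ra, dec, sensor_ra, sensor_dec, radius_deg=0.3):
    """Boolean mask for objects within a small-angle circle of the sensor
    center; everything else can be dropped from memory immediately."""
    dra = (ra - sensor_ra) * np.cos(np.radians(sensor_dec))
    ddec = dec - sensor_dec
    return np.hypot(dra, ddec) < radius_deg

# mask = cull_to_sensor(ra, dec, sensor_ra, sensor_dec)
# ra, dec = ra[mask], dec[mask]   # only a small fraction of the FOV survives
```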
All - based on my comments on science impact of trimming the catalogs (https://github.com/LSSTDESC/DC2-production/issues/263#issuecomment-419686410), I'd like to suggest a more limited and slightly modified set of scenarios compared to Katrin's A-E. Again I realize the situation with respect to her plan A is still evolving, so further technical and science discussion is needed:
Plan A: identical to Katrin's Plan A.
Plan B+: similar to Katrin's plan B, but r<28 rather than r<27. The reason for this is outlined in my comment above: there are high-priority science cases that clearly require an input catalog that is 2 mags deeper than the nominal WL and LSS samples. Those go to i<~25.3, so 2 mags deeper is i<~27.3, which is r<~28 given typical galaxy colors. Going shallower than this would seriously impact long-planned science projects.
What I am proposing with plan B+ is essentially that we can enable the high-priority science projects already defined for WL, LSS, and CL by including realistic levels of blending with/around the nominal samples we planned to use, while abandoning the option of using DC2 to investigate more permissive sample selections (because if we go to a deeper sample, we no longer have galaxies in the input catalog that are >2 mag deeper than that more permissive sample). Investigating more permissive sample selection in a realistic way would then be deferred to DC3. This compromise would let us work within the DC2 resource limitations while enabling the planned science projects, without providing as much scope as originally planned for additional projects on sample selection beyond that. Adopting Plan B+ requires some discussion; I am setting up a slack channel for this and will invite some people to comment on science impact within the coming day.
Note, slack channel for discussion of the science impact of Plan B+ is #desc-dc2-catlimits ; it's a public channel and anyone is welcome to join. The discussion will be summarized back here, but this way there won't be tons of e-mail notifications for all the inevitable back-and-forth.
It turns out that the inference I made above, that the fixed cost per object is the same for the low fidelity sims (sensor model disabled, double Gaussian PSF; hereafter "fast_sims") and the full fidelity sims ("slow_sims"), was not correct. The full fidelity sims do several additional calculations that are independent of the number of photons being drawn, e.g., computing the postage stamp regions over which the object is drawn, constructing samplers of the SED and incident angles for surface operations, etc.
The cost of those additional operations per object amounts to something like a factor of ~7 wrt the fast_sims case. Here's a plot of the elapsed runtime vs. the number of objects drawn for the single sensor test runs I did yesterday, using an instance catalog created with the cosmoDC2 0.4 catalog (~460k objects):
This means that instead of the ~10-12 hour runtimes for Run1.2i on KNL, we are looking at ~70-80 hour runtimes for Run2.0i.
I wonder if there aren't ways to really speed up some of the fixed-cost steps that are now becoming dominant. For example, for the m>27 case maybe we can just know what the size of the postage stamp should be. Do we have a new profile?
I'm profiling the code now using a pared-down instance catalog with just these faint objects.
I ran 5901 of these faint objects with the sensor model enabled (but using the Kolmogorov PSF, to avoid having the initial atmospheric screen calculations dominate the runtimes for this many objects; I also disabled the sky background rendering for the same reason). Here are the top 20 time-sorted profile entries:
In [3]: slow_stats.sort_stats('time').print_stats(20)
Sun Sep 9 15:58:00 2018 slow_sim.prof
546529588 function calls (526197265 primitive calls) in 573.040 seconds
Ordered by: internal time
List reduced from 11807 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
85492212 83.674 0.000 92.626 0.000 /global/common/software/lsst/common/miniconda/py3-4.3.21-env/lib/python3.6/site-packages/numpy/lib/npyio.py:721(floatconv)
11346 50.758 0.004 301.951 0.027 /global/common/software/lsst/common/miniconda/py3-4.3.21-env/lib/python3.6/site-packages/numpy/lib/npyio.py:748(loadtxt)
35942058/17945526 31.640 0.000 34.078 0.000 /global/common/software/lsst/common/miniconda/py3-4.3.21-env/lib/python3.6/site-packages/numpy/lib/npyio.py:935(pack_items)
45368 30.291 0.001 30.291 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/gsobject.py:666(_xValue)
17956908 25.218 0.000 80.284 0.000 /global/common/software/lsst/common/miniconda/py3-4.3.21-env/lib/python3.6/site-packages/numpy/lib/npyio.py:951(split_line)
17945526 23.925 0.000 116.551 0.000 /global/common/software/lsst/common/miniconda/py3-4.3.21-env/lib/python3.6/site-packages/numpy/lib/npyio.py:1024(<listcomp>)
1439944 18.714 0.000 19.446 0.000 {built-in method numpy.core.multiarray.array}
35937124 17.720 0.000 31.298 0.000 /global/common/software/lsst/common/miniconda/py3-4.3.21-env/lib/python3.6/site-packages/numpy/compat/py3k.py:32(asbytes)
21529 15.782 0.001 17.054 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/table.py:87(__init__)
17961342 13.251 0.000 13.251 0.000 {method 'split' of '_sre.SRE_Pattern' objects}
41158812/41145794 9.041 0.000 10.432 0.000 {built-in method builtins.isinstance}
85492215 8.952 0.000 8.952 0.000 {method 'lower' of 'bytes' objects}
11802 7.283 0.001 7.384 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_photUtils/2.9.0.sims/python/lsst/sims/photUtils/Sed.py:956(addCCMDust)
11694 7.239 0.001 7.301 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/table.py:163(__call__)
20381207 7.214 0.000 7.214 0.000 {method 'split' of 'bytes' objects}
5671 6.995 0.001 9.727 0.002 /global/cscratch1/sd/jchiang8/imsim_pipeline/sims_GalSimInterface/python/lsst/sims/GalSimInterface/galSimDetector.py:30(__init__)
17982359 6.500 0.000 6.500 0.000 {method 'encode' of 'str' objects}
17820 6.317 0.000 6.317 0.000 {palpy.mappa}
1506 5.368 0.004 5.368 0.004 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/sensor.py:247(accumulate)
40825313 4.920 0.000 4.920 0.000 {method 'append' of 'list' objects}
This only took 10 mins, so I'm doing a longer run with x6 objects.
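For anyone wanting to reproduce these tables: the listings here are standard cProfile/pstats output. A minimal recipe (the simulate_sensor function is a stand-in workload, not the real entry point):

```python
import cProfile
import pstats

def simulate_sensor():
    # stand-in workload for the real per-sensor simulation
    sum(i * i for i in range(10**6))

cProfile.run('simulate_sensor()', 'slow_sim.prof')   # write stats to disk
stats = pstats.Stats('slow_sim.prof')
stats.sort_stats('time').print_stats(20)             # the view shown above
```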
Implementing Mike's suggestion to cache the SiliconSensor objects, we've reduced the runtime for these faint objects by ~58%. Here's the revised profile (for ~23k objects):
In [9]: new_stats.sort_stats('time').print_stats(20)
Sun Sep 9 17:43:40 2018 slow_sim_40k_sensor_cache.prof
215901925 function calls (207843695 primitive calls) in 913.135 seconds
Ordered by: internal time
List reduced from 11818 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
180168 123.694 0.001 123.694 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/gsobject.py:666(_xValue)
72877 68.041 0.001 73.077 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/table.py:87(__init__)
59632 38.489 0.001 38.794 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/table.py:163(__call__)
46752 28.914 0.001 29.267 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_photUtils/2.9.0.sims/python/lsst/sims/photUtils/Sed.py:956(addCCMDust)
22521 26.874 0.001 34.217 0.002 /global/cscratch1/sd/jchiang8/imsim_pipeline/sims_GalSimInterface/python/lsst/sims/GalSimInterface/galSimDetector.py:30(__init__)
5526170 24.939 0.000 27.978 0.000 {built-in method numpy.core.multiarray.array}
68370 23.706 0.000 23.706 0.000 {palpy.mappa}
22520 16.579 0.001 16.579 0.001 /global/cscratch1/sd/jchiang8/imsim_pipeline/sims_GalSimInterface/python/lsst/sims/GalSimInterface/galSimInterpreter.py:151(<listcomp>)
22520 16.116 0.001 16.116 0.001 /global/cscratch1/sd/jchiang8/imsim_pipeline/sims_GalSimInterface/python/lsst/sims/GalSimInterface/galSimInterpreter.py:152(<listcomp>)
679850 15.251 0.000 79.044 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_utils/2.9.0.sims/python/lsst/sims/utils/ZernikeModule.py:232(evaluate_xy)
206 13.951 0.068 34.480 0.167 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/CameraUtils.py:555(pixelCoordsFromPupilCoords)
68372 13.517 0.000 13.517 0.000 {palpy.aoppa}
45041 12.078 0.000 119.470 0.003 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/LsstCameraUtils.py:665(pixelCoordsFromPupilCoordsLSST)
67961 11.472 0.000 91.819 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/LsstZernikeFitter.py:286(_apply_transformation)
454640 10.955 0.000 27.727 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_utils/2.9.0.sims/python/lsst/sims/utils/ZernikeModule.py:136(_evaluate_radial_array)
97962 10.771 0.000 10.771 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/gsobject.py:2049(shoot)
205 10.082 0.049 10.082 0.049 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/CameraUtils.py:610(<listcomp>)
22520 9.907 0.000 9.907 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/LsstCameraUtils.py:747(<listcomp>)
45040 9.531 0.000 9.531 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/table.py:207(getArgs)
679850 9.081 0.000 48.834 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_utils/2.9.0.sims/python/lsst/sims/utils/ZernikeModule.py:196(evaluate)
Using a fixed stamp size for objects with flux < 10 gets us an additional ~10% speed-up. The profile with these two changes applied looks like:
In [16]: new_stats.sort_stats('time').print_stats(20)
Sun Sep 9 18:22:10 2018 slow_sim_40k_stamp_fix.prof
210748426 function calls (202690252 primitive calls) in 804.147 seconds
Ordered by: internal time
List reduced from 11818 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
72877 69.075 0.001 74.337 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/table.py:87(__init__)
59632 38.703 0.001 39.012 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/table.py:163(__call__)
46752 29.366 0.001 29.721 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_photUtils/2.9.0.sims/python/lsst/sims/photUtils/Sed.py:956(addCCMDust)
22521 27.911 0.001 34.995 0.002 /global/cscratch1/sd/jchiang8/imsim_pipeline/sims_GalSimInterface/python/lsst/sims/GalSimInterface/galSimDetector.py:30(__init__)
5301130 24.465 0.000 27.465 0.000 {built-in method numpy.core.multiarray.array}
68370 24.453 0.000 24.453 0.000 {palpy.mappa}
22520 17.792 0.001 17.792 0.001 /global/cscratch1/sd/jchiang8/imsim_pipeline/sims_GalSimInterface/python/lsst/sims/GalSimInterface/galSimInterpreter.py:152(<listcomp>)
22520 17.646 0.001 17.646 0.001 /global/cscratch1/sd/jchiang8/imsim_pipeline/sims_GalSimInterface/python/lsst/sims/GalSimInterface/galSimInterpreter.py:151(<listcomp>)
679850 14.601 0.000 74.982 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_utils/2.9.0.sims/python/lsst/sims/utils/ZernikeModule.py:232(evaluate_xy)
206 14.062 0.068 34.867 0.169 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/CameraUtils.py:555(pixelCoordsFromPupilCoords)
68372 13.686 0.000 13.686 0.000 {palpy.aoppa}
45041 12.414 0.000 119.126 0.003 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/LsstCameraUtils.py:665(pixelCoordsFromPupilCoordsLSST)
67961 11.630 0.000 87.877 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/LsstZernikeFitter.py:286(_apply_transformation)
97962 10.782 0.000 10.782 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/gsobject.py:2049(shoot)
454640 10.420 0.000 26.118 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_utils/2.9.0.sims/python/lsst/sims/utils/ZernikeModule.py:136(_evaluate_radial_array)
205 10.272 0.050 10.272 0.050 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/CameraUtils.py:610(<listcomp>)
22520 10.143 0.000 10.143 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/LsstCameraUtils.py:747(<listcomp>)
45040 9.948 0.000 9.948 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/table.py:207(getArgs)
418/269 9.798 0.023 10.533 0.039 {built-in method _imp.create_dynamic}
22520 8.796 0.000 8.796 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/CameraUtils.py:989(<listcomp>)
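The SiliconSensor caching above is conceptually just memoization: build each expensive sensor object once per configuration and reuse it across objects. A hedged sketch of the pattern (the cache key and factory are invented for illustration, not imSim's actual code):

```python
_sensor_cache = {}

def get_sensor(key, factory):
    """Return the cached sensor for `key`, constructing it with `factory`
    only on first use; repeated construction per object was a dominant
    fixed cost before this change."""
    if key not in _sensor_cache:
        _sensor_cache[key] = factory()
    return _sensor_cache[key]

# sensor = get_sensor(('R22_S11', treering_params),
#                     lambda: galsim.SiliconSensor(...))  # hypothetical usage
```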
@jchiang87 thanks for these extremely helpful profiles and updates! Did I understand correctly from your discussion with @rmjarvis above from yesterday that the current implementation of imsim uses an instance catalog that corresponds to the entire FOV, even though we only simulate one sensor at a time? If so, does the code perform all of the time-limiting per-object calculations on every object in that instance catalog? And if so, wouldn't even a very rough culling down to a circle of conservative radius around the sensor pointing center get us a savings of order 100X, given that the FOV has 189 sensors? Apologies if I'm missing something basic, but otherwise this would strike me as a promising place to improve what sounds like an inefficient use of CPU time and memory.
The code ingests the data from the instance catalog but defers all of the time-intensive computation until the simulation for any given sensor begins. I actually spent a lot of time getting the memory and cpu usage to be as manageable as possible where it made the most difference for the Run1.2i-sized instance catalogs. The original code was structured to read in the instance catalog for the entire FOV because it was meant to take the same inputs as phosim. Phosim does the per-sensor trimming by writing the sensor-sized catalogs to disk, whereas for imSim that was done in memory. When the object lists were x10 smaller that was manageable, i.e., a few GB total. Multiply by 10 for Run2.0, and 30GB becomes important. In any case, 100X is not in the cards.
FITS images before the changes and after the sensor caching: lsst_e_219976_R22_S11_r.fits.gz lsst_e_219976_R22_S11_r_sensor_cache.fits.gz
With @rmjarvis' guidance, I think we've managed to speed things up by a factor of 3 to 4 in the faint object rendering. Here's the main PR with the changes, and here's the profile output after these changes:
stats.sort_stats('time').print_stats(20)
Mon Sep 10 18:36:03 2018 slow_sim_gs_bandpass.prof
130735048 function calls (127073037 primitive calls) in 535.756 seconds
Ordered by: internal time
List reduced from 11804 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
36984 38.416 0.001 41.072 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/table.py:87(__init__)
46752 29.479 0.001 29.839 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_photUtils/2.9.0.sims/python/lsst/sims/photUtils/Sed.py:956(addCCMDust)
36981 26.584 0.001 26.788 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/table.py:163(__call__)
22519 16.364 0.001 16.364 0.001 /global/cscratch1/sd/jchiang8/imsim_pipeline/sims_GalSimInterface/python/lsst/sims/GalSimInterface/galSimInterpreter.py:151(<listcomp>)
22519 16.275 0.001 16.275 0.001 /global/cscratch1/sd/jchiang8/imsim_pipeline/sims_GalSimInterface/python/lsst/sims/GalSimInterface/galSimInterpreter.py:152(<listcomp>)
3096661 14.632 0.000 15.875 0.000 {built-in method numpy.core.multiarray.array}
206 14.487 0.070 35.931 0.174 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/CameraUtils.py:555(pixelCoordsFromPupilCoords)
9246 10.898 0.001 14.588 0.002 /global/cscratch1/sd/jchiang8/imsim_pipeline/sims_GalSimInterface/python/lsst/sims/GalSimInterface/galSimDetector.py:30(__init__)
31765 10.871 0.000 103.761 0.003 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/LsstCameraUtils.py:665(pixelCoordsFromPupilCoordsLSST)
205 10.623 0.052 10.623 0.052 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/CameraUtils.py:610(<listcomp>)
414340 10.597 0.000 53.758 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_utils/2.9.0.sims/python/lsst/sims/utils/ZernikeModule.py:232(evaluate_xy)
97863 10.322 0.000 10.322 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/gsobject.py:2049(shoot)
22519 10.107 0.000 10.107 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/LsstCameraUtils.py:747(<listcomp>)
28545 10.038 0.000 10.038 0.000 {palpy.mappa}
321880 9.009 0.000 20.977 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_utils/2.9.0.sims/python/lsst/sims/utils/ZernikeModule.py:136(_evaluate_radial_array)
41410 8.537 0.000 63.184 0.002 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/LsstZernikeFitter.py:286(_apply_transformation)
22519 8.206 0.000 8.206 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/CameraUtils.py:989(<listcomp>)
207 7.580 0.037 7.580 0.037 {built-in method numpy.core.multiarray._vec_string}
22519 7.454 0.000 7.454 0.000 /global/cscratch1/sd/jchiang8/imsim_pipeline/sims_GalSimInterface/python/lsst/sims/GalSimInterface/galSimDetector.py:359(<listcomp>)
31764 6.994 0.000 111.381 0.004 /global/cscratch1/sd/jchiang8/imsim_pipeline/sims_GalSimInterface/python/lsst/sims/GalSimInterface/galSimCameraWrapper.py:509(pixelCoordsFromPupilCoords)
I'll redo the full sensor-visit to check the final speedup.
Since this issue may be a more useful reference than the #desc-dc2-catlimits slack channel, I would like to record the outcome of that discussion. I'll do this in the form of a short set of bullets with links to the associated slack comment for those who want more detail:
The outcome is that we will go with proposal A (or effectively C-): keep the catalogs as-is, and simulate as many years of Run 2.0i WFD as we can given the available resources (3-5 years depending on how things go). This choice was enabled in part by multiple optimizations described in this GitHub thread, see Jim Chiang's comments above.
Qualitative science justification: Mike Jarvis provided an expanded version of the science justification for wanting objects below the detection limit for WL here; Ian Dell'Antonio responded immediately after that with the justification for CL. Both were based on previous studies that justify the need for an input catalog below LSST depth, but with some uncertainty as to how far below is needed. Michael Wood-Vasey produced an NSF-style broader impact statement beyond WL and CL here.
Quantitative science justification: There was agreement that we'd like to quantify the needed input catalog depth better in the future, but we don't have the tools to do so now. (DC2 might become a way to do that, feeding into future simulation efforts, which should consider the input catalog when evaluating resource constraints at an earlier stage of the process if possible.)
Additional nuances: Anze commented on area vs. depth tradeoffs and the connection between input catalog depth vs. how many years of survey are simulated (among other issues) here. Multiple people (Andrew Hearin, Eric Gawiser, and others) commented on the potential uncertainties in the simulated faint galaxy population starting around here.
Sanity check of faint galaxy number counts: I provided a comparison with published number counts from pencil-beam surveys, concluding as follows: "when compared with very deep pencil-beam surveys in i-band, the cosmoDC2 cumulative number counts agree with those deep pencil-beam surveys to ~10% at I<26.5 and I<28 (better than I had hoped for!). In z-band, the agreement seems to be 10% at z<24 and more like 50% at z<28." See this post for obnoxious amounts of detail and references.
@katrinheitmann - I am closing this issue, but feel free to re-open if needed.
For those interested in the final summary of the code changes to speed up the runtime, I have a comment here in the imSim repo.
While this issue was closed a while ago, the question was revisited later and the conclusion stated above (to not make any cuts) was overturned. As discussed in the DC2 telecon on Nov. 14, the final decision was to make a cut at r~29 (https://docs.google.com/presentation/d/1FekyEB5lEMy7m8Jst4_N5pK-a_iWsL-Po3Wx7E9wA-E/edit#slide=id.g47b4924efc_0_0). This was implemented in Run 2.1i and is the cut used in the DC2 image simulations. However, cosmoDC2_440 has no cuts, so the first part of plan A still holds.
All, the discussion in issue #261 has gotten quite far off from the topic of that issue (checklist for cosmoDC2) and I would like to put together a list of options here that we have to consider to make progress. Please feel free to edit the notes below, in particular @jchiang87, @aphearin, and @villarrealas
Plan A: Keep the full area and depth that we currently have in cosmoDC2
Plan B: Make an r-band cut at ~27
Plan C: Reduce the footprint from 300 sq deg to ?? or reduce from 10 years to ?? years
Plan D: Go with Plan B and at a later stage add smaller areas at full depth. Everything else is the same as in Plan B, but we might take a lesser hit on the science.
Plan E: Make an r-band cut at ~28 and downsample the number of galaxies at r > 28 to get back a factor of 5 in the memory requirement
Option E was just suggested by Andrew H. and seems interesting.
As pointed out, in order to make a decision, we need to (a) get a clearer idea of the resource requirements if we go with anything but Plan B, and (b) understand which projects on our list will be affected by any plan that is not Plan A.