And to be very concrete, these are the science projects listed in the planning document that rely on the image simulations. We heard already from SN (@rbiswas4) that they are ok with Plan B. Strong lensing will not be affected by the WFD simulations. I will ask the conveners and DC2 project leads to add which of their projects will be affected by the different plans. I will do that on a few Slack channels. Sorry for the extra emails.
Galaxy clusters (CL):
Large-scale structure (LSS):
Weak lensing (WL):
Sensor Anomalies (SA):
Can someone please explain why we think we need so much more memory now? In particular, why can't we just increase nside still further? So far, we've only discussed using nside=16 and maybe 32. Even at nside=32, the pixels are 1.8 degrees across, so way bigger than a sensor.
It seems to me that if nside=16 leads to 10 GB memory requirements (according to Scott), then going to nside=64 still gives pixels larger than a sensor (almost a degree across), so we still only need at most 4 loaded at a time. The memory should then be down to around 1 GB or less, I suppose.
Is there some technical reason that this is not a tenable strategy?
I'd like this technical point to be addressed before we start weighing in on things that will affect the science we can do with DC2.
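For reference, here is the pixel-scale arithmetic behind those numbers; a quick sanity check in Python (healpy's `nside2resol` gives the same answer):

```python
import numpy as np

def healpix_pixel_scale_deg(nside):
    """Approximate healpix pixel width in degrees (sqrt of the pixel area)."""
    full_sky_deg2 = 4 * np.pi * (180 / np.pi) ** 2  # ~41253 deg^2
    npix = 12 * nside ** 2
    return np.sqrt(full_sky_deg2 / npix)

for nside in (16, 32, 64):
    print(nside, round(healpix_pixel_scale_deg(nside), 2))
# 16 3.66  -> pixels far bigger than a sensor
# 32 1.83  -> the "1.8 degrees across" quoted above
# 64 0.92  -> still almost a degree, so at most ~4 pixels cover a sensor
```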
In addition to Mike's question above on making the catalogs: @jchiang87 for the imSim side what is the main issue?
It sounds like with more depth in z and mag we have 10x the sources. Is the main problem that we have to store 10x the sources in memory, which slows things down and means using fewer cores, or just that we have to simulate that many more sources, which grows linearly in time?
For the first issue (memory) there should be solutions, correct? The "smash with log" way would be to leverage the checkpoint system and write the image in steps, adding to the previously produced FITS file each time. This makes the workflow harder.
A more elegant solution, which would probably take longer to implement, would be to use some sort of iterator strategy: as you go through the instance catalog generating sources, you keep adding to the list in memory as needed and removing the old ones, so that a fixed amount of memory is used (see the sketch below).
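A minimal sketch of that iterator idea (the phosim-style 'object' line format is an assumption for illustration; the point is just that a generator never holds the full catalog in memory):

```python
def iter_instance_catalog(path, keep):
    """Stream objects from an instance catalog one line at a time.

    Only objects passing the `keep` predicate are yielded, so the
    working set stays bounded no matter how large the catalog is.
    Assumes phosim-style lines: 'object <id> <ra> <dec> <magnorm> ...'
    """
    with open(path) as fh:
        for line in fh:
            tokens = line.split()
            if len(tokens) < 5 or tokens[0] != 'object':
                continue
            obj_id = tokens[1]
            ra, dec, magnorm = map(float, tokens[2:5])
            if keep(ra, dec):
                yield obj_id, ra, dec, magnorm

# e.g., only keep objects near a hypothetical sensor center (ra0, dec0):
# for obj in iter_instance_catalog('instcat.txt',
#                                  lambda ra, dec: abs(ra - ra0) < 0.2
#                                                  and abs(dec - dec0) < 0.2):
#     render(obj)
```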
Yes, the nside question is not the issue (we can make smaller pixels to make the instance catalog generation code happy), but Chris' questions are very relevant.
I just updated the first comment above regarding the likely memory hit per sensor with these larger catalogs. I haven't done a detailed check, but from monitoring top output, it seemed that the increase per sensor-visit was ~1.5-2GB, i.e., a factor of 2-3 increase, not 10. So memory-wise we are probably ok. The runtime increases I've seen were from running in low fidelity mode, where the sensor model (the most costly part of the simulation) was not enabled. In that mode, the runtime clearly scales linearly with the number of objects, not the number of photons, so we can't really use our previous scaling factors to estimate the extra runtime with the sensor model enabled. With the sensor model on, things will almost certainly scale with the number of photons, so it probably won't be so bad. I doubt it would be a factor of two in runtime.
The really dim objects will also have fewer photons. So I suppose the scaling will be dominated by the fixed per-object time for the SED calculations, etc., when we start adding lots of m > 27 objects.
Yes, that's right. The additional fixed cost for the full depth catalog is around 100 mins on Haswell. Applying our usual factor of 6 for KNL, that means ~10 hours, so not quite a factor of two. This doesn't yet include the per photon cost for the ~400k or so additional (fainter) objects. I have the per object photon counts in the log output. I'll post a number based on those data shortly.
I would like to propose that we divide Run 2.0i into Run 2.0i-shallow and Run 2.0i-deep. This is basically just fleshing out the plan suggested by @dkirkby and @salmanhabib earlier, and is a concrete version of Plan E.
Our allocation comes in a series of a few reservations. For many reasons, the sooner we can have a go with a full run on at least some of the sky area, the better. So for our first reservation this week, let's have a go at ~50 deg**2 using one of the two cuts quantified below. The results of this first run will constitute some or all of Run 2.0i-shallow. This will give us a concrete benchmark for choosing the depth of the catalog used in the remainder of Run 2.0i.
Below are two specific cuts I have experimented with for this purpose. First things first, here is a DESCQA plot showing the status of the luminosity function in the uncut mock. The plot label 24 < i < 27.5 shows the validation range of the test; fainter than this, the observational constraints rapidly degrade. For present discussion purposes, think of LF uncertainty as blending-rate uncertainty.
This next plot gives a visual of how the proposed cuts are made. It shows a simple sigmoid down-weighting such that all galaxies with r>~28 or 29 are thrown out; as we move to brighter magnitudes, we smoothly throw out fewer and fewer objects, until at i<~26.5 we keep all galaxies. This way, we still have plenty of contributions to the blending budget down to r<28-29, even though the blending rates in this regime may be smaller than in reality (by some amount that cannot be quantified with presently available data). The reduction factor in the size of the galaxy catalog is shown in the legend; a sketch of the down-sampling follows below.
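For concreteness, a minimal sketch of this kind of sigmoid down-sampling (the pivot and width here are illustrative placeholders, not the values actually used for the plots):

```python
import numpy as np

def sigmoid_downsample(mag_r, rng, mag_pivot=27.5, width=0.5):
    """Keep bright galaxies with probability ~1, smoothly thinning the
    sample toward zero beyond mag_pivot instead of a sharp cut."""
    p_keep = 1.0 / (1.0 + np.exp((mag_r - mag_pivot) / width))
    return rng.uniform(size=mag_r.size) < p_keep

rng = np.random.default_rng(42)
mags = rng.uniform(20.0, 30.0, size=1_000_000)  # toy magnitude distribution
mask = sigmoid_downsample(mags, rng)
print("catalog reduction factor:", mags.size / mask.sum())
```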
This next plot shows the stellar masses and redshifts of the galaxies that would be thrown out.
Here is the impact on the luminosity function:
Here is the impact on the dn/dmag validation test:
I'm confused. From what Jim said, it doesn't seem like we have any reason to think that A isn't completely feasible. I think A should be the baseline plan until we find out that there is definitely some problem with it.
If we need to (or if we are concerned about finishing A), we can start with either less than the full 10 years or less than the full area (i.e., plan C). Then if we run out of allocation before finishing, we will still have something coherent to work with and can fill in the remaining time or area with a second allocation.
But I'm not in favor of reducing the catalog depth without very serious consultation with the science teams. Everyone already signed off on the full depth. Springing this depth reduction on them at the last minute is pretty unfair and is unlikely to be received well. My take is that such depth reduction will adversely affect almost all the science projects for CL, LSS, and WL at least to some extent. (Indeed to what extent is explicitly part of many of these projects -- seeing how undetected objects affect various measurements.)
It is much harder to add back the removed depth later, as compared to adding back missing area or time. Therefore, IMO, plan E (or B or D) should only be done as a desperate measure if it's clear that we will be unable to do anything at all with the current allocation without trimming the input catalog. Fortunately, this appears not to be the case.
@rmjarvis:
I think A should be the baseline plan until we find out that there is definitely some problem with it.
I definitely agree that we should continue to wait for people to weigh in. The default plan is to prepare the data for Plan A, which is exactly what @patricialarsen is currently doing.
But I'm not in favor of reducing the catalog depth without very serious consultation with the science teams. Everyone already signed off on the full depth. Springing this depth reduction on them at the last minute is pretty unfair and is unlikely to be received well.
I have quite a different viewpoint here. The analysis WGs have expressed many priorities for this mock over the last year in the form of validation tests. The HSC dn/dmag test is the primary contribution we got for how to handle the faint end. There have not been any requests at all for the abundance of galaxies at these depths.
My take is that such depth reduction will adversely affect almost all the science projects for CL, LSS, and WL at least to some extent. (Indeed to what extent is explicitly part of many of these projects -- seeing how undetected objects affect various measurements.)
I have quite a different take here as well. My motivation for making the M* vs. redshift scatter plot was to show the physical galaxies that are excluded as a result of these cuts. The CL, LSS, and WL science targets have very little contact with galaxies of mass 10^6-10^8. I keep a pretty close eye on the list of proposed DC2 projects on Confluence, and I do not see any proposed analyses that would be impacted by how galaxies of this mass are modeled.
The ratio of the total number of photons for the 40k brightest objects to that for the remaining objects is ~170 for the test run I did. That means the extra time for drawing the photons for the added faint objects should be <1% of the Run1.2i time (i.e., <6 mins on KNL). So the added simulation time is essentially just the sum of the fixed costs per object, i.e., the ~10 hour estimate I noted above. I'll double check the per sensor-visit memory usage with a full fidelity run, so it will be a little while before I have reliable numbers.
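Spelling out that estimate with the numbers quoted in this thread (illustrative arithmetic only):

```python
# Photon-drawing time scales with photon count, so if the added faint
# objects carry ~1/170 of the photons, the extra draw time is small
# compared to the ~10 hr Run1.2i sensor-visit runtime on KNL:
run12i_minutes = 10 * 60       # ~10 hr on KNL
photon_ratio = 1.0 / 170.0     # faint objects' share of the photons
print(run12i_minutes * photon_ratio)  # ~3.5 min, consistent with "<6 mins"
```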
I can adjust the bundling algorithm depending on which of these options we choose to go with. But to give a rough estimate for node requirements, we can currently pack up to 10 visits per node, under the assumption that each imSim container costs 10 GB of memory and each thread needs 1 GB of memory.
So: let's assume 6 GB gets added to the imSim container (the instance catalog doubles, with no reduction from changes of precision and the like) and we go up to 2 GB of memory required per thread (as a worst-case scenario). We then go from safely packing 10 imSim containers onto a node down to 4. This probably costs us quite a bit in efficiency, though I don't have hard numbers to support that.
On the flip side, if the instance catalog loses even 3 GB of size (and we go back to our 1 GB per thread assumption), we can easily pack around 18 imSim containers on one node, which makes us more robust to jobs with a small number of remaining sensors.
In any case, the bundler takes both the threads per node and the instances per node as arguments, so that part of the workflow will operate just fine either way. But I think the efficiency argument is worth considering; a rough sketch of the packing arithmetic is below.
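A rough sketch of the packing arithmetic being described (the node memory and threads-per-container values are placeholders I chose for illustration, not measured numbers):

```python
def containers_per_node(node_gb, container_gb, threads_per_container, gb_per_thread):
    """How many imSim containers fit on one node for a given memory budget."""
    per_container = container_gb + threads_per_container * gb_per_thread
    return int(node_gb // per_container)

# Worst case above: 10 GB container + 6 GB growth, 2 GB per thread.
print(containers_per_node(node_gb=128, container_gb=16,
                          threads_per_container=8, gb_per_thread=2))  # -> 4
```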
It looks like the memory per sensor-visit will indeed be right around 2GB (with a ~30 min transient state before any objects are drawn, during which the memory increases by 0.4GB while the sky background is being rendered). The memory for each ImageSimulator instance will probably go down by 3GB (from ~10GB) with the new release using galsim v2.0. I'll verify that when I do the image comparisons against the new release.
In the case that the ImageSimulator instance goes down to 3GB (even with the memory per sensor-visit being up), we should not be particularly limited on the workflow side, at least.
Jim, just to make sure everyone is on the same page for the workflow strategy: will it "go down by 3GB" as in your note (from 10 to 7) or "down to 3GB" as in Antonio's note?
All: I just came online and found this and the previous discussion. I realize that several of the people in this thread (Jim and others) are still assessing the performance impact of the new catalog, so the estimates related to Plan A may still change. But let me add a few comments:
The WL, CL, LSS analyses generally depend on having a fairly realistic galaxy abundance below the depth of the samples they want to use for science, so as to include a realistic level of blending challenge (as was articulated among their original DC2 science goals). To give an example of why this is important for WL and CL: we already know that metacalibration, which is a candidate LSST shear algorithm that will be tested on DC2, can perform essentially at the level of precision we need when it is used to analyze isolated galaxies. This is known from some simple and not-so-simple simulations and from work on DES data. Exercising metacalibration in a simulation with realistic levels of blending is therefore the next step in getting it ready for LSST, by quantifying what blending does and testing schemes for mitigating blending effects on shear. And we already know from HSC and other surveys that many blending challenges arise from blend systems consisting of the galaxies used for science together with those that might not make it into the sample but are detected, or are just below the detection threshold.
However, the key question I would ask is whether this means 2 mags below the LSST detection limit or 2 mags below the limits of the shallower samples we plan to use for science? The difference between "2 mags fainter than the LSST gold sample" and "2 mags fainter than the LSST 10-year coadd detection threshold" is of course substantial. Comments, @egawiser @dkirkby @rmjarvis @esheldon @anjavdl @idellant @erozo ? If this is going to take some back-and-forth to understand, I am happy to set up a slack channel for that purpose (better than GitHub for this kind of thing). Making a substantial reduction in the input catalog requires a focused discussion of the impact on this science goal.
I would like to note that we previously had a discussion of possible descopes, admittedly in the context of Run 1.2p rather than Run 2.0 (the decision was later made not to descope, so the issue was closed, but we did gather input from all analysis WGs at the time re: fewer years vs. less area; see #145), and the preference was for a survey with fewer years rather than smaller area. In some cases this question was answered in a way that suggested it was true for all of DC2 in general, rather than just Run 1.2. I point this out because that thread contains feedback relevant to plan A (assuming it might not finish to Y10 depth) and plan C (suggesting a preference for fewer years rather than less area). Based on that feedback, I would suggest that if we cannot achieve convergence on the question of cutting the input catalog within the coming days, and if we definitely cannot go to Y10 with the current input catalog, then the default with the least science impact would be to simulate fewer years. This statement assumes we'd get to ~Y3-Y5, which as I understand it would be feasible in plan A. If it's shallower than that, then I would not rely on this being OK, and again more discussion is needed.
will it "go down by 3GB" as in your note (from 10 to 7) or "down to 3GB" as in Antonio's note?
The atmospheric screen memory appears to be reduced by 3GB in the new release. However, note that the x10 larger instance catalogs for cosmoDC2 will probably eradicate that 3GB gain and make things even worse, something I hadn't appreciated until just now. How much worse is a little unclear to me at the moment.
However, note that the x10 larger instance catalogs for cosmoDC2 will probably eradicate that 3GB gain and make things even worse,
I still don't understand why this is necessarily true. Can't we just use a larger nside? Katrin said this wasn't relevant, but didn't explain. Can someone please clear this up?
I think the nside issue is related to instance catalog generation. The memory problem is with actually using the catalogs in imSim, since the processes need to hold the object list in memory.
But if nside is larger, isn't there less to hold in memory? Or is it already only holding in memory the portion of the instance catalog that overlaps the sensor?
nside has nothing to do with the size of the instance catalogs. Those are generated for the entire FOV and then read into imSim. We could generate them for parts of the FOV, e.g., per sensor, but that's not currently how the code is structured: back when a full FOV of instance catalog objects was 10 times smaller, reading the whole FOV into memory wasn't a big deal. Now it probably is, so we'd need to change the code to accommodate this factor of 10. In any case, nside isn't relevant in the imSim context.
I see. I hadn't understood how the object catalog for a single sensor could possibly be a memory concern, so I assumed you must be reading in catalogs in these healpix pixels that people were talking about. But if it's the full FOV, I can see how that could be problematic.
It seems like it wouldn't be that hard to write separate instance catalogs for each sensor rather than for each full focal plane. Or alternatively to cull the input catalog immediately after reading in to only hold the portion that is relevant for the sensor being rendered. Are either of these possible to do on the time scale of Wednesday?
It seems like it wouldn't be that hard to write separate instance catalogs for each sensor rather than for each full focal plane. Or alternatively to cull the input catalog immediately after reading in to only hold the portion that is relevant for the sensor being rendered. Are either of these possible to do on the time scale of Wednesday?
I would think so, though fixing this particular issue isn't likely to be the hardest thing to do before Wednesday.
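A hedged sketch of the second option, culling a freshly read full-FOV object list down to a conservative circle around one sensor (the 0.3 deg radius is a made-up conservative value; a real cut would use the camera geometry):

```python
import numpy as np

def cull_to_sensor(ra, dec, sensor_ra, sensor_dec, radius_deg=0.3):
    """Boolean mask for objects within a small-angle circle of the sensor
    center; everything else can be dropped from memory immediately."""
    dra = (ra - sensor_ra) * np.cos(np.radians(sensor_dec))
    ddec = dec - sensor_dec
    return np.hypot(dra, ddec) < radius_deg

# mask = cull_to_sensor(ra, dec, sensor_ra, sensor_dec)
# ra, dec = ra[mask], dec[mask]   # only a small fraction of the FOV survives
```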
All - based on my comments on science impact of trimming the catalogs (https://github.com/LSSTDESC/DC2-production/issues/263#issuecomment-419686410), I'd like to suggest a more limited and slightly modified set of scenarios compared to Katrin's A-E. Again I realize the situation with respect to her plan A is still evolving, so further technical and science discussion is needed:
Plan A: identical to Katrin's Plan A.
Plan B+: similar to Katrin's plan B, but r<28 rather than r<27. The reason for this is outlined in my comment above: there are high-priority science cases that clearly require an input catalog that is 2 mags deeper than the nominal WL and LSS samples. Those go to i<~25.3, so 2 mags deeper is i<~27.3, which is r<~28 given typical galaxy colors. Going shallower than this would seriously impact long-planned science projects.
What I am proposing with plan B+ is essentially that we can enable the high-priority science projects already defined for WL, LSS, and CL by including realistic levels of blending with/around the nominal samples we planned to use, while abandoning the option of using DC2 to investigate more permissive sample selections (because if we go to a deeper sample, we no longer have galaxies in the input catalog that are >2 mag deeper than that more permissive sample). Investigating more permissive sample selection in a realistic way would then be deferred to DC3. This compromise would let us work within the DC2 resource limitations while enabling the planned science projects, without providing as much scope as originally planned for additional projects on sample selection beyond that. Adopting Plan B+ requires some discussion; I am setting up a slack channel for this and will invite some people to comment on science impact within the coming day.
Note, slack channel for discussion of the science impact of Plan B+ is #desc-dc2-catlimits ; it's a public channel and anyone is welcome to join. The discussion will be summarized back here, but this way there won't be tons of e-mail notifications for all the inevitable back-and-forth.
It turns out that the inference I made above, that the fixed cost per object is the same for the low fidelity sims (sensor model disabled, double Gaussian PSF; hereafter "fast_sims") and the full fidelity sims ("slow_sims"), was not correct. The full fidelity sims do several additional calculations that are independent of the number of photons being drawn, e.g., computing the postage stamp regions over which the object is drawn, constructing samplers of the SED and incident angles for surface operations, etc.
The cost of those additional operations per object amounts to something like a factor of ~7 wrt the fast_sims case. Here's a plot of the elapsed runtime vs. the number of objects drawn for the single sensor test runs I did yesterday, using an instance catalog created with the cosmoDC2 0.4 catalog (~460k objects):
This means that instead of the ~10-12 hour runtimes for Run1.2i on KNL, we are looking at ~70-80 hour runtimes for Run2.0i.
I wonder if there aren't ways to really speed up some of the fixed-cost steps that are now becoming dominant. For example, for the m>27 case maybe we can just know what the size of the postage stamp should be. Do we have a new profile?
I'm profiling the code now using a pared-down instance catalog with just these faint objects.
I ran 5901 of these faint objects with the sensor model enabled (but using the Kolmogorov PSF, to avoid having the initial atmospheric screen calculations dominate the runtimes for this many objects; I also disabled the sky background rendering for the same reason). Here are the top 20 time-sorted profile entries:
In [3]: slow_stats.sort_stats('time').print_stats(20)
Sun Sep 9 15:58:00 2018 slow_sim.prof
546529588 function calls (526197265 primitive calls) in 573.040 seconds
Ordered by: internal time
List reduced from 11807 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
85492212 83.674 0.000 92.626 0.000 /global/common/software/lsst/common/miniconda/py3-4.3.21-env/lib/python3.6/site-packages/numpy/lib/npyio.py:721(floatconv)
11346 50.758 0.004 301.951 0.027 /global/common/software/lsst/common/miniconda/py3-4.3.21-env/lib/python3.6/site-packages/numpy/lib/npyio.py:748(loadtxt)
35942058/17945526 31.640 0.000 34.078 0.000 /global/common/software/lsst/common/miniconda/py3-4.3.21-env/lib/python3.6/site-packages/numpy/lib/npyio.py:935(pack_items)
45368 30.291 0.001 30.291 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/gsobject.py:666(_xValue)
17956908 25.218 0.000 80.284 0.000 /global/common/software/lsst/common/miniconda/py3-4.3.21-env/lib/python3.6/site-packages/numpy/lib/npyio.py:951(split_line)
17945526 23.925 0.000 116.551 0.000 /global/common/software/lsst/common/miniconda/py3-4.3.21-env/lib/python3.6/site-packages/numpy/lib/npyio.py:1024(<listcomp>)
1439944 18.714 0.000 19.446 0.000 {built-in method numpy.core.multiarray.array}
35937124 17.720 0.000 31.298 0.000 /global/common/software/lsst/common/miniconda/py3-4.3.21-env/lib/python3.6/site-packages/numpy/compat/py3k.py:32(asbytes)
21529 15.782 0.001 17.054 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/table.py:87(__init__)
17961342 13.251 0.000 13.251 0.000 {method 'split' of '_sre.SRE_Pattern' objects}
41158812/41145794 9.041 0.000 10.432 0.000 {built-in method builtins.isinstance}
85492215 8.952 0.000 8.952 0.000 {method 'lower' of 'bytes' objects}
11802 7.283 0.001 7.384 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_photUtils/2.9.0.sims/python/lsst/sims/photUtils/Sed.py:956(addCCMDust)
11694 7.239 0.001 7.301 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/table.py:163(__call__)
20381207 7.214 0.000 7.214 0.000 {method 'split' of 'bytes' objects}
5671 6.995 0.001 9.727 0.002 /global/cscratch1/sd/jchiang8/imsim_pipeline/sims_GalSimInterface/python/lsst/sims/GalSimInterface/galSimDetector.py:30(__init__)
17982359 6.500 0.000 6.500 0.000 {method 'encode' of 'str' objects}
17820 6.317 0.000 6.317 0.000 {palpy.mappa}
1506 5.368 0.004 5.368 0.004 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/sensor.py:247(accumulate)
40825313 4.920 0.000 4.920 0.000 {method 'append' of 'list' objects}
This only took 10 mins, so I'm doing a longer run with x6 objects.
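For anyone wanting to reproduce these tables: the listings here are standard cProfile/pstats output. A minimal recipe (the simulate_sensor function is a stand-in workload, not the real entry point):

```python
import cProfile
import pstats

def simulate_sensor():
    # stand-in workload for the real per-sensor simulation
    sum(i * i for i in range(10**6))

cProfile.run('simulate_sensor()', 'slow_sim.prof')   # write stats to disk
stats = pstats.Stats('slow_sim.prof')
stats.sort_stats('time').print_stats(20)             # the view shown above
```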
Implementing Mike's suggestion to cache the SiliconSensor objects, we've reduced the runtime for these faint objects by ~58%. Here's the revised profile (for ~23k objects):
In [9]: new_stats.sort_stats('time').print_stats(20)
Sun Sep 9 17:43:40 2018 slow_sim_40k_sensor_cache.prof
215901925 function calls (207843695 primitive calls) in 913.135 seconds
Ordered by: internal time
List reduced from 11818 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
180168 123.694 0.001 123.694 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/gsobject.py:666(_xValue)
72877 68.041 0.001 73.077 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/table.py:87(__init__)
59632 38.489 0.001 38.794 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/table.py:163(__call__)
46752 28.914 0.001 29.267 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_photUtils/2.9.0.sims/python/lsst/sims/photUtils/Sed.py:956(addCCMDust)
22521 26.874 0.001 34.217 0.002 /global/cscratch1/sd/jchiang8/imsim_pipeline/sims_GalSimInterface/python/lsst/sims/GalSimInterface/galSimDetector.py:30(__init__)
5526170 24.939 0.000 27.978 0.000 {built-in method numpy.core.multiarray.array}
68370 23.706 0.000 23.706 0.000 {palpy.mappa}
22520 16.579 0.001 16.579 0.001 /global/cscratch1/sd/jchiang8/imsim_pipeline/sims_GalSimInterface/python/lsst/sims/GalSimInterface/galSimInterpreter.py:151(<listcomp>)
22520 16.116 0.001 16.116 0.001 /global/cscratch1/sd/jchiang8/imsim_pipeline/sims_GalSimInterface/python/lsst/sims/GalSimInterface/galSimInterpreter.py:152(<listcomp>)
679850 15.251 0.000 79.044 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_utils/2.9.0.sims/python/lsst/sims/utils/ZernikeModule.py:232(evaluate_xy)
206 13.951 0.068 34.480 0.167 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/CameraUtils.py:555(pixelCoordsFromPupilCoords)
68372 13.517 0.000 13.517 0.000 {palpy.aoppa}
45041 12.078 0.000 119.470 0.003 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/LsstCameraUtils.py:665(pixelCoordsFromPupilCoordsLSST)
67961 11.472 0.000 91.819 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/LsstZernikeFitter.py:286(_apply_transformation)
454640 10.955 0.000 27.727 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_utils/2.9.0.sims/python/lsst/sims/utils/ZernikeModule.py:136(_evaluate_radial_array)
97962 10.771 0.000 10.771 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/gsobject.py:2049(shoot)
205 10.082 0.049 10.082 0.049 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/CameraUtils.py:610(<listcomp>)
22520 9.907 0.000 9.907 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/LsstCameraUtils.py:747(<listcomp>)
45040 9.531 0.000 9.531 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/table.py:207(getArgs)
679850 9.081 0.000 48.834 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_utils/2.9.0.sims/python/lsst/sims/utils/ZernikeModule.py:196(evaluate)
Using a fixed stamp size for objects with flux < 10 gets us an additional ~10% speed-up. The profile with these two changes applied looks like:
In [16]: new_stats.sort_stats('time').print_stats(20)
Sun Sep 9 18:22:10 2018 slow_sim_40k_stamp_fix.prof
210748426 function calls (202690252 primitive calls) in 804.147 seconds
Ordered by: internal time
List reduced from 11818 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
72877 69.075 0.001 74.337 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/table.py:87(__init__)
59632 38.703 0.001 39.012 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/table.py:163(__call__)
46752 29.366 0.001 29.721 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_photUtils/2.9.0.sims/python/lsst/sims/photUtils/Sed.py:956(addCCMDust)
22521 27.911 0.001 34.995 0.002 /global/cscratch1/sd/jchiang8/imsim_pipeline/sims_GalSimInterface/python/lsst/sims/GalSimInterface/galSimDetector.py:30(__init__)
5301130 24.465 0.000 27.465 0.000 {built-in method numpy.core.multiarray.array}
68370 24.453 0.000 24.453 0.000 {palpy.mappa}
22520 17.792 0.001 17.792 0.001 /global/cscratch1/sd/jchiang8/imsim_pipeline/sims_GalSimInterface/python/lsst/sims/GalSimInterface/galSimInterpreter.py:152(<listcomp>)
22520 17.646 0.001 17.646 0.001 /global/cscratch1/sd/jchiang8/imsim_pipeline/sims_GalSimInterface/python/lsst/sims/GalSimInterface/galSimInterpreter.py:151(<listcomp>)
679850 14.601 0.000 74.982 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_utils/2.9.0.sims/python/lsst/sims/utils/ZernikeModule.py:232(evaluate_xy)
206 14.062 0.068 34.867 0.169 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/CameraUtils.py:555(pixelCoordsFromPupilCoords)
68372 13.686 0.000 13.686 0.000 {palpy.aoppa}
45041 12.414 0.000 119.126 0.003 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/LsstCameraUtils.py:665(pixelCoordsFromPupilCoordsLSST)
67961 11.630 0.000 87.877 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/LsstZernikeFitter.py:286(_apply_transformation)
97962 10.782 0.000 10.782 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/gsobject.py:2049(shoot)
454640 10.420 0.000 26.118 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_utils/2.9.0.sims/python/lsst/sims/utils/ZernikeModule.py:136(_evaluate_radial_array)
205 10.272 0.050 10.272 0.050 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/CameraUtils.py:610(<listcomp>)
22520 10.143 0.000 10.143 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/LsstCameraUtils.py:747(<listcomp>)
45040 9.948 0.000 9.948 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/table.py:207(getArgs)
418/269 9.798 0.023 10.533 0.039 {built-in method _imp.create_dynamic}
22520 8.796 0.000 8.796 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/CameraUtils.py:989(<listcomp>)
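The SiliconSensor caching above is conceptually just memoization: build each expensive sensor object once per configuration and reuse it across objects. A hedged sketch of the pattern (the cache key and factory are invented for illustration, not imSim's actual code):

```python
_sensor_cache = {}

def get_sensor(key, factory):
    """Return the cached sensor for `key`, constructing it with `factory`
    only on first use; repeated construction per object was a dominant
    fixed cost before this change."""
    if key not in _sensor_cache:
        _sensor_cache[key] = factory()
    return _sensor_cache[key]

# sensor = get_sensor(('R22_S11', treering_params),
#                     lambda: galsim.SiliconSensor(...))  # hypothetical usage
```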
@jchiang87 thanks for these extremely helpful profiles and updates! Did I understand correctly from your discussion with @rmjarvis above from yesterday that the current implementation of imsim uses an instance catalog that corresponds to the entire FOV, even though we only simulate one sensor at a time? If so, does the code perform all of the time-limiting per-object calculations on every object in that instance catalog? And if so, wouldn't even a very rough culling down to a circle of conservative radius around the sensor pointing center get us a savings of order 100X, given that the FOV has 189 sensors? Apologies if I'm missing something basic, but otherwise this would strike me as a promising place to improve what sounds like an inefficient use of CPU time and memory.
The code ingests the data from the instance catalog but defers all of the time-intensive computation until the simulation for any given sensor begins. I actually spent a lot of time getting the memory and cpu usage to be as manageable as possible where it made the most difference for the Run1.2i-sized instance catalogs. The original code was structured to read in the instance catalog for the entire FOV because it was meant to take the same inputs as phosim. Phosim does the per-sensor trimming by writing the sensor-sized catalogs to disk, whereas for imSim that was done in memory. When the object lists were x10 smaller that was manageable, i.e., a few GB total. Multiply by 10 for Run2.0, and 30GB becomes important. In any case, 100X is not in the cards.
FITS images before the changes and after the sensor caching: lsst_e_219976_R22_S11_r.fits.gz lsst_e_219976_R22_S11_r_sensor_cache.fits.gz
With @rmjarvis' guidance, I think we've managed to speed things up by a factor of 3 to 4 in the faint object rendering. Here's the main PR with the changes, and here's the profile output after these changes:
stats.sort_stats('time').print_stats(20)
Mon Sep 10 18:36:03 2018 slow_sim_gs_bandpass.prof
130735048 function calls (127073037 primitive calls) in 535.756 seconds
Ordered by: internal time
List reduced from 11804 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
36984 38.416 0.001 41.072 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/table.py:87(__init__)
46752 29.479 0.001 29.839 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_photUtils/2.9.0.sims/python/lsst/sims/photUtils/Sed.py:956(addCCMDust)
36981 26.584 0.001 26.788 0.001 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/table.py:163(__call__)
22519 16.364 0.001 16.364 0.001 /global/cscratch1/sd/jchiang8/imsim_pipeline/sims_GalSimInterface/python/lsst/sims/GalSimInterface/galSimInterpreter.py:151(<listcomp>)
22519 16.275 0.001 16.275 0.001 /global/cscratch1/sd/jchiang8/imsim_pipeline/sims_GalSimInterface/python/lsst/sims/GalSimInterface/galSimInterpreter.py:152(<listcomp>)
3096661 14.632 0.000 15.875 0.000 {built-in method numpy.core.multiarray.array}
206 14.487 0.070 35.931 0.174 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/CameraUtils.py:555(pixelCoordsFromPupilCoords)
9246 10.898 0.001 14.588 0.002 /global/cscratch1/sd/jchiang8/imsim_pipeline/sims_GalSimInterface/python/lsst/sims/GalSimInterface/galSimDetector.py:30(__init__)
31765 10.871 0.000 103.761 0.003 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/LsstCameraUtils.py:665(pixelCoordsFromPupilCoordsLSST)
205 10.623 0.052 10.623 0.052 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/CameraUtils.py:610(<listcomp>)
414340 10.597 0.000 53.758 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_utils/2.9.0.sims/python/lsst/sims/utils/ZernikeModule.py:232(evaluate_xy)
97863 10.322 0.000 10.322 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/galsim/1.6.0.lsst/lib/python/galsim/gsobject.py:2049(shoot)
22519 10.107 0.000 10.107 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/LsstCameraUtils.py:747(<listcomp>)
28545 10.038 0.000 10.038 0.000 {palpy.mappa}
321880 9.009 0.000 20.977 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_utils/2.9.0.sims/python/lsst/sims/utils/ZernikeModule.py:136(_evaluate_radial_array)
41410 8.537 0.000 63.184 0.002 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/LsstZernikeFitter.py:286(_apply_transformation)
22519 8.206 0.000 8.206 0.000 /global/common/software/lsst/cori-haswell-gcc/stack/w.2018.26_sim2.9.0/stack/miniconda3-4.3.21-10a4fa6/Linux64/sims_coordUtils/2.9.0.sims/python/lsst/sims/coordUtils/CameraUtils.py:989(<listcomp>)
207 7.580 0.037 7.580 0.037 {built-in method numpy.core.multiarray._vec_string}
22519 7.454 0.000 7.454 0.000 /global/cscratch1/sd/jchiang8/imsim_pipeline/sims_GalSimInterface/python/lsst/sims/GalSimInterface/galSimDetector.py:359(<listcomp>)
31764 6.994 0.000 111.381 0.004 /global/cscratch1/sd/jchiang8/imsim_pipeline/sims_GalSimInterface/python/lsst/sims/GalSimInterface/galSimCameraWrapper.py:509(pixelCoordsFromPupilCoords)
I'll redo the full sensor-visit to check the final speedup.
Since this issue may be a more useful reference than the #desc-dc2-catlimits slack channel, I would like to record the outcome of that discussion. I'll do this in the form of a short set of bullets with links to the associated slack comment for those who want more detail:
The outcome is that we will go with proposal A (or effectively C-): keep the catalogs as-is, and simulate as many years of Run 2.0i WFD as we can given the available resources (3-5 years depending on how things go). This choice was enabled in part by multiple optimizations described in this GitHub thread, see Jim Chiang's comments above.
Qualitative science justification: Mike Jarvis provided an expanded version of the science justification for wanting objects below the detection limit for WL here; Ian Dell'Antonio responded immediately after that with the justification for CL. Both were based on previous studies that justify the need for an input catalog below LSST depth, but with some uncertainty as to how far below is needed. Michael Wood-Vasey produced an NSF-style broader impact statement beyond WL and CL here.
Quantitative science justification: There was agreement that we'd like to quantify the needed input catalog depth better in the future, but we don't have the tools to do so now. (DC2 might become a way to do that, feeding into future simulation efforts, which should consider the input catalog when evaluating resource constraints at an earlier stage of the process if possible.)
Additional nuances: Anze commented on area vs. depth tradeoffs and the connection between input catalog depth vs. how many years of survey are simulated (among other issues) here. Multiple people (Andrew Hearin, Eric Gawiser, and others) commented on the potential uncertainties in the simulated faint galaxy population starting around here.
Sanity check of faint galaxy number counts: I provided a comparison with published number counts from pencil-beam surveys, concluding as follows: "when compared with very deep pencil-beam surveys in i-band, the cosmoDC2 cumulative number counts agree with those deep pencil-beam surveys to ~10% at I<26.5 and I<28 (better than I had hoped for!). In z-band, the agreement seems to be 10% at z<24 and more like 50% at z<28." See this post for obnoxious amounts of detail and references.
@katrinheitmann - I am closing this issue, but feel free to re-open if needed.
For those interested in the final summary of the code changes to speed up the runtime, I have a comment here in the imSim repo.
While this issue was closed a while ago, the question was revisited later and the conclusion stated above (to not make any cuts) was overturned. As discussed in the DC2 telecon on Nov. 14, the final decision was to make a cut at r~29 (https://docs.google.com/presentation/d/1FekyEB5lEMy7m8Jst4_N5pK-a_iWsL-Po3Wx7E9wA-E/edit#slide=id.g47b4924efc_0_0). This was implemented in Run 2.1i and is the cut used in the DC2 image simulations. However, cosmoDC2_440 has no cuts, so the first part of plan A still holds.
All, the discussion in issue #261 has gotten quite far off from the topic of that issue (checklist for cosmoDC2) and I would like to put together a list of options here that we have to consider to make progress. Please feel free to edit the notes below, in particular @jchiang87, @aphearin, and @villarrealas
Plan A: Keep the full area and depth that we currently have in cosmoDC2
Plan B: Make an r-band cut at ~27
Plan C: Reduce the footprint from 300 sq deg to ?? or reduce from 10 years to ?? years
Plan D: Go with Plan B and at a later stage add smaller areas at full depth. Everything else is the same as in Plan B, but we might take a lesser hit on the science.
Plan E: Make an r-band cut at ~28 and downsample the number of galaxies at r > 28 to get back a factor of 5 in the memory requirement
Option E was just suggested by Andrew H. and seems interesting.
As pointed out, in order to make a decision, we need to (a) get a clearer idea of the resource requirements if we go with anything but Plan B, and (b) understand which projects on our list will be affected by any plan that is not Plan A.