LSSTDESC / DC2-production

Configuration, production, validation specifications and tools for the DC2 Data Set.
BSD 3-Clause "New" or "Revised" License

cosmoDC2 Run 2.0 checklist #261

Closed: katrinheitmann closed this issue 6 years ago

katrinheitmann commented 6 years ago
danielsf commented 6 years ago

I just want to record this here: we are going to need cosmoDC2_v1.0.0 to be subdivided into healpixels finer than nside=8. Testing on v0.4 yielded a memory footprint of 66GB. That is almost certainly too large for production.

patricialarsen commented 6 years ago

@danielsf Thanks for getting an answer on that. Will nside=16 be okay (a quarter of the size of what we currently have), or do we need even finer resolution? I also assume this footprint test was on the group of 4 pixels used for the corner cases; if not, let me know.

evevkovacs commented 6 years ago

I added the item to the checklist, but left the further subdivision TBD. Can you definitely confirm that nside=32 will be needed?

danielsf commented 6 years ago

@patricialarsen I did test on an intersection of 4 healpixels.

Going to nside=16 will probably get us back to the memory footprint (~10GB) we saw on cosmoDC2 v0.1. @TomGlanzman said that was "okay", but reducing by another factor of four would be better (it has to do with how many InstanceCatalogs we can generate on one node at one time). How hard would it be to go down to nside=32?
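As a rough sanity check on the healpix scaling being discussed, here is a minimal sketch (assuming healpy is available; the 66 GB baseline is the v0.4 number quoted above, and the area-proportional scaling is only a naive expectation, not a measurement):

```python
# Sketch: how pixel count and sky area per pixel scale with nside, and a
# naive per-pixel size estimate assuming data volume tracks pixel area.
import healpy as hp

baseline_gb = 66.0  # approximate footprint reported for v0.4 at nside=8

for nside in (8, 16, 32):
    npix = hp.nside2npix(nside)                   # 12 * nside**2
    area = hp.nside2pixarea(nside, degrees=True)  # sq deg per pixel
    est_gb = baseline_gb * (8.0 / nside) ** 2     # naive area scaling
    print(f"nside={nside:2d}: {npix:5d} pixels, "
          f"{area:6.2f} sq deg/pixel, ~{est_gb:5.1f} GB/pixel")
```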

aphearin commented 6 years ago

We could throw away all galaxies fainter than magnitude 28 in, say, the LSST r-band. That would reduce the file size by a factor of 5 and should have no observational consequences that are relevant to the survey.

Here's a simple plot of all galaxies at all redshifts:

[Figure: appmag_function]

CC @rmandelb

jchiang87 commented 6 years ago

Using cosmoDC2_v0.4.12 with the generateInstCat.py script, the resulting r-band instance catalogs I'm getting appear to be a factor of 10 larger than what we saw for Run1.2i. Having that many more objects has a significant impact on the memory used by imsim, increasing the per-sensor-visit memory footprint by more than 1GB, in addition to making the runtimes longer. So if we could reduce the number of galaxies to render by at least a factor of 5, I'd feel a lot more comfortable about our chances of running smoothly on Theta next week.

aphearin commented 6 years ago

Here's a useful figure for gauging the file-size reductions we can get by making cuts on apparent magnitude: as above, 5x for a cut at 28, and a full order of magnitude reduction if we throw out galaxies fainter than magnitude 27.

[Figure: cumulative_fraction]
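For readers who want to reproduce this kind of estimate on the catalog itself, here is a rough sketch; the catalog name and the 'mag_r' column are assumptions about the cosmoDC2/GCRCatalogs schema used for illustration, not verified names:

```python
# Sketch: fraction of galaxies surviving a given apparent-magnitude cut.
import numpy as np

def surviving_fraction(mag, cuts=(27.0, 28.0)):
    """Return the fraction of objects brighter than each magnitude cut."""
    mag = np.asarray(mag)
    return {cut: float(np.mean(mag < cut)) for cut in cuts}

# Hypothetical GCRCatalogs usage (catalog and column names are illustrative):
# import GCRCatalogs
# cat = GCRCatalogs.load_catalog('cosmoDC2_v1.0_image')
# data = cat.get_quantities(['mag_r'])
# print(surviving_fraction(data['mag_r']))
```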

jchiang87 commented 6 years ago

It would be great if we could recover that factor of 10 by making an r-band cut at ~27. In addition to the memory hit, the runtimes also scale with the number of objects in that regime. Given our Run1.2i experience, even a factor of 2 longer runtimes for Run2.0i will be a big deal.

aphearin commented 6 years ago

I am perfectly happy to delete either r or i > 27 wholesale. That would also be my preference, actually, since so little science will be impacted, and all aspects of the compute burden get a huge reduction. The only reason these galaxies were there in the first place was to complete the luminosity function down to i<26.5 in a physically motivated way for blending studies. Who else should be consulted about this choice? @RobertLuptonTheGood? @rmjarvis?

evevkovacs commented 6 years ago

Reducing the file size by making a magnitude cut is preferable for many reasons: drastic reduction in the amount of storage needed for the catalog, faster I/O, faster run times for validation etc. The current file size for 1 healpixel is about 630 GB. We can live with this for 14 healpixels, but we will definitely need to reduce the file sizes when it comes to storing the full octant, which has ~100+ healpixels.
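To make the storage implications concrete, here is some back-of-the-envelope arithmetic using only the numbers quoted in this thread (630 GB per nside=8 healpixel, 14 pixels now, of order 100+ for the octant, and the ~5x / ~10x reductions from the proposed r < 28 / r < 27 cuts):

```python
# Sketch: total catalog storage for different pixel counts and cut depths.
per_pixel_gb = 630.0

for n_pixels in (14, 100):
    for reduction in (1, 5, 10):   # no cut, r < 28 (~5x), r < 27 (~10x)
        total_tb = per_pixel_gb * n_pixels / reduction / 1024.0
        print(f"{n_pixels:3d} pixels, {reduction:2d}x reduction: ~{total_tb:5.1f} TB")
```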

patricialarsen commented 6 years ago

@danielsf To answer your question, it's just as easy to divide into nside=16 or nside=32 (giving a 4x or 16x reduction in file size). I'll make sure we get at least the factor-of-16 reduction in galaxy numbers you're hoping for when combining the subdivision and the magnitude cut, and I'll choose the subdivision based on that, if that sounds good to you.

cwwalter commented 6 years ago

> I am perfectly happy to delete either r or i > 27 wholesale. That would also be my preference, actually, since so little science will be impacted, and all aspects of the compute burden get a huge reduction. The only reason these galaxies were there in the first place was to complete the luminosity function down to i<26.5 in a physically motivated way for blending studies. Who else should be consulted about this choice? @RobertLuptonTheGood? @rmjarvis?

I think @rmandelb will have a good overview of the possible impact on the analysis groups. For this discussion in particular, I would think the blending task force leads @dkirkby and @burchat should be consulted.

rmjarvis commented 6 years ago

Whoah! Hold on a bit on the magnitude cuts. I think there is a fair amount of science that is counting on the input catalog being fairly deep. Like 2 magnitudes fainter than the (coadd) limiting magnitude kind of deep.

My understanding is that LSST's r-band limiting magnitude is expected to be around 27.8 (according to the Science Book). If that's right and DC2 is trying to get to that depth, then we definitely don't want to be throwing out r or i > 27! Even r > 28 would be cutting it quite close to the limiting depth.

Probably you could safely cut objects with mag > 30 in all bands. Or more specifically, mag > limiting mag + 2 in all bands. That's about the sub-detection level that people think impacts the measurements of detected objects via blending effects, background subtraction, sky estimation, etc. It would definitely be a shame to not probe any of these possible effects in DC2 if it's not absolutely required.

katrinheitmann commented 6 years ago

@rmjarvis I think, given the amount of resources we have right now (and the amount of resources worthwhile to spend on DC2 in general, remember 1 cent/core hour ...), we have to be more specific about the science we would lose if we made the cuts described above. We also have to think about whether, if we do not make those cuts, we can make other cuts instead (e.g., a reduction of area).

The blending task force indicated that they were not really ready for DC2 in general anyway. For their goals, do they need 300 sq degrees? Or could we in the future go back to a much smaller footprint without cuts, and would they be happy with that as a test set? The same goes for sky estimation, background subtraction, etc.: do we need 300 sq degrees? At this point, if we want to do anything with DC2 in the near future, we will have to make some compromise. If you could write down requirements (area etc.) and science goals for each of the topics we would lose by making the cuts suggested above, so that we could evaluate whether we can do them at a later stage, that would be very helpful. @burchat @dkirkby @rmandelb Please chime in as well! Thanks all --

salmanhabib commented 6 years ago

Or maybe one can do 300 sq. deg. with the shallower option and a mini-survey (tens of square degrees) with the deeper option; it depends on the science case and whether the image catalog can be trusted at these magnitudes anyway.

rbiswas4 commented 6 years ago

I have a quick comment. The SN that have been put in (and will be put in by applying the code to other quadrants) likely include hosts with fainter magnitudes. As you go to higher and higher redshifts, such a cut affects a larger and larger fraction of the host population.

On the other hand, the number of SN (only out to z ~ 1.4 in the mDDF and ~1.0 in the MS) is already tiny compared to the number of galaxies (~60K across all redshifts in an NSIDE=8 healpixel), and a majority of them don't live in such faint galaxies. So if the cut were modified to keep the few SN-host galaxies that would otherwise be removed, I think nothing would change in terms of performance.
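A minimal sketch of the kind of modified cut being proposed here; the column names, the separate list of SN host IDs, and the cut value are illustrative assumptions, not the production implementation:

```python
# Sketch: keep galaxies brighter than the cut OR hosting a simulated SN.
import numpy as np

def build_keep_mask(mag_r, galaxy_id, sn_host_ids, mag_cut=27.0):
    brighter = np.asarray(mag_r) < mag_cut
    is_sn_host = np.isin(galaxy_id, sn_host_ids)
    return brighter | is_sn_host
```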

dkirkby commented 6 years ago

If the full 300 sq.deg. needs to be cut at r~27, I suggest also simulating a small area (3-5 sq.deg.) with a cut at r < 29. This would provide a straightforward way to test the sensitivity of any DC2 study to this cut, and also provide a useful dataset for studies where we know objects below the detection threshold are important but are currently limited to the 1 sq.deg. r < 28 CatSim catalog.

egawiser commented 6 years ago

In my opinion, this discussion needs to be paused and broadened. If we can only simulate 300 square degrees by cutting out half of the galaxies that would be detected in 10-year LSST imaging, we should seriously consider simulating a smaller area - or just simulating the first 1-3 years of LSST WFD imaging so that our input catalog and simulated depth are better matched. This conversation of course would have been nicer to have at the outset of DC2 RQ, but I understand that we're learning about conflicts between storage and CPU requirements and available resources as we go.

cwwalter commented 6 years ago

Dear All,

Echoing Eric's comments: first of all, I think it would be useful to hear what our technical constraints really are. Perhaps there are side discussions happening, but from this thread I'm not sure I really understand whether we can't do what we have been planning at all, or whether it will just take longer (and if so, how long), and whether there might be technical solutions to the problems that we could implement.

Of course we have gone through a long process, with input from many stakeholders, to understand what people need for these studies, and our goal is to produce a set of simulations that are generally useful and will last us a few years. So making last-minute changes on a time scale of a few days is probably not a good idea, and it is additionally likely to introduce errors we will not catch. Our goal is to do a good job, with outputs that we will be able to use effectively. As noted by Mike, if we cut near or over our limit, we won't even be able to measure the shape of our detection efficiency.

If we find we really can't make things run as we were planning, I would suggest the best thing is to pause and not start the 2.0i run now. I understand the strong desire not to lose this time, but perhaps we could find another focused data set to make, or at least run another 1.2 with all of our options etc., which we could then check carefully until we start Run 2.0i with new ALCF, NERSC, or GridPP resources soon.

I think another option (I think also suggested by Eric above) if it turns out that doing things as we expect now is just going to be slower (as opposed to impossible) is to (say) run the 1st 3 years as we were planning. This would still use the ALCF time and give us a data set to look at and carefully check. Then we could work on technical solutions or reevaluations to be able to run more quickly and finish as soon as we request more time.

But, mostly as Eric says "this discussion needs to be paused and broadened" and not everyone who needs to weigh in is available today.

katrinheitmann commented 6 years ago

I am going to move this to a new issue. This issue was really a checklist for cosmoDC2. We should continue working on making cosmoDC2 ready in any case. Maybe that will still include some more cuts, but the discussion about the Run 2.0 specification does not belong here.

On 9/8/18 12:02 PM, Chris Walter wrote:

> [Chris Walter's comment above, quoted in full]

cwwalter commented 6 years ago

For people not following this repo who were mentioned and want to follow this conversation, you can find it in:

https://github.com/LSSTDESC/DC2-production/issues/263

TomGlanzman commented 6 years ago

Responding to @danielsf on the memory footprint of instanceCatalog generation: in general, the smaller the better. Running at NERSC, we are limited to either Cori-Haswell (2 GB/execution thread) or Cori-KNL (0.35 GB/execution thread). Achieving 2-4 GB/instance would be a reasonable match to the computing resources. A larger memory footprint means we are paying for CPU resources that are not used.
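A quick way to see what this means for job packing; the node memory totals below (Cori Haswell ~128 GB, Cori KNL ~96 GB DDR) are assumptions for illustration, not numbers from this thread:

```python
# Sketch: how many instance-catalog jobs fit on one node at a given footprint.
def jobs_per_node(footprint_gb, node_mem_gb):
    return int(node_mem_gb // footprint_gb)

for footprint_gb in (2, 4, 10, 21, 66):
    print(f"{footprint_gb:3d} GB/job: "
          f"Haswell ~{jobs_per_node(footprint_gb, 128)}, "
          f"KNL ~{jobs_per_node(footprint_gb, 96)} concurrent jobs")
```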

Does someone have a timing estimate for the instanceCatalog generation step? Does execution time vary depending on the healpix 'nside' value?

danielsf commented 6 years ago

I was just able to generate a fov=2.1 degree InstanceCatalog in 3 hours and 20 minutes. I ran /usr/bin/time, but forgot to specify --verbose. The output was

10876.51user 508.62system 3:19:49elapsed 94%CPU (0avgtext+0avgdata 21382336maxresident)k
0inputs+81137576outputs (0major+69408272minor)pagefaults 0swaps

which I interpret to mean a 21GB memory footprint.
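For reference, /usr/bin/time reports maxresident in kilobytes, so the conversion behind that interpretation is just:

```python
# Sketch: convert the maxresident figure above from kB to GiB.
maxresident_kb = 21382336
print(f"~{maxresident_kb / 1024**2:.1f} GiB")   # ~20.4 GiB, i.e. roughly 21 GB
```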

Is this consistent with your test, @jchiang87 ?

jchiang87 commented 6 years ago

Here's the memory profile plot for my run using cosmoDC2-v1.0:

[Figure: cosmodc2_v1.0_219976_2.04deg memory profile]

The memory plateaus at ~16GB, but I used a smaller fov (2.04 deg), if that matters at all. The lower parts after ~105 min correspond to the gzipping of the txt files. The output from the time command is

real    176m6.078s
user    154m38.212s
sys     10m59.831s
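For anyone wanting to produce a similar memory profile, here is a simple sketch using psutil to sample a process's resident memory over time (this is not necessarily the tool used for the plot above):

```python
# Sketch: sample a process's RSS periodically to build a memory profile.
import time
import psutil

def profile_memory(pid, interval_s=10.0, duration_s=4 * 3600.0):
    proc = psutil.Process(pid)
    samples = []
    t0 = time.time()
    while time.time() - t0 < duration_s and proc.is_running():
        rss_gb = proc.memory_info().rss / 1024**3
        samples.append((time.time() - t0, rss_gb))
        time.sleep(interval_s)
    return samples   # list of (elapsed seconds, RSS in GiB)
```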
katrinheitmann commented 6 years ago

cosmoDC2 has been delivered and has been officially released. All action items listed above have been taken care of.