LSSTDESC / ComputingInfrastructure

Gathering place for CI - Computing and Infrastructure - issues

Inputs for AY2018 ERCAP/NERSC allocation proposal #51

Closed richardxdubois closed 6 years ago

richardxdubois commented 6 years ago

We need to submit our request to NERSC for our AY2018 allocation by COB on Monday, Oct 16. It will largely be based on DC2 needs - and we need to be sure to capture tape (SRU) needs this time. Salman has collected a handful of individual requests as well.

I believe the current PhoSim estimate is 70M NERSC-hrs for the 300 sq deg sky.

What else?

salmanhabib commented 6 years ago

Who is this Salam guy? ;-) Yes, there are other cases -- I suspect we will end up asking for about 100M NERSC hours. I'll start putting stuff together today and aim to have a rough cut done Friday night.

cwwalter commented 6 years ago

imSim run + DM processing.

Is it possible to actually do some run and see if the 70M hr # for PhoSim is reasonable in practice first?

richardxdubois commented 6 years ago

I've been asking for a fractional DC1 rerun with 3.7 to benchmark it...

salmanhabib commented 6 years ago

Chris, do you guys have estimates for what you need? The DC1 tests that we'll run with PhoSim now should tell us if the 70M estimate is robust. Hope to push on some tests next week.

jchiang87 commented 6 years ago

Using results from James' performance studies, for the imSim runs I get

NERSC hours = 96*(824 pointings)*(30 fovs)*(1.4 dither factor)*(189 sensors)*(1.4 KNL hours per sensor-visit)/(48 jobs/node) = 18M

This is similar to the calculation Salman used for the phosim estimate.

Based on James' cProfile results, where GalSim drawImage was only 6% of the cpu time, I'm sure we can reduce this by a large factor by doing the sims calculations more efficiently, but this should serve as a baseline estimate until we can figure out how to streamline the sims stuff.
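The arithmetic behind the 18M figure can be reproduced with a short script. This is only a sketch: the variable names are illustrative, and reading the leading 96 as the NERSC charge factor per KNL node-hour is an assumption, since the comment does not label it.

```python
# Back-of-the-envelope check of the imSim estimate quoted above.
# All values come from the comment; the names are illustrative only.
CHARGE_FACTOR = 96                 # assumed: NERSC hours charged per KNL node-hour
POINTINGS = 824
FOVS_PER_POINTING = 30
DITHER_FACTOR = 1.4
SENSORS = 189
KNL_HOURS_PER_SENSOR_VISIT = 1.4
JOBS_PER_NODE = 48


def imsim_nersc_hours() -> float:
    """Return the imSim estimate in NERSC hours."""
    sensor_visits = POINTINGS * FOVS_PER_POINTING * DITHER_FACTOR * SENSORS
    node_hours = sensor_visits * KNL_HOURS_PER_SENSOR_VISIT / JOBS_PER_NODE
    return CHARGE_FACTOR * node_hours


if __name__ == "__main__":
    print(f"imSim estimate: {imsim_nersc_hours() / 1e6:.1f}M NERSC hours")
```

With these inputs the result is about 18.3M, consistent with the 18M quoted.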

richardxdubois commented 6 years ago

For SRUs - we estimated 1 instance of DC2 would take up 1 PB (compressed) of image and related files. (reminder: and we've only budgeted for 1 PB of storage to buy at NERSC...)

jchiang87 commented 6 years ago

So we need to decide if we really want full image simulation for both phosim and imsim. I would expect not for the final DC2 dataset.

richardxdubois commented 6 years ago

or can we be creative - would both instances need to be fully on disk?

cwwalter commented 6 years ago

Right, we might only need the full individual exposures and warps etc always on disk for a few patches for people to do studies. We might only need the co-adds there all the time.

jchiang87 commented 6 years ago

Here is an estimate for Level 2 processing of 1 instance of DC2 image data:

NERSC hours = 96*(824 pointings)*(30 fovs)*(1.4 dither factor)*(21 rafts)*(6 KNL hours/raft-visit)/(68 raft-visits/node) = 5.5M

This is based on 1 Haswell hour per raft-visit estimated from the DC1 undithered processing, where 90% of the total processing time was for the processEimage task, which took ~0.9 Haswell hours/raft. The maximum memory usage for each single core job was ~1.1GB so we should be able to use all 68 cores/node given the 96GB memory per KNL node.

If we analyze both phosim and imsim instances of the DC2 data, that's 11M NERSC hours.
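For anyone checking the numbers, here is a minimal sketch of the Level 2 estimate. It assumes the leading 96 is the charge factor per KNL node-hour and that the factor of 6 converts Haswell hours to KNL hours; both are readings of the comment, not stated facts. Evaluated as written (6 KNL hours/raft-visit) the expression gives about 6.2M, while using 0.9 Haswell hours per raft-visit gives about 5.5M, which matches the quoted figure.

```python
import math

# Values from the comment; names and interpretations are illustrative.
CHARGE_FACTOR = 96          # assumed: NERSC hours charged per KNL node-hour
KNL_NODE_MEMORY_GB = 96
KNL_CORES_PER_NODE = 68
HASWELL_TO_KNL = 6          # assumed: KNL hours per Haswell hour
RAFT_VISITS = 824 * 30 * 1.4 * 21   # pointings * fovs * dither factor * rafts


def jobs_per_node(max_memory_gb: float) -> int:
    """Single-core jobs that fit on one KNL node, limited by memory and core count."""
    return min(KNL_CORES_PER_NODE, math.floor(KNL_NODE_MEMORY_GB / max_memory_gb))


def level2_nersc_hours(haswell_hours_per_raft_visit: float) -> float:
    """NERSC hours for processEimage-dominated Level 2 processing."""
    knl_hours = haswell_hours_per_raft_visit * HASWELL_TO_KNL
    return CHARGE_FACTOR * RAFT_VISITS * knl_hours / jobs_per_node(1.1)


if __name__ == "__main__":
    for haswell_hours in (1.0, 0.9):
        total = level2_nersc_hours(haswell_hours)
        print(f"{haswell_hours} Haswell h/raft-visit -> {total / 1e6:.1f}M NERSC hours")
```

The memory check confirms that 1.1 GB jobs are limited by the 68 cores rather than by the 96 GB of node memory.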

richardxdubois commented 6 years ago

Many thanks to @salmanhabib for preparing the ERCAP proposal. To recap the resources:

130M NERSC-hrs, 10M SRU, 1 PB HPSS, 200 TB scratch

I'm wondering if we want 2 PB HPSS, if we end up doing full PhoSim and ImSim instances of DC2.

The request needs to be submitted by midnight PT today.

salmanhabib commented 6 years ago

Right, I put in the 1PB on HPSS as a place-holder, one can always ask for more --

jchiang87 commented 6 years ago

According to Tony's processing numbers for DC1, the bulk of the resource cost for the coadd processing comes from the makeTempExpCoadd, assembleCoadd, and measureCoadd tasks. The number of jobs for each coadd task was the number of sky "patches" times the number of bands, and the number of patches scales with the sky coverage. For DC1 dithered, the number of patches was 841.

makeTempExpCoadd:
max memory for DC1 = 6GB -> 16 cores/KNL node
avg cputime for DC1 = 4262./3600. Haswell hours
NERSC hours = 96*841*(300 sq deg/40 sq deg)*(6 bands)/16*4262./3600.*6 = 1.6M

assembleCoadd:
max memory for DC1 = 20GB -> 4 cores/KNL node
avg cputime for DC1 = 2971./3600. Haswell hours
NERSC hours = 4.5M

measureCoadd:
max memory for DC1 = 37GB -> 2 cores/KNL node
avg cputime for DC1 = 6349./3600. Haswell hours
NERSC hours = 19.2M

Combining these numbers with the processEimage estimate above (accounting for the 0.9 factor error), the grand total is 5.5+1.6+4.5+19.2 = 31M NERSC hours for 1 DC2 instance.

There were reportedly memory bugs in the coadd task Stack code which are being fixed, so these numbers are likely upper limits.
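The three coadd estimates follow the same pattern, so they can be checked with one function. This sketch assumes the trailing factor of 6 in the makeTempExpCoadd expression is the Haswell-to-KNL conversion, that the leading 96 is the charge factor per KNL node-hour, and that cores per node is the 96 GB node memory divided by the peak job memory, rounded down. The names are illustrative.

```python
import math

CHARGE_FACTOR = 96          # assumed: NERSC hours charged per KNL node-hour
KNL_NODE_MEMORY_GB = 96
HASWELL_TO_KNL = 6          # assumed: KNL hours per Haswell hour
DC1_PATCHES = 841
AREA_SCALE = 300 / 40       # DC2 area over DC1 area, in sq deg
BANDS = 6

# task name -> (max memory in GB, average DC1 cpu time in Haswell seconds)
TASKS = {
    "makeTempExpCoadd": (6, 4262.0),
    "assembleCoadd": (20, 2971.0),
    "measureCoadd": (37, 6349.0),
}


def coadd_nersc_hours(max_memory_gb: float, cpu_seconds: float) -> float:
    """NERSC hours for one coadd task scaled from DC1 to the DC2 area."""
    jobs = DC1_PATCHES * AREA_SCALE * BANDS
    cores_per_node = math.floor(KNL_NODE_MEMORY_GB / max_memory_gb)
    knl_hours_per_job = cpu_seconds / 3600.0 * HASWELL_TO_KNL
    return CHARGE_FACTOR * jobs * knl_hours_per_job / cores_per_node


if __name__ == "__main__":
    coadd_total = 0.0
    for name, (memory_gb, seconds) in TASKS.items():
        hours = coadd_nersc_hours(memory_gb, seconds)
        coadd_total += hours
        print(f"{name}: {hours / 1e6:.1f}M")
    # 5.5M is the processEimage estimate quoted earlier in the thread.
    print(f"total with processEimage: {(coadd_total + 5.5e6) / 1e6:.1f}M")
```

Under these assumptions the per-task values come out at about 1.6M, 4.5M, and 19.2M, and the total at about 30.8M, in line with the 31M quoted.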

richardxdubois commented 6 years ago

I was just trolling our records, and see that at the DOE Ops review in April we had asked for 150M hrs in each of FY2018 and 2019.

richardxdubois commented 6 years ago

We've upped the CPU request to just under 150M to handle the L2 need.

I'm still a bit unsure about the SRUs. Just seems odd that we've used 1.4M so far for DC1, but only need 10M for DC2? Maybe @djbard or @MustafaMustafa could confirm 10M is fine?

salmanhabib commented 6 years ago

I believe this is connected with the number of files. Probably there are too many small files; they should be aggregated as a matter of course. If this is done in a reasonable manner, the SRU count will be fine. Either way, we can always ask for more once we start running DC2, it shouldn't be a problem.

MustafaMustafa commented 6 years ago

@salmanhabib, the number of files factor is negligible; here is the formula: yearly user SRUs = 0.01436*files + 4.787*space(GB) + 4*I/O(GB) (https://goo.gl/D2rhJw)

@richardxdubois , according to that equation, 10M should be enough for the planned 1PB.
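As a quick sanity check, the formula can be wrapped in a small function. This is a sketch that reads each coefficient as a multiplier and takes 1 PB as 10^6 GB; the file count used in the example is a made-up illustrative value, not a DC2 number.

```python
def yearly_srus(n_files: float, space_gb: float, io_gb: float) -> float:
    """Yearly user SRUs from file count, stored volume (GB), and I/O volume (GB)."""
    return 0.01436 * n_files + 4.787 * space_gb + 4.0 * io_gb


def io_budget_gb(sru_allocation: float, n_files: float, space_gb: float) -> float:
    """I/O volume (GB) that still fits in an allocation after files and storage."""
    remaining = sru_allocation - yearly_srus(n_files, space_gb, 0.0)
    return remaining / 4.0


if __name__ == "__main__":
    PB_IN_GB = 1e6                 # assumed decimal petabyte
    HYPOTHETICAL_FILES = 1e6       # illustrative file count only

    storage_only = yearly_srus(0, PB_IN_GB, 0)
    files_only = yearly_srus(HYPOTHETICAL_FILES, 0, 0)
    print(f"1 PB stored: {storage_only / 1e6:.2f}M SRUs")
    print(f"{HYPOTHETICAL_FILES:.0e} files: {files_only / 1e6:.3f}M SRUs")
    print(f"I/O headroom in 10M SRUs: {io_budget_gb(10e6, HYPOTHETICAL_FILES, PB_IN_GB) / 1e6:.2f} PB")
```

Storing 1 PB accounts for roughly 4.8M SRUs on its own, and a million files add only about 0.014M, which is why the file-count term is negligible here; the rest of a 10M allocation is available for I/O.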

salmanhabib commented 6 years ago

Ok, that was the only thing I could think of -- the 10M estimate came from my estimates of overall size and IO and a reasonable file size. In fact, I built in a safety factor of about 30% --

Who can make sense of this formula ;-)

richardxdubois commented 6 years ago

But since we upped the storage to 2 PB, in case we do 2 full DC2 instances, maybe we should go with 15M?

richardxdubois commented 6 years ago

We set it to 15M and I just hit the 'submit' button. Thanks everyone for gathering up what we needed to ask for, and again to @salmanhabib for preparing it!

richardxdubois commented 6 years ago

Got the receipt notice from NERSC...