LSSTDESC / SSim_DC1

Configuration, production, validation specifications and tools for the DC1 Data Set.

Investigate adding cores to Edison/Cori shared partition #21

Closed. richardxdubois closed this issue 7 years ago.

richardxdubois commented 7 years ago

At 2500 cores, Tom estimates 3/4 yr to run DC1. Gulp. The least-work mode for DESC currently is to use the shared partition. To be useful we would need several times more than 2500 cores. @djbard was asked to investigate upwards of 10k cores in the shared partition.
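
For scale, a rough back-of-the-envelope of the compute volume this implies, assuming continuous, fully packed running:

```python
# Rough scale of the DC1 compute volume implied by "2500 cores for ~3/4 year".
# Assumes continuous, fully packed running; real throughput would be lower.
cores = 2500
fraction_of_year = 0.75
hours_per_year = 24 * 365

core_hours = cores * fraction_of_year * hours_per_year
print(f"~{core_hours / 1e6:.1f}M core-hours")  # ~16.4M core-hours
```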

djbard commented 7 years ago

An increase of 50% in the shared queue has already been approved, i.e. to 3% of the Haswell partition, from 2% currently. We'll have 1900 Haswell nodes after integration, so 3% is 57 nodes, or 1824 cores at 32 cores per node.

To motivate a further increase you'll need to make some slides that I can take to management. Cori will be under a lot of pressure when it's returned to users, so we'll need a very strong case for dedicating more resources to the shared queue. Fair warning: I think it's very unlikely we'll get 300 nodes in the shared partition, but let's make the case. @TomGlanzman I think the slides you showed last week had most of this material - that's where you should start. You'll need to address the following questions:

  1. What's the memory footprint for each job (i.e. what you'll be requesting in the queue)?
  2. What's the time length you'll be requesting for each job?
  3. What's the time frame - i.e. when will you start submitting the jobs? When do they need to be finished?
  4. What development work will happen on the code and on what timeframe (e.g. reduce memory footprint, use checkpointing)? It will be useful if you can demonstrate that DESC is working on making the code fit on NERSC architecture, particularly in the long term, rather than just demanding NERSC resources fit the code.
  5. Why can't you use some method of packing multiple jobs onto one node? (e.g. qdo or http://www.nersc.gov/users/data-analytics/workflow-tools/taskfarmer/). This will be the first question you'll be asked :)
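
To make (5) concrete, here is a minimal sketch of the packing idea in plain Python. It only illustrates the concept of running many single-core tasks on one node; it is not qdo or TaskFarmer themselves, and the raytrace command line is a hypothetical placeholder.

```python
# Minimal sketch of packing many single-core tasks onto one node.
# Illustrates the idea behind qdo/TaskFarmer; it is not those tools.
import multiprocessing
import subprocess

def run_task(command):
    """Run one single-core task and return its exit code."""
    return subprocess.call(command, shell=True)

if __name__ == "__main__":
    # One entry per sensor/visit chunk; in practice the list would come
    # from the workflow bookkeeping.  The script name is a placeholder.
    tasks = [f"./run_raytrace.sh --chunk {i}" for i in range(32)]

    # One worker per physical core on a Cori Haswell node (32 cores).
    with multiprocessing.Pool(processes=32) as pool:
        exit_codes = pool.map(run_task, tasks)

    print("failed tasks:", [i for i, rc in enumerate(exit_codes) if rc != 0])
```
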
TomGlanzman commented 7 years ago

Thanks Debbie. Understood that slides will be needed. What follows continues the discussion eventually leading to those slides.

  1. The raytrace process is taking a stable 2-3 GB to run. The parent python script can take much more; this has been brought to the attention of the phoSim developers, but the issue is still open.
  2. Assuming we can get some form of checkpointing running successfully (a minimal sketch of the workflow-level restart idea follows this list), the job times are somewhat arbitrary. Keeping in mind the trade-off between short jobs and start-up I/O load, something less than the maximum (48 hours) is one possibility -- say 40 hours. Or, we could agree to use smaller chunks, such as 10-24 hours, if that would be a better fit for NERSC operations. If we cannot get checkpointing working, then we must pick the longest possible time in order to maximize the number of successful jobs, i.e., 48 hours.
  3. The time scale for beginning the Deep DC1 project is November (??) with a 1-2 month phoSim generation. Chris Walter or others may wish to clarify this point.
  4. The memory footprint issue is still being understood. Until the development team recognizes this as a problem, no work is likely to happen. From the bitbucket chat I've seen, my guess is that this is probably a solvable problem with modest effort, but I could be wrong. On the other hand, the development team is working on multi-threading the phoSim code. We users have no technical details of how this will be implemented or whether it will be successfully completed (and tested, validated, etc.) on a time frame of interest to Deep DC1. In the long term, this is the way the phoSim developers are headed. Multi-threading would directly address the issue of utilizing many cores on a single host - if the developers' promise holds true.
  5. There is probably no reason we cannot try to use the tools you mention, other than the effort required to investigate them and develop a completely new way of running a large-scale phoSim production. Bookkeeping, file organization, monitoring, submission and statistics are some of the aspects that the SLAC Pipeline workflow engine provides. From long-ago discussions, I did not think these other tools provided that level of support, but I will look again. Alternatively, we could create an interface from the SLAC Pipeline to one of these tools...but that would require the specialized expertise of, and development by, Tony/Brian.
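
On (2), here is a minimal sketch of what workflow-level checkpointing could look like, independent of whatever phoSim provides internally. The state file name and chunk naming are placeholders for illustration only; the point is that a resubmitted job can skip work already completed before the previous job hit its wall clock.

```python
# Minimal workflow-level checkpoint/restart sketch (illustration only;
# not phoSim's internal checkpointing).  Completed work units are recorded
# in a small state file so a resubmitted job skips them on restart.
import json
import os

STATE_FILE = "checkpoint.json"   # placeholder name

def load_done():
    """Return the set of work units already completed by earlier jobs."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return set(json.load(f))
    return set()

def mark_done(done, unit):
    """Record one finished unit so a restarted job can skip it."""
    done.add(unit)
    with open(STATE_FILE, "w") as f:
        json.dump(sorted(done), f)

def process(unit):
    """Placeholder for one unit of raytrace work (e.g. one sensor chunk)."""
    pass

if __name__ == "__main__":
    work_units = [f"chunk_{i:03d}" for i in range(100)]
    done = load_done()
    for unit in work_units:
        if unit in done:
            continue              # finished in an earlier job
        process(unit)
        mark_done(done, unit)     # survives the job hitting its wall clock
```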

It seems fair to ask NERSC a question about their system and how it is managed. I have heard various rumors about increasing support for data-intensive computing. Is this term a proxy for jobs that run in the shared/serial queue and, if so, why the limit of 3%? Once Cori comes fully online, will not the lion's share of processing be on the KNL partition? How about dedicating the entire Haswell partition to a more flexible arrangement, with the queue/partition boundaries adjusted dynamically according to need and demand, up to and including 100% of Haswell for single-core jobs?

cwwalter commented 7 years ago

From @TomGlanzman:

The time scale for beginning the Deep DC1 project is November (??) with a 1-2 month phoSim generation. Chris Walter or others may wish to clarify this point.

Our original schedule (see these milestones): https://github.com/DarkEnergyScienceCollaboration/SSim_DC1_Roadmap/milestones

So our plan was that this month we would set things up and do the validation, and then start the production next month, taking one or two months.

Clearly this is not going to be done in the next two days, so we need to reset the timeline. I would hope one month would be the maximum.

From @TomGlanzman:

The memory footprint issue is still being understood. Until the development team recognizes this as a problem, no work is likely to happen. From the bitbucket chat I've seen, my guess is that this is probably a solvable problem with modest effort, but I could be wrong. On the other hand, the development team is working on multi-threading the phoSim code. We users have no technical details of how this will be implemented or whether it will be successfully completed (and tested, validated, etc.) on a time frame of interest to Deep DC1. In the long term, this is the way the phoSim developers are headed. Multi-threading would directly address the issue of utilizing many cores on a single host - if the developers' promise holds true.

From what I have heard, we should not be planning on relying on multi-threaded PhoSim for DC1. I don't think it has been made publicly available yet.

djbard commented 7 years ago

Any chance of getting these slides together soon? We've already started the queue discussion for Cori as a whole, and it's hard for me to make the case for your needs without something substantial to show the group (but rest assured I am arguing the case).

From @TomGlanzman:

It seems fair to ask NERSC a question about their system and how it is managed. I have heard various rumors about increasing support for data-intensive computing. Is this term a proxy for jobs that run in the shared/serial queue and, if so, why the limit of 3%? Once Cori comes fully online, will not the lion's share of processing be on the KNL partition? How about dedicating the entire Haswell partition to a more flexible arrangement, with the queue/partition boundaries adjusted dynamically according to need and demand, up to and including 100% of Haswell for single-core jobs?

There is zero chance of the entire Haswell partition going to the shared configuration - it has to serve a large user base - but if you can make the case then we can try to press for a larger proportion than currently planned. Data-intensive computing does include "high-throughput" single-core jobs, but also encompasses large multi-node machine-learning codes, high-IO multi-node codes and real-time computing for running experiments. If you're really interested, we have a paper on the topic here.

richardxdubois commented 7 years ago

We're working on that today. The basic issue is getting signoff to cut out bright stars; that could make a x4 difference in the request. If that pans out, and we say we want a 1-month turnaround, the request could be that we get 3k cores for that month...

djbard commented 7 years ago

Interesting! I'll join the Twinkles meeting today and catch myself up.

richardxdubois commented 7 years ago

Actually it is a x8 difference, not x4... I'll be at physio during the Twinkles meeting, though.
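
For reference, a rough back-of-the-envelope behind the ~3k-core figure above, using the corrected x8 factor and assuming the reduction applies uniformly:

```python
# Back-of-the-envelope: cores needed for a one-month turnaround, starting
# from "2500 cores for ~9 months" and a x8 reduction from cutting the
# bright stars (corrected from x4 above).  Assumes the speedup is uniform.
baseline_cores = 2500
baseline_months = 9          # ~3/4 of a year
reduction_factor = 8
target_months = 1

core_months = baseline_cores * baseline_months / reduction_factor
cores_needed = core_months / target_months
print(f"~{cores_needed:.0f} cores for one month")  # ~2800, roughly the 3k quoted
```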

cwwalter commented 7 years ago

We wound up using about 20M hrs for DC1. The CI group is now making estimates of our DC2 needs.