Closed: ryanmrichard closed this issue 5 years ago.
FWIW this issue is intended to be a discussion so please comment and/or correct anything in it so far.
It would be disconcerting if we had to maintain two separate runtimes, as it greatly complicates NWChemEx for future developers: they would have to decide whether to write code using TAMM or another parallel runtime.
What parts of NWChemEx outside of TAMM do we expect we need to parallelize? It might be good to inventory that first.
@wadejong while having two runtimes is far from ideal, it looks like it's the boat we're in. As for applications of the non-TAMM runtime, in the short term (and part of the motivation for this issue) we need it for local methods. At the moment @keipertk and @jboschen are coding up local SCF outside of TAMM for a number of reasons. Outside of TAMM, local methods are most naturally expressed as nested loops over tensor expressions that work at a block level (think along the lines of direct SCF, but now with loops over domains too). Aside from local methods, and just off the top of my head: finite difference, potential energy surface scans, many-body expansion, and basis set superposition error corrections.
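To make that concrete, here's a minimal sketch of the nested-block-loop structure I mean, using Eigen for the dense blocks. This is my own illustration, not actual NWX/TAMM code; `Domain` and `build_local_fock` are hypothetical names.

```cpp
#include <utility>
#include <vector>
#include <Eigen/Dense>

using Block = Eigen::MatrixXd;

// Hypothetical per-domain data; not an actual NWX type.
struct Domain {
    Block density;  // density block restricted to this domain
    Block overlap;  // overlap block restricted to this domain
};

// Build a Fock-like block for every domain. Each iteration only touches
// its own domain's blocks, so the iterations are independent tasks.
std::vector<Block> build_local_fock(const std::vector<Domain>& domains) {
    std::vector<Block> fock;
    fock.reserve(domains.size());
    for (const auto& d : domains) {  // outer loop over domains
        // Inner loops over shell/tile blocks would go here, as in direct
        // SCF; this single contraction is just a stand-in.
        Block F = d.overlap * d.density * d.overlap;
        fock.push_back(std::move(F));
    }
    return fock;
}
```

The point is that the per-domain iterations are decoupled, which is exactly the shape of work a task-based runtime handles well.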
Now I am confused; I thought TAMM was going to be able to support the local SCF?
The examples you list at the end seem to be workflow or parallel-task-queue cases (which underneath could potentially still run TAMM). It would be interesting to see if we can simply hand TAMM a process-group communicator from a higher level. Sounds like we need two things:

1. A resource manager: an object that owns the MPI world and can farm work from a task queue out to process groups.
2. A lightweight runtime that can take a communicator and perform a task, which could still involve TAMM (see the sketch below).
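Something along these lines, perhaps. This is a rough sketch over plain MPI; `ResourceManager`, `Task`, and `farm` are names I made up for illustration.

```cpp
#include <functional>
#include <vector>
#include <mpi.h>

// A task is anything that can run on a communicator; internally it could
// still stand up TAMM on that communicator.
using Task = std::function<void(MPI_Comm)>;

// Hypothetical resource manager: owns the world communicator and farms
// tasks out to process groups.
class ResourceManager {
public:
    explicit ResourceManager(MPI_Comm world) : world_(world) {}

    // Split the world into n_groups groups and run the task assigned to
    // this rank's group on the resulting sub-communicator.
    void farm(int n_groups, const std::vector<Task>& tasks) {
        int rank;
        MPI_Comm_rank(world_, &rank);
        const int color = rank % n_groups;  // naive round-robin grouping
        MPI_Comm group;
        MPI_Comm_split(world_, color, rank, &group);
        tasks.at(color)(group);             // the "lightweight runtime" step
        MPI_Comm_free(&group);
    }

private:
    MPI_Comm world_;
};
```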
@wadejong as I understand it (and others feel free to correct me if I'm wrong), TAMM will eventually support local SCF; at the moment there are some issues with dependent index spaces preventing us from going forward. The consensus of the last local SCF meeting was that the best course of action, for all involved, would be to get an actual local SCF implementation, even if that implementation uses Eigen. While the hope is that we'll be able to port @jboschen's and @keipertk's implementation over to TAMM before we need to show parallelization, the reality is we may need to start demonstrating parallel code before that can happen. If so, we're talking about parallelizing over a relatively decoupled set of tasks, so it shouldn't be too bad with a runtime that supports tasks.
Unfortunately, as I see it, whatever runtime is used by the SDE is going to be the one calling the shots from our perspective (it will of course need to support the use case where it's a subprocess, but let's ignore that for the moment). This means the runtime will potentially need to know about all of the resources available to it, not just processes. (I'm assuming the typical CPU model where each socket is an MPI process and threading occurs over the cores in that socket; I guess each GPU would be an MPI rank with threading across the GPU, but I really haven't given much thought to GPUs.) Hence I think we need to think broader than MPI alone; there's going to be some threading aspect too. I take this to mean we're going to have to give TAMM more than a process group, and that it's going to be somewhat hard to make this runtime lightweight since it's the driving force.
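To illustrate what "more than a process group" might look like, here is a purely illustrative sketch; `ResourceSet` is a hypothetical name, not an existing NWX/TAMM type.

```cpp
#include <cstddef>
#include <mpi.h>

// One way the SDE-level runtime could describe "all the resources
// available to it", rather than just a process group. Hypothetical.
struct ResourceSet {
    MPI_Comm comm;                 // process group (e.g., one rank per socket)
    std::size_t threads_per_rank;  // cores each rank may thread over
    std::size_t gpus_per_rank;     // accelerators visible to each rank
};

// Handing a sub-runtime (e.g., TAMM) a ResourceSet instead of a bare
// communicator keeps the driving runtime in charge of threads/GPUs too.
void hand_off_to_subruntime(const ResourceSet& rs);
```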
I think this is resolved with Taskforce being charged with implementing NWX's runtime.
**Is your feature request related to a problem? Please describe.**
Yes. We don't have a way to parallelize things outside of TAMM.
**Describe the solution you'd like**
Ultimately, I'd like a solution that doesn't require me to do any coding :slightly_smiling_face:. To this end it's worth noting that there are a bunch of runtimes out there, including some funded by ECP. That said, the real question is "do any of those runtimes suit us?". To answer that, it helps to have a better idea of what we want in a runtime, which is the main focus of this issue.
I personally have found runtimes like Intel's TBB to be very easy to use and capable of good performance. Obviously TBB isn't going to get us across Summit, though. For those not familiar with TBB, it basically gives you functions that loop over a series of lambdas; you can change the queue used for the loop (and possibly how the lambdas are scheduled), and TBB promotes asynchronous computing via continuations. I'd like to find/write something that uses similar coding concepts while supporting parallelization beyond threads. Historically, when I make such a suggestion @evaleev and @robertjharrison usually advocate for something higher-level than this, but I have to admit I'm not completely sure what this higher level looks like: just parallel objects like tensors and STL-like containers?
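For a flavor of the TBB style I mean, here is a toy example; the work items are contrived, but `tbb::parallel_for_each` is the real TBB call.

```cpp
#include <cstdio>
#include <functional>
#include <vector>
#include <tbb/parallel_for_each.h>

int main() {
    // A queue of work items expressed as lambdas.
    std::vector<std::function<void()>> work;
    for (int i = 0; i < 8; ++i)
        work.push_back([i] { std::printf("task %d\n", i); });

    // TBB schedules the lambdas across its internal thread pool.
    tbb::parallel_for_each(work.begin(), work.end(),
                           [](const std::function<void()>& f) { f(); });
    return 0;
}
```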
Finally, I'd be remiss if I didn't include @twindus's suggestion from the conversation that prompted this issue, namely a series of `get`, `put`, `gather`, and `scatter` functions that work with our own resource descriptions (as I understand it, an MPI-like API that runs over MPI as well as OpenMP and possibly other threading/distribution models). Of course such a layer would be immensely useful for writing a TBB-like layer that deals with multiple types of parallelism. A sketch of what such an API might look like follows.
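This is one possible reading of the suggestion (mine, not a spec); all names here are hypothetical.

```cpp
#include <cstddef>

// An MPI-like interface keyed on an opaque resource description, so the
// same calls could map onto MPI ranks, OpenMP threads, or other backends.
struct Resource;  // opaque handle: process group, thread team, GPU, ...

void put(Resource& r, const void* local, std::size_t n, int dest);
void get(Resource& r, void* local, std::size_t n, int src);
void gather(Resource& r, const void* send, void* recv, std::size_t n, int root);
void scatter(Resource& r, const void* send, void* recv, std::size_t n, int root);
```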
**Describe alternatives you've considered**
Waiting for the compiler to magically parallelize the code.
**Additional context**
This issue is necromancy on #1 and #4. Those PRs ultimately never got merged (and it's not clear to me that any of the code in them is worth revisiting). I'll leave it to the reader to read the discussions on those PRs (there's not too much). The overall punchline of those PRs was that the TAMM team was going to implement the runtime. That said, I am under the impression that this is no longer the case; rather, the TAMM team is going to implement TAMM's runtime and the rest of NWX needs to implement/adopt its own. If this is the case, it becomes essential that NWX's runtime play nicely with TAMM's, which is going to be challenging without a thorough discussion.