I'm not sure if I understand the question precisely, but:
Most of the anticipated workflows would require a single "task" or "command" (I've lost track of our terminology) to be able to execute in a parallel fashion on a single (fat) node. We can talk about whether that has to be single-node or multi-node, but the worker should be able to launch a process that is parallel. The "top level" of the worker I think can be serial i.e. not inherit a parallel environment (apart from the considerations we're talking about for communication amongst workers in SCALE-MS).
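For concreteness, a rough sketch of that shape, with a made-up executable and rank count (not prescribing any particular launcher):

```python
# A worker whose top level is serial but which launches a parallel (MPI) process.
import subprocess

def run_parallel_task():
    # Hypothetical command line; in practice the launcher, rank count, and
    # executable would come from the task description and allocated resources.
    result = subprocess.run(
        ["mpiexec", "-n", "8", "some_parallel_executable", "--input", "task.in"],
        check=True,
    )
    return result.returncode
```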
Does that sufficiently answer the question?
> Does that sufficiently answer the question?
I'm trying to establish whether we can and should assume that user / client Python code is effectively isolated from any parallelism in the execution environment, and that the user side workflow client<->dispatcher interface is effectively one-to-one.
Basically, you are saying that this is fine with you, because we are focused on work that is defined(/definable) entirely in a lower-level / run time construct.
That is what I figured. I just want to anticipate any show-stopping use cases or start thinking about how to migrate potential users who are heavily invested in "managing their own parallelism" or some other form of hierarchical work flow.
> I'm trying to establish whether we can and should assume that user / client Python code is effectively isolated from any parallelism in the execution environment, and that the user side workflow client<->dispatcher interface is effectively one-to-one.
I think I understand the question. So, there's the simulation code, and the analysis code. The simulation code will almost certainly be parallelized, requiring X nodes. For NOW, I think we can assume that per-node parallelization will be specified when the job is queued and the space for the simulation jobs is requested on the resource, but will not change once that job is launched. To start, it will probably be homogeneous over all jobs (with the workload manager changing the number of homogeneous jobs), but that may change. Still, we can assume that the parallelization is set at the time the job is created and queued.
As to how parallelization is requested: for simulation code, I think we have to accept however the program parallelizes for now; that's not something we're going to re-engineer. For OpenMM, the MPI is through mpi4py. Usually, though, it will be OpenMP or MPI run at the executable level.
For the analysis code, we almost certainly need to parallelize it, since there will be large amounts of data and we don't want a long lag before the next set of simulations is launched, and it will usually be in Python (maybe Python wrapped around some computational core interfaced through a Python API). The question is whether the right parallelization is an ensemble of non-parallelized tasks, or tasks that themselves run MPI. I do think that for now the behavior of the analysis code can be dictated by scalems if there is a clearly better way to do things. For the medium term, there is going to have to be a decent amount of recoding to run a scalems script, so asking users (mostly us for a bit) to use a standard way of doing parallelization will be appropriate.
But probably we should treat analysis code the same as simulation code, in that it has certain parallelization characteristics that are specified at analysis-job creation time. Whether that means analysis has to be run as scripts rather than function calls if it uses mpi4py is a good question.
Let me know if I'm responding to the right question.
> Let me know if I'm responding to the right question.
Yes. Thanks.
Regarding mpi4py: installed packages relying on mpi4py are fairly straightforward, I think (or at least it is already a well-characterized, well-explored, high-priority target use case). User-provided analysis code using mpi4py should be almost as straightforward when confined to separate module scripts (as we discussed w.r.t. the example scripts), which we can stage as data dependencies.
The user's `__main__` script (or interactive session) requires yet a third approach if it is to define local functions that can be used in the work flow. We should probably just assume this will not be supported in the initial beta, and that it will always be subject to some constraints (on things like using mpi4py in such functions), because we are realistically limited by Python's ability to serialize functions. (Ref: pickle and cloudpickle)
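As a concrete illustration of that constraint (assuming the third-party `cloudpickle` package is available; the function here is just a stand-in for user workflow code):

```python
import pickle
import cloudpickle  # third-party; pip install cloudpickle

def make_filter(threshold):
    # A locally defined closure, like something a user might write in __main__.
    def keep_large(values):
        return [v for v in values if v > threshold]
    return keep_large

task_func = make_filter(0.5)

# The standard library pickle cannot serialize a function defined inside another function.
try:
    pickle.dumps(task_func)
except (AttributeError, pickle.PicklingError) as err:
    print("pickle failed:", err)

# cloudpickle serializes the function (and its closure) by value, so it could be
# shipped to a remote worker and deserialized there.
payload = cloudpickle.dumps(task_func)
restored = cloudpickle.loads(payload)
print(restored([0.2, 0.7, 0.9]))  # [0.7, 0.9]
```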
Responding to Michael (and Eric clarify if I got this wrong):
Eric's idea is that there can be parallelism on the worker and parallelism at the ensemble level. But his proposal is that those levels of parallelism don't interact (i.e. the worker might launch MPI but it doesn't get an MPI sub-communicator from the ensemble).
Correct, Eric? Make sense, Michael?
(Answering purely from the RCT perspective)
> The client code is assumed to be a single Python interpreter. (Is this correct?)
I am not sure what exactly 'client code' would be here. An application which uses the RCT Python modules would indeed be expected to be in Python - but while we have no specific capabilities to support multi-process client applications, it might in some cases make sense to have the client side running with multiple interpreters? Not sure, depends on the application I guess.
If 'client code' refers to the tasks we are running on the compute nodes, those are expected to be executables (Python, or binary, or anything we can put on a command line, really) or function calls which are then invoked in the context of an RCT worker process (though in an isolated process, i.e., after a fork()).
Not sure if I missed the point?
> Components can (but generally don't) inspect various RP component descriptions to learn more about the allocated resources at run time.
That is correct - we expect tasks to express resource requirements and then consider it the duty of RCT to ensure that tasks are executed on resources which match those requirements.
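Purely as an illustration of "tasks express resource requirements" (this is not actual RCT/RP syntax; the field names are hypothetical):

```python
# Illustrative only: the task declares what it needs at creation time and leaves
# placement and launching to the execution layer.
analysis_task = {
    "executable": "python3",
    "arguments": ["-m", "my_analysis_module"],  # hypothetical user module
    "ranks": 16,             # MPI ranks requested for this task
    "cores_per_rank": 2,     # e.g. OpenMP threads per rank
    "pre_exec": ["module load openmpi"],        # environment setup, if needed
}
```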
> scripts as standalone processes, such as from a login node or workstation. Is this consistent with everyone's vision and understanding?
This is indeed our default usage mode: the driving application, i.e., the application which is invoked by the user, does indeed run on the login node (or on a laptop remote from the HPC machine, if passwordless ssh access is possible), and resource acquisition, task mapping to resources, etc. happen behind the scenes.
> For users who are accustomed to managing parallelism within their Python scripts, such as via the Python concurrency modules or mpi4py, we can probably find some combination of documentation, package utilities, and code structure guidelines to minimize dissonance
You may also want to have a look at PARSL, which offers a very interesting way to express parallelism and flow control via Python while maintaining the separation of task execution on the HPC resource. As far as language integration goes, it is the best approach I know of, by a fair margin. They build on the use of futures and function decorators.
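For reference, the futures-and-decorators pattern looks roughly like this (based on Parsl's documented `python_app` decorator; exact imports and configuration may differ between versions):

```python
import parsl
from parsl import python_app
from parsl.configs.local_threads import config  # simple local test configuration

parsl.load(config)

@python_app
def square(x):
    return x * x

future = square(3)       # returns immediately with a Future
print(future.result())   # blocks until the task has run -> 9
```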
FWIW, we intend to support a mode where parts of a client application run on the login node (or somewhere which is not in the job allocation), but where other parts run as tasks on the compute nodes, and those parts could then also interact with the RCT stack, submitting new tasks etc. The task overlay prototype you looked at is a step in that direction. Having said that, I expect that the clear distinction between the client side as driving application and the HPC side as tasks serving the client application will remain the dominant case, as it is conceptually much simpler and seems to cover most use cases (that we are aware of).
> there can be parallelism on the worker and parallelism at the ensemble level
Correct, but my purpose in the current issue is to go further: I want to clarify that we are not trying to preserve use cases that involve coupling between these and higher levels.
The main thing I want to clarify is that we are not expecting users to invoke scripts as `mpiexec -n $NUM_RANKS python -m mpi4py my_scalems_script.py`.
In other words, "is the near term scalems usage consistent with the RCT usage (as clarified by @andre-merzky above)?" I think the answer is "yes."
The high-level API for PARSL and gmxapi convergently evolved eerily similar paradigms. They do more to try to serialize user-side functions, but less to allow for typing or decomposable ensemble dimensions. I am told that run time task interaction (ensemble coupling) is not on their road map, nor is any sort of run time side API (CPI) to allow application developers to optimize code in terms of the execution framework. I think that PARSL also dispatches work immediately (all Futures are assumed to be scheduled), without much opportunity for the client to express scoping or grouping, so there are lost opportunities to structure the work load or to shape the run time parallelism in the context of the larger work flow or available resources.
It is conceivable that SCALEMS syntax could be close enough to interoperate with PARSL in various ways, and I'm interested in investigating that, particularly if we like the idea of using cloudpickle to serialize user-defined local functions. Is there currently a bridge between PARSL and RCT?
> Is there currently a bridge between PARSL and RCT?
Not yet, but we are engaging and plan to integrate. The most likely route to integration is to use RCT as execution backend for PARSL.
> Is there currently a bridge between PARSL and RCT?
>
> Not yet, but we are engaging and plan to integrate. The most likely route to integration is to use RCT as execution backend for PARSL.
That's what I figured. It will have the same venv staging issue as SCALE-MS, but also implies the cloudpickle local-function serialization/deserialization. Otherwise PARSL interaction should be equivalent to a subset of SCALE-MS use cases, which should help clarify the component boundaries.
Hmm... cloudpickle is just ~1000 lines of BSD-licensed code that look pretty much the same as the gmxapi patch I never finished testing. Maybe we should just embrace and embed it now.
As I understand it, the RCT stack assumes that all multiprocessing management is handled by the Pilot framework, including acquiring resources, scheduling HPC jobs through queuing systems, spawning processes, and/or (if appropriate) initializing / finalizing MPI contexts. The client code is assumed to be a single Python interpreter. (Is this correct?) (Components can (but generally don't) inspect various RP component descriptions to learn more about the allocated resources at run time.)
This is worth clarifying because it is notably different from `gmxapi` client code, which allows for the detection of an MPI environment, assuming that the client script may be invoked with `mpiexec` (or similar) as (part of) an HPC job submission. Several outstanding design issues would be resolved by deferring multiprocess resource acquisition to a point after the client script begins execution.

We should be as robust as possible to accidental simultaneous invocations of multiple client processes (e.g. a user attempts to launch their workflow with `mpiexec -n2 python my_scalems_script.py`), but maybe we can assert that the principal supported usage is to invoke scripts as standalone processes, such as from a login node or workstation. Is this consistent with everyone's vision and understanding?
I think it could be confusing or inconvenient (particularly for users migrating from other modes of work management) to rely on config files or script contents to specify resources at run time (such as mpiexec options), but that could be resolved satisfactorily and easily enough by including a command line option parser that is applied by the package internally as appropriate.
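A hedged sketch of what such a package-applied option parser could look like (the option names and the idea of a `scalems` entry point here are hypothetical, not an existing interface):

```python
# Illustrative only: resource options parsed by the package itself, so the user
# invokes a plain Python process instead of wrapping the script in mpiexec.
import argparse

def parse_resource_args(argv=None):
    parser = argparse.ArgumentParser("scalems")
    parser.add_argument("--nodes", type=int, default=1,
                        help="Number of nodes to request from the execution backend.")
    parser.add_argument("--ranks-per-node", type=int, default=1,
                        help="MPI ranks to launch per node for parallel tasks.")
    parser.add_argument("script", help="User workflow script to execute.")
    return parser.parse_args(argv)

# e.g. `python -m scalems --nodes 4 my_workflow.py` (hypothetical invocation)
args = parse_resource_args(["--nodes", "4", "my_workflow.py"])
print(args.nodes, args.script)
```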
For users who are accustomed to managing parallelism within their Python scripts, such as via the Python concurrency modules or `mpi4py`, we can probably find some combination of documentation, package utilities, and code structure guidelines to minimize dissonance. For instance, `mpi4py` could be imported within a function scope and operate on a communicator acquired through the `scalems` interface.
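A minimal sketch of that pattern (the `communicator` argument stands in for whatever handle a future `scalems` interface might provide; no existing `scalems` API is implied):

```python
def reduce_energies(local_energies, communicator=None):
    """Sum per-rank energy contributions across the task's communicator."""
    # mpi4py is imported inside the function, so merely importing the user's
    # workflow module does not pull MPI into the client process.
    from mpi4py import MPI
    comm = communicator if communicator is not None else MPI.COMM_WORLD
    return comm.allreduce(sum(local_energies), op=MPI.SUM)
```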
Can anyone think of important cases in which parallelism needs to be managed at the level of the client script (e.g. by launching it with `mpiexec`) rather than deferring to the execution dispatching layer? (Or cases in which an MPI context must already exist when the `scalems` package is `import`ed?)

**Issue resolution**
If no one can think of any concerns, I expect I can resolve this issue soon with a few sentences in the wiki or docs, and some scattered annotations in the package sources.