dask / dask-mpi

Deploy Dask using MPI4Py
BSD 3-Clause "New" or "Revised" License

Run Scheduler and Client from the same MPI Rank #29

Open guillaumeeb opened 5 years ago

guillaumeeb commented 5 years ago

I believe that in some simple cases we don't need both a rank dedicated to the Scheduler and a rank dedicated to the Client. We should provide a way to start both within the same MPI rank in https://github.com/dask/dask-mpi/blob/master/dask_mpi/core.py#L62-L71.

Thoughts?

mrocklin commented 5 years ago

Seems fine to me.

kmpaul commented 5 years ago

@guillaumeeb Sorry for the delay!

I think this is entirely true. And it's an excellent idea! In fact, I would like to go a little further than this. I would like to make the launch algorithm work for any size MPI communicator, not just > 2!
I am considering an optional parameter like separate_scheduler_rank to place the scheduler on a rank different from the client rank.

Here's what I'm considering (a rough sketch of this placement logic follows the list):

  1. If the MPI communicator size is 1, then place the scheduler and a single worker on the same rank as the Client (rank 0). This is the "LocalCluster" option. The separate_scheduler_rank option is ignored.
  2. If the MPI communicator size is 2, then place the scheduler on the same rank as the client (rank 0) and start a worker on the other rank (rank 1). Again, the separate_scheduler_rank option is ignored.
  3. If the MPI communicator size is 3 or more, then place the client on rank 0, the scheduler on rank 0 or rank 1 (depending on separate_scheduler_rank), and the workers on all of the other ranks.
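Roughly, that placement logic might look like the following (a sketch only; the helper name and the returned tuple are illustrative, and separate_scheduler_rank is the proposed option above):

def assign_roles(comm_size, separate_scheduler_rank=True):
    # Returns (client_rank, scheduler_rank, worker_ranks) under the proposed scheme.
    if comm_size == 1:
        # "LocalCluster" case: client, scheduler, and one worker all share rank 0.
        return 0, 0, [0]
    if comm_size == 2:
        # Client and scheduler share rank 0; the single worker gets rank 1.
        return 0, 0, [1]
    # Three or more ranks: client on rank 0, scheduler on rank 0 or 1,
    # workers on every remaining rank.
    scheduler_rank = 1 if separate_scheduler_rank else 0
    worker_ranks = [r for r in range(comm_size) if r not in (0, scheduler_rank)]
    return 0, scheduler_rank, worker_ranks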

Is this reasonable?

guillaumeeb commented 5 years ago

If you're thinking about use cases where the MPI communicator size would be less than 3, then yes. I assumed there wasn't really such a use case... but maybe for algorithm testing or debugging purposes?

Otherwise I think the idea of an optional parameter is perfectly fine!

betatim commented 1 year ago

I'd like to reboot this issue :D

Is there a reason not to place a worker on each MPI rank? For example, if the MPI size is 1: place the worker, scheduler, and client on rank 0. If the MPI size is >= 2: scheduler on rank 0, client on rank 1, and workers on all ranks.

If that isn't a fundamentally bad idea, I'd propose adding a flag (maybe called workers_on_all_ranks=True?) that leads to this new behaviour, with a default of False so that nothing changes for current users.

kmpaul commented 1 year ago

Thanks for rebooting this, @betatim!

I have some cycles to devote to this today, so the timing is great!

I've been thinking about how to give the user more explicit control over the placement of the client, scheduler, and workers within the MPI cluster. To aid with this, I've been thinking about adding CLI options to choose the rank for the client and the rank for the scheduler.

They could be the same rank, and they could be different from 0/1, but they would have sensible defaults (chosen for backwards compatibility).

I think the client and scheduler are the easy bits, though, because there is only ever one of each of them.

The workers are the harder bits. I think the default behavior (again, backward compatibility) is for workers to "fill" the remaining MPI cluster, one worker per MPI rank. But it's the idea of more than one client/scheduler/worker per MPI process that we are discussing, I think. And if we allow the client and scheduler to run on the same MPI rank, then why not multiple workers?

In terms of implementation, most MPI implementations that I am aware of will not allow multiple processes to be associated with the same MPI rank (and things like forking/spawning new processes from an MPI process can actually be blocked). I do not know if some MPI implementations will allow it, or if it is safe even if possible, but regardless, the general approach to MPI is that each rank should (ideally) be a single process. That means that to get multiple client/scheduler/workers to run on a single MPI process will require running them asynchronously in a single process, which is very doable from a "I know how to code that up" perspective... but I don't know if it will be a stable solution.

Does anyone listening in on this conversation have thoughts on this? In this time zone, maybe @guillaumeeb @jacobtomlinson?

Of course, I don't have a problem with implementing something that gives the user the ability to create an unstable instantiation of a Dask-MPI cluster if it also gives the user more flexibility. However, I don't want to create general instability for other users who aren't aware of the complexity and subtlety.

kmpaul commented 1 year ago

In terms of a user CLI to allow complete control over worker placement, the complete solution would involve asking the user to specify a list of one int per MPI rank, with the value of the int indicating the number of workers to launch for the corresponding rank.

That is, something like:

workers_per_rank = [0,0,1,1,1,2]

would describe launching 0 workers on ranks 0 and 1, 1 worker on ranks 2, 3, and 4, and 2 workers on rank 5. ...But for large clusters, this is untenable.

One possible solution would be to assume that the last int in the list is "repeating" for all additional ranks to the size of the MPI cluster. For example, one might specify:

workers_per_rank = [0, 0, 1]

which would imply not launching workers on ranks 0 and 1 and launching one worker on each rank 2 and higher. That's the default behavior right now.
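For concreteness, the "repeat the last entry" expansion might look like this (a sketch only; workers_per_rank is the proposed option above and the function name is made up):

def expand_workers_per_rank(workers_per_rank, comm_size):
    # Pad the list with copies of its last entry so it covers every MPI rank,
    # then trim it to the communicator size.
    last = workers_per_rank[-1]
    padded = workers_per_rank + [last] * max(0, comm_size - len(workers_per_rank))
    return padded[:comm_size]

# expand_workers_per_rank([0, 0, 1], 6) -> [0, 0, 1, 1, 1, 1]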

kmpaul commented 1 year ago

So, currently Dask-MPI in the stand-alone script mode only works if you have an MPI comm size of 3 or larger. This proposed change would make it possible to run Dask-MPI with MPI comm sizes 1 or 2, for which we would need to define sensible defaults. Here are my suggestions:

MPI comm size 1:

There is only 1 possible choice here, so it doesn't really need to be said. However, just to be pedantic: scheduler on rank 0, client on rank 0, 1 worker on rank 0.

MPI comm size 2:

I think the default arrangement should be client on rank 0, scheduler on rank 0, and worker on rank 1. But I'm open for comment here.

MPI comm sizes 3+:

The normal defaults should exist with scheduler on rank 0, client on rank 1, and workers on all other ranks.

betatim commented 1 year ago

My desire to have a worker on the ranks 0 and 1 is that I don't know enough about the scheduler/job submission system I am using to request different types of nodes per rank. So I end up with a gigantic machine with eight GPUs for each rank. Given that I have to queue a while to get the allocation I'd like to use all of it :D Of course an alternative solution would be to figure out if it is possible to receive a smaller node for rank 0 and 1 (a bit more like you'd do in the cloud).

This means for me it is all about "workers on all the ranks". I care less/know nothing about how the scheduler and client are distributed amongst the available ranks.

I also don't use the CLI but call initialize() within my code. But I assume if a CLI option is added there will also be a way to do the same thing via initialize.

I hadn't considered the restrictions on starting new processes and such (can you tell I am an MPI newbie?). I had naively assumed that we'd just add more stuff to the event loop and that would be that. I am +1 on making it hard for people to configure things in a way that makes running unstable. My assumption is that most people "have no clue", so like me they'd just see a wasted node, see that they can configure it to do work, and do it. And then have a miserable life because sometimes things work and sometimes they don't. I think it would be better not to give people that rope to hang themselves with, and instead write a short bit of documentation explaining why you can't configure a particular setup.

For "why not more than one worker": I learnt that the word "worker" is overloaded. For example when I instantiate one CUDAWorker I will end up with as many workers (in the sense of entries in the scheduler information) as there are GPUs in the system. From that point of view I think we don't need more than one worker (aka CUDAWorker instance) per rank. From what you said above about "more than one process per rank is often not allowed" it also sounds like in the CPU setting you'd only want one worker per rank?

kmpaul commented 1 year ago

My desire to have a worker on the ranks 0 and 1 is that I don't know enough about the scheduler/job submission system I am using to request different types of nodes per rank. So I end up with a gigantic machine with eight GPUs for each rank. Given that I have to queue a while to get the allocation I'd like to use all of it :D Of course an alternative solution would be to figure out if it is possible to receive a smaller node for rank 0 and 1 (a bit more like you'd do in the cloud).

Understood! Then I would lean towards making the "run workers everywhere" option easy to implement, at least.

I also don't use the CLI but call initialize() within my code. But I assume if a CLI option is added there will also be a way to do the same thing via initialize.

Yes. I think the CLI option is actually the easy part. If we find a solution that works for the initialize() workflow, then the CLI implications will follow naturally.

I hadn't considered the restrictions on starting new processes and such (can you tell I am an MPI newbie?). I had naively assumed that we'd just add more stuff to the event loop and that would be that. I am +1 on making it hard for people to configure things in a way that makes running unstable. My assumption is that most people "have no clue", so like me they'd just see a wasted node, see that they can configure it to do work, and do it. And then have a miserable life because sometimes things work and sometimes they don't. I think it would be better not to give people that rope to hang themselves with, and instead write a short bit of documentation explaining why you can't configure a particular setup.

Yes. I think that being explicit here would benefit everyone. Then, at least, people will have a better idea of what to expect beforehand.

That said, MPI is always confusing for many people. Dask-MPI was created out of a simple abstraction that attempted to make launching a dask cluster in an MPI environment easy, but I think that just opened the trail to a steep climb for many people. Once you're in the door, MPI can be a labyrinth! (And that's not even going into the complications created by each implementation of MPI being a little different from each other...and from the MPI standard itself!)

kmpaul commented 1 year ago

For "why not more than one worker": I learnt that the word "worker" is overloaded. For example when I instantiate one CUDAWorker I will end up with as many workers (in the sense of entries in the scheduler information) as there are GPUs in the system. From that point of view I think we don't need more than one worker (aka CUDAWorker instance) per rank. From what you said above about "more than one process per rank is often not allowed" it also sounds like in the CPU setting you'd only want one worker per rank?

Yes. You've hit on exactly the issue, I think. That is why the initial implementations of Dask-MPI have always worked in "units of 1". And I think (because of this overloaded "worker" word) it is safe to continue assuming only 1 worker per rank.

If that were the assumption, then I would modify my CLI/parameter suggestion above to replace the workers_per_rank option with a workers_skip_ranks option like so:

workers_skip_ranks = [0, 1, 5]

which would imply that workers would not be launched on ranks 0, 1 and 5.

Changed my mind. I think it actually makes sense to just adopt your notion of whether to allow a worker on the same rank as the client/scheduler. This is the notion of worker exclusivity, i.e., whether a worker has exclusive access to an MPI rank. Would it be sufficient to just have a boolean exclusive_workers option? If exclusive_workers=False, then workers would be launched on the same ranks as the client and scheduler. If exclusive_workers=True, then the default behavior of "no workers on client and scheduler ranks" would apply.
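The per-rank decision under that option might look roughly like this (a sketch only; all of the parameter names are the proposed options from this thread, not an existing API):

def launch_worker_on(rank, client_rank, scheduler_rank, exclusive_workers=True):
    # With exclusive_workers=True (the backwards-compatible default), the ranks
    # hosting the client or the scheduler do not also get a worker.
    if exclusive_workers and rank in (client_rank, scheduler_rank):
        return False
    return True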

jacobtomlinson commented 1 year ago

I think this makes a load of sense. Especially on large nodes.

Thinking purely from a GPU perspective there are a bunch of different configurations I would love to have (assuming the rank 0 node has 8 GPUs):

I'm also playing around with https://github.com/jacobtomlinson/dask-hpc-runner which is based on https://github.com/dask/distributed/pull/4710 and https://github.com/dask/dask-mpi/pull/69. My main goal here is to make the dask-mpi deployment paradigm (populate an existing allocation with a Dask cluster) easier to do in non-MPI scenarios.

My motivating use case is SLURM where we have env vars to get the ranks and a shared filesystem to communicate the scheduler address. It would be nice to leave MPI out of things altogether because there can be consequences to importing mpi4py especially in GPU workloads.

There are other places like Databricks where it would also be useful to have this paradigm for side-loading a Dask cluster onto a Spark cluster.

kmpaul commented 1 year ago

@jacobtomlinson: Oooh! Excited to hear about progress on the dask-hpc-runner. Been interested in that for a while. And I agree that the use cases are compelling.

Thanks for chiming in on these suggestions. It sounds like the 3 bulleted configurations you described would be satisfied by the options I suggested above. I'm working on a PR now.

betatim commented 1 year ago

Good point about assigning one of the GPUs (each) to the scheduler and client.

kmpaul commented 1 year ago

Ok. I've been playing around with this today and there are solutions, but I don't think they are as pretty as anyone was hoping. So, I'm sharing my experiences in the hopes that someone might have better ideas than I have.

First, it's easy to launch a worker on the same MPI rank as the scheduler. The scheduler and worker are both asynchronous tasks, and therefore they can both be run on the same asynchronous event loop, which is started by running the loop (e.g., with loop.run_until_complete()).

However, the complication comes from the fact that the client code is synchronous. And launching an asynchronous task inside synchronous code can only be done by starting an event loop with either loop.run_until_complete() or loop.run_forever(), both of which block until the asynchronous tasks are completed or canceled. If we were to try to launch the scheduler in the same process as the client, the run_until_complete() call would block the execution of the client code indefinitely, because the shutdown signal would never be sent (it is sent by the client code, which is blocked).

The only way to resolve this issue is for the client process to either: (1) start an event loop in a separate thread, or (2) start an event loop in a separate process (i.e., fork or spawn). Option 2 violates the one-process-per-MPI-rank rule that I mentioned above, so I think we should exclude it from consideration. Option 1, however, means starting a thread in the client code, running the event loop in that separate thread, continuing execution of the main thread (i.e., the client code), and then shutting down the thread when the event loop is complete.

Option 1 is technically doable, but I feel like it is messy, and running processes on MPI ranks that randomly grab extra threads without being explicit about it seems like a way of creating problems for users later on. ...But it's possible.
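For concreteness, option 1 might look roughly like this (a minimal sketch using only the standard library; run_scheduler() stands in for the elided Dask startup coroutine):

import asyncio
import threading

# Start an event loop in a background (daemon) thread.
loop = asyncio.new_event_loop()
loop_thread = threading.Thread(target=loop.run_forever, daemon=True)
loop_thread.start()

# The scheduler/worker coroutines would be submitted to that loop from the main
# thread; run_coroutine_threadsafe returns a concurrent.futures.Future.
# future = asyncio.run_coroutine_threadsafe(run_scheduler(), loop)

# ... the synchronous client code keeps running here in the main thread ...

# When the client code is done, stop the loop and join the thread.
loop.call_soon_threadsafe(loop.stop)
loop_thread.join()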

What do people think about this?

jacobtomlinson commented 1 year ago

This is a fun rabbit hole. I spent a ton of time thinking about async-in-sync, sync-in-async, sync-in-async-in-sync, etc recently.

To clarify, the client code can be either sync or async, but we have no control over how the user decides to implement things, so we have to support both.

If we are in a situation where we want to start more than one component per rank it's probably because each node is big and a single rank refers to a substantial machine. Do you think it's fair to say in these cases that the "one process per rank" rule will not be enforced? It seems bonkers for an HPC admin to enforce a single process on a machine with 8 GPUs. How on earth do they expect to saturate that hardware?

kmpaul commented 1 year ago

I don't think it's true that one MPI rank typically equates to one machine/node. In mixed threading/processing (hybrid) jobs, one MPI rank typically equates to one machine, with the assumption that the cores on the machine will be saturated with the appropriate number of threads. In plain CPU multiprocessing jobs, multiple MPI ranks are assigned to a single machine, typically one MPI rank per core. Obviously, it depends on how the scheduler is configured, but all of my HPC experience has followed that rule.

kmpaul commented 1 year ago

To be clear about this: MPI doesn't decide on its own where to place its ranks, but the user does not usually need to tell it explicitly, either. Placement can be specified with something called a machine file (or hostfile), which lists the names of the machines on which to launch MPI ranks. Such a machine file can look like this:

node001:16
node002:16
node003:16
node004:16

which would tell MPI to launch 16 ranks on node001, 16 ranks on node002, etc.

However, in practice, this is almost never done because this process is handled by the HPC resource manager, like Slurm, LSF, PBS, etc. Depending on what you request from the resource manager, it will essentially construct a machinesfile to place multiple MPI ranks across your hardware. Different queue policies can be used to enforce things like "nodes need to be full" (i.e., 1 MPI rank per core) or "nodes can be shared across multiple jobs" or similar.

Like I mentioned above, some MPI implementations will actually crash if you try to spawn or fork a process from the process assigned to an MPI rank. While some MPI implementations may allow it, it is generally considered best practice with MPI to assume 1 process per MPI rank.

And everything I've just described suggests the dilemma with using threads. The HPC user expects MPI ranks to be placed in a way that uses the entire requested resource (i.e., "fills the nodes"). This gets hard when some processes are creating threads that the user didn't know about when originally requesting the resources.

But you make an excellent point, @jacobtomlinson, about the fact that the client code can be sync or async. This problem goes away entirely if the client code is async, because then the client code could be run on the same async event loop as the scheduler and/or worker. However, to make it possible to run async client code, it seems to me that the initialize() methodology needs to change. Currently, its mode of operation is via a synchronous "pass-through function" like so:

from dask_mpi import initialize
initialize(...)  # <-- Start scheduler and workers for MPI ranks other than the client MPI rank

# Client code starts here...

where the client code is executed only on the client MPI rank because the worker and scheduler MPI ranks block due to their running event loops. But if the client MPI rank is the same as the scheduler MPI rank, then the scheduler's event loop will block and prevent the execution of the client code. To make it possible to run this asynchronously, one would need to change this to something like:

from dask_mpi import initialize

async def client_code():
    # Client code starts here

initialize(..., coroutine=client_code())

where the initialize method takes the client_code() coroutine and runs it on the same event loop used for the scheduler and/or worker...and only on the client MPI rank.
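A very rough sketch of what that could look like on a rank that hosts both the client coroutine and the scheduler (the coroutine= keyword and this helper are hypothetical, not the current dask_mpi API):

import asyncio
from distributed import Scheduler

async def _run_client_and_scheduler(client_coro):
    # Scheduler and client code share one event loop on the same MPI rank.
    async with Scheduler() as scheduler:
        # The user's coroutine runs while the scheduler serves in the background;
        # leaving the async-with block shuts the scheduler down afterwards.
        await client_coro

def initialize_async(client_coro):
    # What initialize(..., coroutine=...) might do on the client/scheduler rank.
    asyncio.run(_run_client_and_scheduler(client_coro))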

I don't want to spend my whole day just writing up my thoughts/notes, though. So, let me investigate a bit about how one might change the Dask-MPI API to make this kind of operation possible.

jacobtomlinson commented 1 year ago

Thanks for the info, I've only ever used machines where node==rank so this is good to know and definitely affects the design here.

I'm very keen to move away from the initialize function because it is all sorts of magic and is often a point of confusion for users. In dask-hpc-runner I'm playing around with using context managers instead, which still feels like magic, but less so. However, that does open you up to async with a lot more easily.

My big concern about letting user code and Dask code share an event loop like this is that if a user accidentally does any blocking code they will very likely run into deadlocks that will be painful to debug. I think it would always be safer to start a new loop on a separate thread and run the Dask components there to avoid them getting deadlocked by user code on the main event loop.

kmpaul commented 1 year ago

@jacobtomlinson: Thanks for the reply! I agree that there is enough "magic" going on here that I don't want to charge forward with anything that makes things worse.

You are correct about the accidentally blocking user code scenario. I hadn't really thought about that, but it is a problem. And it would be a nightmare to diagnose. I think you are correct that always executing the Dask code in its own event loop is the safest bet. I'll put a crude template in place in the form of a PR and then we can discuss further.

betatim commented 1 year ago

One thought based on a conversation with @jacobtomlinson: maybe the solution isn't to try and run scheduler and client on the same rank but instead to run more than one rank per node. This would solve my problem of "wasting" a whole node on the scheduler and one more on the client. It might be differently complicated to achieve this, but wanted to mention it.

I think for the queue system/cluster I am using I can set --ntasks-per-node, which I assume influences the number of MPI ranks per node?!

kmpaul commented 1 year ago

Ok. I'm working on a solution to this, but I think it needs to change the fundamental way that the dask_mpi.initialize() function works. In fact, it's so large of a change, I think it needs to be a completely new function, maybe called dask_mpi.execute(). The idea would be this:

  1. The client code would need to be "wrapped" in a function. It could be an asynchronous function, but it would still need to be wrapped in a function regardless.

  2. The dask_mpi.execute(...) function would take the client function as input (or the coroutine) and execute it on the specified client_rank with all of the additional options (e.g., scheduler_rank, exclusive_workers, ...).

To achieve the current Dask-MPI behavior, where workers, scheduler, and client are all in separate MPI ranks, the execute() function might look like:

import asyncio

import mpi4py.MPI
from distributed import Client, Scheduler, Worker


def execute(func, *args, **kwargs):
    comm = mpi4py.MPI.COMM_WORLD
    this_rank = comm.Get_rank()

    if this_rank == 0:
        # Scheduler rank: run the scheduler until the client shuts it down.
        async def run_scheduler():
            async with Scheduler(...) as scheduler:
                comm.Barrier()
                await scheduler.finished()

        asyncio.run(run_scheduler())

    elif this_rank == 1:
        # Client rank: wait for the scheduler to start, then run the client code.
        comm.Barrier()

        ret = func(*args, **kwargs)

        # Shut the cluster down once the client code has finished.
        with Client() as client:
            client.shutdown()

        return ret

    else:
        # Worker ranks: wait for the scheduler to start, then run a worker.
        comm.Barrier()

        async def run_worker():
            async with Worker(...) as worker:
                await worker.finished()

        asyncio.run(run_worker())

There would be an obvious modification for the case where func is a coroutine object, so that func can be run asynchronously.

And the modifications described above to make it more customizable and to make MPI rank placement easier would be straightforward to implement, too.
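For illustration, a user script under this model might look something like the following (hypothetical usage of the proposed execute(); it assumes execute() makes the scheduler address discoverable to Client(), as initialize() does today):

# my_script.py -- run with, e.g., mpirun -np 4 python my_script.py
from distributed import Client
from dask_mpi import execute  # proposed function, not part of the current API

def inc(x):
    return x + 1

def my_client_code():
    # Connects to the scheduler started by execute() on another rank (assuming
    # the scheduler address is published, e.g. via Dask config or a scheduler file).
    with Client() as client:
        futures = client.map(inc, range(100))
        print(sum(client.gather(futures)))

execute(my_client_code)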

I particularly like this approach for a number of reasons:

  1. It is much more explicit, and it removes a lot of the previous Dask-MPI "magic" which, as @jacobtomlinson has already pointed out, makes it hard for new users to use Dask-MPI and diagnose problems with Dask-MPI when they have them.

  2. The dask_mpi.execute() function is essentially a decorator, which is an appropriate paradigm to fit the function of Dask-MPI (as opposed to a context manager).

  3. It makes it easy to run the client code in a thread to prevent collision with a scheduler or worker event loop.

What do people think?

jacobtomlinson commented 1 year ago

Using a decorator is an interesting idea. I'm not sure the code above would behave quite right, though: a decorator is expected to return a function, not the return value of the function. Everything in that example would be called at the time of decoration, not when the user's function is called.

I wonder if the user's function should expect to take the client as an argument too?

kmpaul commented 1 year ago

Sorry for the delay.

Yeah. It's not a decorator yet. But it could be easily transformed into one.

As for taking the client in as input, I would assume that is the user's choice. The user could pass it in or create it inside the custom function.
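For what it's worth, a decorator form might look roughly like this (a sketch only; it assumes the execute() function from the sketch above is in scope and glosses over the placement options):

import functools

def mpi_execute(func):
    # Hypothetical decorator built on the execute() sketch above: calling the
    # decorated function runs the Dask-MPI launch logic on every rank and
    # returns the client code's result on the client rank.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return execute(func, *args, **kwargs)
    return wrapper

@mpi_execute
def my_client_code():
    ...  # client-side Dask code goes here

if __name__ == "__main__":
    my_client_code()  # run with, e.g., mpirun -np 4 python this_script.py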

kmpaul commented 1 year ago

Ok. I'm looking at this a bit today, again. First, I have created a mock-up of the solution I described above in https://github.com/dask/dask-mpi/pull/110. I'd appreciate any comments on that solution in a review, if you have time.

In #110, the user's "client" code is wrapped in a function, and execute() runs that function in its own thread on the selected MPI rank (client_rank). Because the client code executes in its own thread, all of the rest of the Dask-MPI functionality can be executed "normally", without interference. The user can customize the placement of the scheduler with scheduler_rank, and the user can force placement of workers on all MPI ranks (regardless of where the client and scheduler are located) with the exclusive_workers=False setting.

kmpaul commented 1 year ago

Now that I've had some experience playing around implementing #110, I've discovered a new way of accomplishing the same things. And I like this approach even more.

The original intent behind the initialize() function was to make it so that a user could run a client script (inside which Dask code exists) with a Dask cluster created over the MPI ranks. Originally, the idea was that each component--scheduler, client code, and workers--would run on its own MPI rank. With this issue, however, we have started discussing overlapping components and running multiple components in the same MPI rank. Since the Dask components (worker and scheduler) are asynchronous, this is possible in the same process. And since the client code can always be run in a separate thread, it is possible to run all three components in the same process, too. However, doing this inside the initialize() function, as it was structured, was not possible.

Now, with #110, we can do all of this with the execute() function (see above), but it required changing the way Dask-MPI operates. Namely, you have to submit a function to execute().

Now, this brings me to something new. If we exploit Python's built-in runpy.run_path function, we can actually go back to the idea of just "running a script" with a Dask cluster built over the MPI ranks. Basically, what this would mean is going from:

def my_client_code():
    # define client code here

dask_mpi.execute(my_client_code, **execute_kwargs)

to

import runpy

dask_mpi.execute(runpy.run_path, my_client_script, **execute_kwargs)

where my_client_script is a stand-alone script containing the code inside the my_client_code() function.

In other words, the way you run things would change from the current batch mode, where the client script itself drives Dask-MPI:

mpirun -np N python my_client_script.py

to the new way of running things:

mpirun -np N dask-mpi --script my_client_script.py
This would unify the batch-mode Dask-MPI operation with the interactive (CLI-based) operation. The dask-mpi CLI tool would "do everything", allowing the user to either launch a Dask-MPI cluster from the command-line, or run a client script in batch-mode.

betatim commented 1 year ago

Today I can call initialize from somewhere in my code, or not use it and do something else (say LocalCUDACluster). In my case my code is somewhere inside some "framework" that has its own CLI, does setup and bookkeeping stuff, etc. Somewhere in that "framework" code my code gets called "hey, time to get to work, do your thing now!". At this point I want to setup my cluster and do "my thing". This means it is super useful to have a function I can call, instead of it being the other way around with mpirun -np N dask-mpi --script my_dask_mpi_script.py.

I think mpirun -np N dask-mpi --script my_dask_mpi_script.py looks really nice and it solves a problem we've discussed above regarding naming/control flow/etc. I just think there are enough instances where the user wants to "own the __main__" instead of dask-mpi owning it.

kmpaul commented 1 year ago

@betatim I agree with you entirely. My intention is to make the execute() function part of the public API of the dask_mpi package. So, you will still have an initialize()-like interface, but I'm not sure that is what you are asking for.

At this point I want to setup my cluster and do "my thing". This means it is super useful to have a function I can call, instead of it being the other way around with mpirun -np N dask-mpi --script my_dask_mpi_script.py.

This is really the tricky point, because if you want to use MPI, then you must execute your MPI program with mpirun/mpiexec. Even if you try to use the MPI dynamical capabilities such as Spawn, you need to launch the main program with mpirun or mpiexec. That is just how MPI works.

So, for your case, if you have an external framework that (I assume) is not an MPI-based framework, and you want to use Dask-MPI to launch a cluster "on demand" from within the framework, then the only way to do it is to execute a subprocess (e.g., subprocess.Popen(["mpirun", "-n", str(N), "my_mpi_program"])). I have searched for ways to skip the need for mpirun/mpiexec for a long time, but there just isn't a way out. At least not right now (though maybe I should check again with the most up-to-date versions of MPI and mpi4py).

Even Dask-Jobqueue uses subprocess to submit a job to a resource manager (PBS, LSF, etc.).

If anyone knows of a way to get around the need for mpirun and mpiexec and to do what you are looking for, @betatim, I am open to implementing a solution. For I have wanted a solution like that for a long time, too!

kmpaul commented 1 year ago

Incidentally, the mpirical package (https://github.com/kmpaul/mpirical) is the closest success I've ever had at trying to accomplish something like "run this function with MPI". It still uses mpirun or mpiexec under the hood, and tries to pass pickled objects back and forth between the processes so that return values can still be retrieved (even if the MPI code was run in another process). It's not perfect, though, which I think is part of the reason why it just stagnated.

betatim commented 1 year ago

This is really the tricky point, because if you want to use MPI, then you must execute your MPI program with mpirun/mpiexec. Even if you try to use the MPI dynamical capabilities such as Spawn, you need to launch the main program with mpirun or mpiexec. That is just how MPI works.

Agreed that you need to use mpirun, and I do. The thing I was trying to say is that I don't want to run mpirun -np N dask-mpi --script my_dask_mpi_script.py, where dask-mpi "owns the main". Instead, what I'd like to do is mpirun -np N python some_entry_point.py --foo 42 --bar 21 make_it_so, where some_entry_point.py "owns the main" and eventually calls the code in which I want to use execute() to set up the Dask cluster.
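For illustration, such an entry point might look something like this (a sketch only; execute() is the proposal from this thread, and the argument names are made up):

# some_entry_point.py -- framework-owned entry point that calls Dask-MPI itself
import argparse

from dask_mpi import execute  # proposed API from this thread

def do_my_thing(foo, bar):
    # ... create a Client and run the actual Dask work here ...
    pass

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--foo", type=int, default=0)
    parser.add_argument("--bar", type=int, default=0)
    parser.add_argument("command")
    args = parser.parse_args()

    if args.command == "make_it_so":
        # The framework decides when to populate the allocation with a Dask cluster.
        execute(do_my_thing, args.foo, args.bar)

if __name__ == "__main__":
    # Launched as: mpirun -np N python some_entry_point.py --foo 42 --bar 21 make_it_so
    main()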

kmpaul commented 1 year ago

Great. Then I understand you. And that should be doable with the execute() function. I hope that is satisfactory.