JuliaParallel / ClusterManagers.jl


Add the `ExistingProcessManager` cluster manager #147

Closed DilumAluthge closed 3 years ago

DilumAluthge commented 3 years ago

Summary

This pull request adds the ExistingProcessManager cluster manager. Because this is a non-breaking feature, I have bumped the version number from 0.4.0 to 0.4.1.

Example usage

First, run the following command four times. The workers can run on the same machine or on different machines.

julia --worker=1234567890abcdef &
Make note of the hosts and port numbers that are printed to stdout. For this example, we will suppose that the hosts and port numbers printed to stdout were as follows:

Host          Port
192.168.1.151 9684
192.168.1.151 9685
192.168.1.151 9686
192.168.1.151 9687

Now, open a Julia session and run the following: (in the workers array, replace the hosts and port numbers with the hosts and port numbers that you received in the previous step)

julia> import ClusterManagers, Distributed

julia> Distributed.cluster_cookie("1234567890abcdef")

julia> workers = [
       ("192.168.1.151", 9684),
       ("192.168.1.151", 9685),
       ("192.168.1.151", 9686),
       ("192.168.1.151", 9687),
       ]

julia> Distributed.addprocs(ClusterManagers.ExistingProcessManager(workers))

Motivation

The idea here is that for whatever reason, you have already manually started the Julia worker processes by doing julia --worker=my_cluster_cookie. Those workers are now running, and you want to add them as processes so that Distributed is aware of them. This is what the ExistingProcessManager cluster manager allows you to do.

Constructors

There are two ways to construct an ExistingProcessManager:

  1. Pass a vector of tuples, where the first element of each tuple is the host, and the second element of each tuple is the port number.
  2. Pass a vector of Distributed.WorkerConfigs.
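For the second constructor, here is a sketch of building the `Distributed.WorkerConfig`s by hand. The `host` and `port` fields are part of `Distributed`'s `WorkerConfig`; the host/port values are the placeholders from the example above.

```julia
using Distributed

# Sketch: construct WorkerConfig objects manually instead of (host, port) tuples.
# WorkerConfig is a mutable struct in Distributed whose fields default to nothing.
configs = map([("192.168.1.151", 9684), ("192.168.1.151", 9685)]) do (host, port)
    wc = Distributed.WorkerConfig()
    wc.host = host
    wc.port = port
    wc
end
```

The resulting vector could then be passed to `ClusterManagers.ExistingProcessManager(configs)` in place of the tuple vector.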
juliohm commented 3 years ago

Thank you @DilumAluthge for the contribution. However, I feel that this feature is outside of the scope of the package. Perhaps it deserves a separate package?

We are trying to revamp the ClusterManagers.jl package, and as part of this revamp we want to clarify the message that the package was designed for use cases where one requests resources a priori with the manager object itself. A related discussion can be found here: https://github.com/JuliaParallel/ClusterManagers.jl/issues/145

So use cases where resources are gradually requested or requested from outside the ClusterManagers.jl managers don't fit here.

DilumAluthge commented 3 years ago

Sure, this can go in a different package. Do you know of any packages for which this feature would be suitable?

It's such a small piece of code that it's not really worth making a new package for it, but I can do that if there isn't another suitable home.

juliohm commented 3 years ago

It seems like a utility function, part of a larger set of utilities one could wish for when working with multiple workers. I am not aware of any specific package, but I remember seeing something like ClusterUtils.jl a few days ago.

Thanks for understanding and considering the contribution in the first place. :)

bjarthur commented 3 years ago

i disagree. this PR fits well within the scope of ClusterManagers.jl. it's not so different from the ElasticManager. https://github.com/JuliaParallel/ClusterManagers.jl/blob/master/src/elastic.jl

bjarthur commented 3 years ago

note also that "gradually requested" resources are part of the workflow of ClusterManagers.jl. one can repeatedly call addprocs; workers don't have to be instantiated all at once.
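(A minimal local illustration of repeated addprocs calls, using the default local manager rather than a cluster manager:)

```julia
using Distributed

addprocs(2)          # first batch of local workers
# ... do some work ...
addprocs(2)          # later, add more workers to the same session
println(nworkers())  # the pool now holds both batches
```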

bjarthur commented 3 years ago

note also the docstring for addprocs specifically describes how to asynchronously add workers to the pool. https://docs.julialang.org/en/v1/stdlib/Distributed/#Distributed.addprocs

juliohm commented 3 years ago

@bjarthur I am not familiar with the ElasticManager, but does it assume that the workers have been requested from outside ClusterManagers.jl? I think we need to nail down the scope of the package because otherwise we will continue to have this as a work in progress with a multitude of different approaches, none of them working smoothly.

Do you think we should allow multiple addprocs calls? That is a valid question as well. The workflow in HPC clusters isn't usually like this. As @vchuravy pointed out in the issue, we can't proceed in most cases unless we have all the resources reserved a priori.

juliohm commented 3 years ago

The entire package is outdated, including docstrings. I was planning a major refactoring of the code, but got swamped with other priorities in my TODO list.

bjarthur commented 3 years ago

that docstring is in julia base, not ClusterManagers, and has been there for years.

DrChainsaw commented 3 years ago

I'm also confused. This feature sounds like it does exactly this:

We are trying to revamp the ClusterManagers.jl package, and as part of this revamp we want to clarify the message that the package was designed for use cases where one requests resources a priori with the manager object itself

This is basically the only manager you need, as all other managers are in clear violation of this target. I can't see how it can get more a priori than launching the workers before you even start the Julia session.

juliohm commented 3 years ago

Sorry, I thought you were mentioning docstrings in the package here. I really think that we need to narrow the scope of this package to make it work better, and only then consider alternative approaches. The variability in implementations and workflow models is causing more harm than help.

bjarthur commented 3 years ago

let's re-open this pull request so we can consider it as a group. there are multiple people who have write/push access (myself included) who might want to weigh in.

juliohm commented 3 years ago

@DrChainsaw the workflow @vchuravy and I were describing in that issue, and that you and @bjarthur seem to dislike, is:

using ClusterManagers

addprocs_foo(N, flags=gpus, memory, etc) # wait for the scheduler to kick in

# do the heavy simulation work with all the resources already available

# finish the job as a whole.
juliohm commented 3 years ago

So this idea of multiple addprocs calls doesn't seem like a good fit here; at least to me it feels like an exotic use case where the user wants to update the pool of processes multiple times during the execution of a script.

juliohm commented 3 years ago

I think we need to actually continue the discussion in #145 to clarify the scope of ClusterManagers.jl and the workflows we intend to support. My vote is to really write down a formal description of what the package is supposed to do, as opposed to trying to accommodate multiple workflows at once. The status of this package is already concerning, and it will only get worse as we add more managers without active maintainers with a shared vision.

bjarthur commented 3 years ago

that's fine. but please keep in mind @juliohm that you are a relative newcomer here. please give all the other stakeholders a chance to weigh in [EDIT: before dismissing PRs like this so quickly].

bjarthur commented 3 years ago

@DilumAluthge please resubmit this PR so it can be considered collectively

juliohm commented 3 years ago

I fully agree, @bjarthur, with weighing different points of view. I am just concerned with the package as a whole, and think that the PR we are considering here moves in the wrong direction for improving the current state of things.

Sorry I closed the PR without waiting longer for others to jump in. No one seemed to have been watching this repo for a while, so I assumed I could make decisions more quickly by myself.

DrChainsaw commented 3 years ago

@juliohm Just want to clarify that I don't have anything against that workflow at all.

I think that this PR supports the intended workflow very well.

In case it was confusing what I was arguing about in #145 then it was a workflow like this

addprocs_t = @async addprocs_foo(N, flags=gpus, memory, etc)
dothework_t = @async startprocessing(...)
while !istaskdone(dothework_t)
     # look for new workers in workers(), set them up with @everywhere,
     # and add them to the worker pool
end
fetch(addprocs_t) |> rmprocs

I'm fully aware that this is a hack and I'm totally fine with not adding extra functionality to this package to further support this workflow. Just added it here as a clarification.

Edit: In case it is not clear, the proposed functionality does in no way support my hack workflow with async. I'm just trying to be cooperative here :)

juliohm commented 3 years ago

Thank you all for the discussion here. Let's draft a "formal" specification of the workflow we intend to support in #145, and then we can come back to all the available managers and refactor the codebase to a single manager with the full logic implemented in a manner that is agnostic to specific cluster tools (e.g. bsub, srun).

DilumAluthge commented 3 years ago

@DilumAluthge please resubmit this PR so it can be considered collectively

I have resubmitted this PR here: https://github.com/JuliaParallel/ClusterManagers.jl/pull/148

DilumAluthge commented 3 years ago

@DrChainsaw the workflow me and @vchuravy were describing in that issue and that you and @bjarthur seem to dislike is:

using ClusterManagers

addprocs_foo(N, flags=gpus, memory, etc) # wait for the scheduler to kick in

# do the heavy simulation work with all the resources already available

# finish the job as a whole.

Just to clarify, the ExistingProcessManager is designed for exactly this workflow.

DilumAluthge commented 3 years ago

multiple addprocs

To clarify, when using the ExistingProcessManager in a typical use case, you will only call addprocs once, at the beginning of your script.

The workflow is:

  1. Start the Julia workers. Maybe you do this manually, or maybe this is done by some resource scheduler. Either way, you end up with a list of Julia worker processes in the form of (host, port_number) tuples.
  2. You call addprocs(ExistingProcessManager(workers)) one time at the beginning of your script. This tells Distributed about your list of Julia worker processes.
  3. You do your computation.

The set of Julia worker processes does not change dynamically throughout the script. You set your workers once at the beginning of the script, and then the set of workers remains constant for the remainder of the script.
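The three steps above might look like this in a script. The cookie and the host/port values are placeholders carried over from the example at the top of the PR, and the sketch assumes a ClusterManagers version that includes this PR's ExistingProcessManager:

```julia
import ClusterManagers, Distributed

# Step 1 happened outside this script: workers were started with
#   julia --worker=1234567890abcdef
# and we recorded their (host, port) pairs.
Distributed.cluster_cookie("1234567890abcdef")
workers = [("192.168.1.151", 9684), ("192.168.1.151", 9685)]

# Step 2: tell Distributed about the existing workers, once.
Distributed.addprocs(ClusterManagers.ExistingProcessManager(workers))

# Step 3: do the computation on a fixed worker pool.
results = Distributed.pmap(x -> x^2, 1:100)
```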