flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

how to generate a "hostfile" for a flux instance or allocation #1489

Closed: grondo closed this issue 6 years ago

grondo commented 6 years ago

@trws brought up in a meeting that it would be nice if there was a user facing utility to dump a hostfile for use with other launchers for a flux instance or allocation.

This could be a new utility (e.g. flux hostfile ARGS...) or maybe a hostfile or files written to a standard kvs location.

What I'm not sure about is:

trws commented 6 years ago

For a bit more context, the current use-case for this is always going to look something like this:

flux submit -N <foo> -t 1 flux broker <script>.sh

Where <script>.sh is something like this:

HOSTFILE=$(<get_hostfile_path_somehow>)
mpirun -hostfile $HOSTFILE thing_flux_will_not_launch
# or
mpiexec -f $HOSTFILE ...

I'd say unless we change some of the constraints on wreck and company, it's a per-instance thing for sub-instances like this, since this is the only way currently to run a single task in a multi-node allocation.

garlick commented 6 years ago

To confirm: the main problem we are solving is to enable the use of a non-flux launcher to launch MPI jobs under Flux? (e.g. the sierra mpirun?)

Is there a way to avoid hostnames by coercing mpirun/mpiexec to use flux exec instead of ssh or rsh (with an option or environment variable), and then make a hostlist that maps Flux ranks to mpi ranks?

trws commented 6 years ago

Not necessarily MPI jobs, but that's probably the most frequent use-case yes.

I like that idea as something we could do in addition, but I'm not sure it's always an option. For reference, both of the main launchers have an option to set the binary they invoke as "rsh" to whatever you want. So if we had something that could masquerade as rsh, yes that would work. Otherwise, it's pretty complicated to retrofit anything like that in a general way unfortunately. Hydra does have a "manual" option that would let us do it, but it would be substantially more work than generating a hostlist.

garlick commented 6 years ago

We could provide a flux rsh wrapper for flux exec, e.g.

#!/bin/bash
# Let "flux exec" masquerade as rsh: first argument is a broker rank,
# remaining arguments are the command to run on that rank.
if test $# -lt 2; then
    echo "Usage: flux-rsh rank command ..." >&2
    exit 1
fi
rank=$1; shift
exec flux exec -r ${rank} "$@"

If that could be used more generally in place of rsh or ssh, then there should be no need to map ranks to hostnames, and it leverages the already wired secure overlay.
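
For illustration, a hypothetical invocation, assuming the wrapper above is installed as a flux subcommand named flux-rsh:

# run hostname on broker rank 3 over the overlay instead of ssh
flux rsh 3 hostname
# stdin/stdout are forwarded by flux exec, so pipes should work too
echo data | flux rsh 0 cat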

trws commented 6 years ago

I personally like that idea a lot, at least if it provides the two-way pipe for stdin/out that rsh normally does.

Thinking of how I would explain to a user how to use that is another matter though. The openmpi mpirun line would look something like this:

mpirun -hostfile <(seq 0 $(($(flux getattr size) - 1))) -mca plm rsh -mca plm_rsh_agent "flux rsh" -N <something> <command>

Doing it with hydra would have a completely different but equally difficult-to-explain set of options. If we want to support running with mpirun and hydra, we should provide them a plugin or something to use, or provide a wrapper, but for the general case of "a user has a frustrating piece of software that wreckrun won't launch" I'd really like to have the hostfile.
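
For comparison, a hedged sketch of what the Hydra (MPICH mpiexec) variant might look like, assuming Hydra's standard -launcher and -launcher-exec options (untested here):

# tell Hydra to use its rsh launcher, but with "flux rsh" as the rsh binary
mpiexec -launcher rsh -launcher-exec "flux rsh" -f <(seq 0 $(($(flux getattr size) - 1))) -n <ntasks> <command>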

grondo commented 6 years ago

If we had to generate a hostfile I would gravitate toward using an rc1 script since that is ostensibly the purpose of these scripts. One could be dropped into rc1.d/make-hostfile.sh or similar. The downside is that it would run on every startup, but the footprint should be smaller than, say, loading a module.

It would actually be nice if the rc1 scripts could export environment variables to the initial program. Then a FLUX_HOSTFILE variable could be set and your script above could become just one line. (Though if the hostfile is placed under broker.rundir or another local-only directory, the file would only be available on rank 0; I'm guessing that is not a show-stopper. Otherwise the hostfile could be stored in the kvs.)

example make-hostlist.sh:

# write one hostname per rank (unordered) into a file under the broker rundir
FLUX_HOSTFILE=$(flux getattr broker.rundir)/hostlist
flux exec -r all hostname >$FLUX_HOSTFILE
# If only: flux rc-export FLUX_HOSTFILE

This is just the simplest thing I can think of right now.

garlick commented 6 years ago

I like the flux rc-export idea!

On the other hand, what about just providing a flux hostlist [jobid] command that the user's script could run?

dongahn commented 6 years ago

On the other hand, what about just providing a flux hostlist [jobid] command that the user's script could run?

The scheduler can also easily store the hostlist into the job schema if this makes it easy to provide this command.

grondo commented 6 years ago

On the other hand, what about just providing a flux hostlist [jobid] command that the user's script could run?

That is probably in essence what @trws is after, except not for jobids but for the instance as a whole. I was thinking we could perhaps avoid instantiating a new command if the only use case for the command is to redirect its output to a file.

Not opposed to a flux hostlist command, but for now would flux exec hostname >hostfile suffice as a stand-in for flux hostlist > hostfile?

If we had the hostfile in a well-known kvs key as @dongahn proposed, then flux kvs get hostfile is also a pretty short command. (However, the key path for the hostfile could never change, so I guess that is a benefit of wrapping one-liners in new commands.)
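
To make the comparison concrete, the two stand-ins side by side (assuming a well-known key simply named hostlist, which is not settled at this point in the thread):

flux exec -r all hostname > hostfile   # stand-in for: flux hostlist > hostfile
flux kvs get hostlist > hostfile       # if an rc script had stored it in the kvs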

grondo commented 6 years ago

but for now would flux exec hostname >hostfile suffice as a stand-in for flux hostlist > hostfile?

Sorry, meant to mention there are drawbacks to flux exec hostname -- the hostnames will come out unordered on stdout, and it may not scale all that well to a very large number of ranks (not tested).
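
One hedged workaround for the ordering problem is to prefix each line with its broker rank and sort; this assumes flux getattr rank resolves inside flux exec, and it does nothing about the scaling concern:

# emit "<rank> <hostname>" per rank, order numerically, then drop the rank column
flux exec -r all sh -c 'echo "$(flux getattr rank) $(hostname)"' | sort -n | cut -d' ' -f2- > hostfile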

trws commented 6 years ago

That is actually what we did for a while. I switched over to using the kvs commands when we ran into nodes in a job that were not responding. It's generally not too bad either way, but it ends up having to be something like `seq 0 $(flux getattr size) | xargs -n1 -I {} flux exec -r {} hostname > file`; part of the goal is generating it in a way where the command is independent of the size (preferably).
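
A cleaned-up sketch of that per-rank loop, which tolerates unresponsive nodes at the cost of one flux exec call per rank (the dead-rank placeholder line is illustrative only):

# query each rank individually so one unresponsive node does not abort the list
size=$(flux getattr size)
for r in $(seq 0 $((size - 1))); do
    flux exec -r "$r" hostname 2>/dev/null || echo "rank-$r-unresponsive"
done > hostfile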



grondo commented 6 years ago

The scheduler can also easily store the hostlist into the job schema if this makes it easy to provide this command.

If the hostlist becomes part of the job kvs schema, then the broker or an rc1 script could copy that file into a similar (same?) location in the kvs of the new instance. (Actually there might be other kvs data that we would want to transfer from job kvs namespace to the root namespace of the sub-instance)

dongahn commented 6 years ago

That is probably in essence what @trws is after, except not for jobids but for the instance as a whole.

This is because he creates a new instance to work around a limitation within wreck? Essentially, though, we seem to need the hostlist of each jobid at the parent level under a known key, and then this hostlist can be pushed into the child instance's execution environment somehow. That way, this can be supported whether or not a new instance is created?

grondo commented 6 years ago

It's generally not too bad either way, but it ends up having to be something like `seq 0 $(flux getattr size) | xargs -n1 -I {} flux exec -r {} hostname > file`; part of the goal is generating it in a way where the command is independent of the size (preferably).

flux exec runs on all ranks by default, or you can supply the -r all option. (Unless I'm missing the point, which is possible)

dongahn commented 6 years ago

If the hostlist becomes part of the job kvs schema, then the broker or an rc1 script could copy that file into a similar (same?) location in the kvs of the new instance. (Actually there might be other kvs data that we would want to transfer from job kvs namespace to the root namespace of the sub-instance)

I think we are on the same line of thought.

grondo commented 6 years ago

That way, this can be supported whether or not a new instance is created?

Yes, this works except for instances not started under Flux.

dongahn commented 6 years ago

Yes, this works except for instances not started under Flux.

True. When we don't have turtles all the way down: https://en.wikipedia.org/wiki/Turtles_all_the_way_down :-)

trws commented 6 years ago

flux exec runs on all ranks by default, or you can supply the -r all option.

Nope, I was the one missing something, in that I didn't realize that. We still would have had to do the xargs thing for allocations with one or more non-responsive nodes, but that's by far the uncommon case (except on sierra right now...)

garlick commented 6 years ago

I'm not so sure about incorporating the hostlist into the job schema.

In the unlikely event that a hostlist is needed (at least I sincerely hope this is not the common case!), one can easily get from the local rank to a hostname with flux exec, or if needed per job, then flux exec on the job's ranks mapped back to the enclosing instance's ranks.

That rank remapping for each job is the main thing that should be connecting each turtle IMHO.

grondo commented 6 years ago

In general, people do want to be able to go back and see on which nodes their jobs ran. That is the only reason I thought a list of hosts would be useful in the schema. But you are right that should be embedded in R anyway.

That rank remapping for each job is the main thing that should be connecting each turtle IMHO.

Sorry, not smart enough here. By rank remapping do you mean rank 0 in child is rank N in parent, is rank M in their parent, back to the system or other bootstrap instance? Seems you are onto something here, but I'm not exactly sure what (sorry if it was obvious)

dongahn commented 6 years ago

But you are right that should be embedded in R anyway.

Isn't R also a part of the job schema, though? We can also have flux hostlist be satisfied from the hostname elements embedded in R. R_lite currently doesn't have them, though.

garlick commented 6 years ago

people do want to be able to go back and see on which nodes their jobs ran

Oh right I forgot about that use for this mapping!

By rank remapping do you mean rank 0 in child is rank N in parent, is rank M in their parent, back to the system or other bootstrap instance?

Yes that's all I meant - this mapping seems to me like the minimum amount of information needed to express how each job relates to its parent. I wasn't suggesting that hostlist generation would need to recurse all the way back to the system instance, just that expressing the hostlist mapping at each level in a hierarchy felt redundant to me.

Thinking of each hostname as a resource that is part of R, and including it in the "provenance" info for a job, makes a lot of sense though, so maybe I was headed down the wrong path with that thought.

dongahn commented 6 years ago

What is the consensus?

Is a part of this to extend R_lite to include hostnames? In that case, how about we simply add a node key:

lwj.0.0.1.R_lite = [ { "rank": 0, "node":"sierra1", "children": { "core": "0,1,2" }} ]

Then, a wrapper command can simply walk the R_lite array and fetch "node" keys from each element and concatenate them into a hostlist?
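
A minimal sketch of such a wrapper, assuming the R_lite value can be fetched as JSON and filtered with jq (both the key path and the --json flag here are illustrative):

# print the "node" field of each R_lite element, one hostname per line
flux kvs get --json lwj.0.0.1.R_lite | jq -r '.[].node'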

garlick commented 6 years ago

I'll go along with whatever you guys decide.

So...can the scheduler in a child instance initialize its resources from R_lite?

dongahn commented 6 years ago

No, we can't do this with R_lite. This should be addressed as part of our discussion to grow R_lite into R.

If we need this before that, we can have the scheduler optionally store rdl and use it to instantiate the child scheduler.

If affinity support is added, the child scheduler can also instantiate from the instance hwloc.

dongahn commented 6 years ago

Maybe one way to view R is as data with two sections. One section has data in a specific format that the execution service understands. The other section has data that allows the child instance to build the resource representation for its scheduler.

For the latter section, we discussed whether we want to fully spec out the format or go with an opaque approach.

The former approach will make it easy for different scheduler implementations to interoperate well. But this means the spec should be expressive enough to allow the most advanced scheduler to build the complex resource representation needed for advanced scheduling.

The latter won't require a rigor to design a powerful spec. But it will push the complexity to adapter development to allow one representation to be translated into another.

If we go with the former, the spec should be a full graph, and I still propose graphml with a custom schema.

grondo commented 6 years ago

Is a part of this to extend R_lite to include hostnames? In that case, how about we simply add a node key: lwj.0.0.1.R_lite = [ { "rank": 0, "node":"sierra1", "children": { "core": "0,1,2" }} ]

I'm not sure that solves the specific use-case here, which would require an extra step of propagating R_lite to the kvs of the child instance, at which point the "rank" keys may become confusing. (Though on second thought, the index into the R_lite array would serve as the mapping from local rank to parent rank, i.e. R_lite[local_rank].rank == parent_rank. So perhaps this is useful in the short term?)
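
As a sketch of that index-based mapping, again assuming the R_lite JSON can be fetched and filtered with jq (hypothetical key path):

# look up the parent rank for a given local rank via the array index
local_rank=2
parent_rank=$(flux kvs get --json lwj.0.0.1.R_lite | jq -r ".[$local_rank].rank")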

If you want to add the node name I can't see that it would hurt anything, and then it will be there if we decide to use it. I would also allow a flux hostlist command to work with a jobid as @garlick had proposed.

trws commented 6 years ago

That would be useful in its own right for attaching external tooling and debugging submissions actually.


dongahn commented 6 years ago

@grondo and I discussed this offline.

The short of it is, I will add the node key to R_lite, and @grondo will propose how an rc script will allow communication between the parent and child instances, as well as how flux hostlist will be implemented.

grondo commented 6 years ago

My proposal is that flux hostlist [JOBID]... works as follows:

Without any JOBID argument, flux hostlist will issue the hostlist for the current instance, one hostname per rank. Internally flux hostlist will first check for a well-known kvs key (*), hostlist(?), and dump that to stdout, and will fall back to flux exec hostname if the hostlist key is not found.

With a JOBID argument flux hostlist will issue the hostlist for that job by parsing R_lite. Multiple jobids could be supported in which case the union would be provided.

(*) The well-known key could be installed via an rc script by using something like

flux kvs put --json hostlist="$(FLUX_URI=<parent_uri> flux hostlist ${FLUX_JOB_ID})"

Someone please suggest a well-known key. :-)
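
A hedged sketch of the fallback logic for the no-JOBID case described above, using resource.hosts, the key suggested just below (illustrative, not an implementation):

#!/bin/bash
# flux-hostlist sketch: prefer the well-known kvs key, else query every rank
if flux kvs get resource.hosts 2>/dev/null; then
    exit 0
fi
exec flux exec -r all hostname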

trws commented 6 years ago

resource.hosts?


grondo commented 6 years ago

resource.hosts

Works for me! Thanks.

dongahn commented 6 years ago

Two-liner posted: https://github.com/flux-framework/flux-sched/pull/324.