flux-framework / rfc

Flux RFC project
https://flux-framework.readthedocs.io/projects/flux-rfc/

Document MPIR (parallel debugger) interfaces #187

Open grondo opened 5 years ago

grondo commented 5 years ago

@dongahn and I were kicking around ideas for the implementation of the required MPIR [1] interfaces in flux-framework/flux-core-v0.11#12. While some of the more advanced ideas will probably not make it into the v0.11 repo, we should capture in an RFC the final design of the flux job debug utility, along with the KVS schema used by a conforming job shell to communicate PIDs, ranks, and executables to the job debug utility.

We may also want to use this issue to continue discussion of a more scalable KVS representation of the PIDs in MPIR_proctable.
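For illustration only, a per-shell KVS entry for this data might look roughly like the sketch below; the key names (`hostname`, `executable`, `pids`) and the layout are hypothetical placeholders rather than a proposed schema:

```json
{
  "hostname": "node0",
  "executable": "/usr/bin/myapp",
  "pids": { "0-3": "1234-1237" }
}
```

Here the "0-3" key would be the MPI ranks hosted by this shell and the value their PIDs, in the same order.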

[1]: https://www.mpi-forum.org/docs/mpir-specification-03-01-2018.pdf

grondo commented 5 years ago

Copying from flux-framework/flux-core-v0.11#12, @dongahn said:

For PIDs, I wonder if delta encoding or range encoding will also still be effective. If you can group these PIDs per flux rank, the PIDs would likely be close to one another (likely consecutive) and the list should compress pretty well with these encodings. Each PID in the group will be stored with the flux rank as the key in a compressed form. Of course, we still have to map each MPI rank to the flux rank + pid pair. But I wonder whether that can be reconstructed with an implicit rule.

grondo commented 5 years ago

Yes, most of the time PIDs launched locally from a single job shell will be consecutive. If this were the case and each shell pushed a JSON object of the form:

{ "[0-7]": "[1234-11]" }

then in the best case there would be one entry in the final object per shell (broker rank) in the job. Is this what you are proposing above?
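As a rough illustration (Python, with purely hypothetical helper names), each shell could collapse a run of consecutive PIDs into a single range string, and the tool reading the KVS could expand it back:

```python
def encode_pids(pids):
    """Collapse a list of PIDs into "lo-hi" when they are consecutive,
    falling back to a comma-separated list otherwise."""
    if all(b == a + 1 for a, b in zip(pids, pids[1:])):
        return f"{pids[0]}-{pids[-1]}"
    return ",".join(str(p) for p in pids)

def decode_pids(encoded):
    """Expand the encoded string back into the original list of PIDs."""
    pids = []
    for part in encoded.split(","):
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-"))
            pids.extend(range(lo, hi + 1))
        else:
            pids.append(int(part))
    return pids

# Best case: 8 tasks on one shell with consecutive PIDs -> a single range entry.
pids = list(range(1234, 1242))
assert encode_pids(pids) == "1234-1241"
assert decode_pids("1234-1241") == pids
```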

Each PID in the group will be stored with flux rank as the key as a compressed form.

Ah, were you proposing that we drop the MPI rank encoding as the key in the object above?

dongahn commented 5 years ago

Thanks @grondo. A couple more thoughts.

For PIDs, I wonder if delta encoding or range encoding will also still be effective.

One case where this won't be as effective is when the PIDs wrap around, but that should be rare.

we still have to map each MPI rank to the flux rank+pid pair. But I wonder that can be reconstructed with an implicit rule

If PIDs are stored in the order they are created (keyed by each flux rank), will there be any case where ascending MPI ranks on each node will not map to these PIDs in a consecutive manner? I think this depends on how the execution subsystem forks/execs the processes. As long as the subsystem forks/execs the processes in MPI rank order, this map can be established implicitly... I don't know whether the subsystem knows the MPI rank order at the point where these processes are created, though.

dongahn commented 5 years ago

Ah, were you proposing that we drop the MPI rank encoding as the key in the object above?

Yes.

grondo commented 5 years ago

If PIDs are stored in the order they are created (keyed by each flux rank), will there be any case where ascending MPI ranks on each node will not map to these PIDs in a consecutive manner?

Yes, for any distribution or mapping method that isn't "block", the MPI ranks will not be assigned consecutively by the job shell.
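A small illustration of the point (Python; the node and task counts are made up): only a block distribution produces consecutive MPI ranks on each node.

```python
# 8 tasks over 2 nodes, block vs. cyclic distribution.
ntasks, nnodes = 8, 2

block = {node: [r for r in range(ntasks) if r // (ntasks // nnodes) == node]
         for node in range(nnodes)}
cyclic = {node: [r for r in range(ntasks) if r % nnodes == node]
          for node in range(nnodes)}

print(block)   # {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}  -> consecutive per node
print(cyclic)  # {0: [0, 2, 4, 6], 1: [1, 3, 5, 7]}  -> not consecutive per node
```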

grondo commented 5 years ago

We might want a way to definitively map rank:localid or more generally hostname:localid for any job back to the actual MPI rank, regardless of whether the job is being debugged or not. Perhaps as part of this exercise, we could define a KVS schema for this mapping, and require the MPIR system to use it to map an ordinal task id on any rank back to its MPI rank?
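To make this concrete, one hypothetical shape for such a KVS mapping (all names illustrative, not a proposed schema) would be a per-shell-rank list of MPI ranks in local task (ordinal) order, so that localid is simply an index into the list:

```json
{
  "0": { "hostname": "node0", "mpi_ranks": "0,2,4,6" },
  "1": { "hostname": "node1", "mpi_ranks": "1,3,5,7" }
}
```

With something like this, hostname:localid (or rank:localid) resolves to an MPI rank by indexing into the mpi_ranks list, whether or not a debugger is attached.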

dongahn commented 5 years ago

Yes, for any distribution or mapping method that isn't "block", the MPI ranks will not be assigned consecutively by the job shell.

Sorry, I think my question was poorly constructed. The scheme would work if the MPI ranks are consecutive. But even if they are not, this can still work as long as the ordering of MPI ranks on a given node is consistent with the ordering of PIDs.

Say the MPI ranks assigned to the node are 0, 7, 15, and 23. Will the execution system fork/exec the process for rank 0 first, then 7, 15, and 23? If this consistency cannot be maintained, implicit mapping isn't possible.
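A toy sketch (Python, hypothetical values) of the implicit rule under discussion: if the shell reports PIDs in ascending MPI rank order, the i-th reported PID belongs to the i-th MPI rank assigned to that node.

```python
node_mpi_ranks = [0, 7, 15, 23]   # MPI ranks assigned to this node (from the example above)
pids = [4001, 4002, 4003, 4004]   # PIDs in the order the shell reported them

# Implicit rule: pair them positionally.
proctable_fragment = dict(zip(node_mpi_ranks, pids))
print(proctable_fragment)         # {0: 4001, 7: 4002, 15: 4003, 23: 4004}

# If the shell forks (or reports) the tasks in a different order, this
# positional pairing silently produces the wrong mapping.
```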

dongahn commented 5 years ago

We might want a way to definitively map rank:localid or more generally hostname:localid for any job back to the actual MPI rank

Yes, that was sort of the idea. Your original proposal already had the MPI rank to hostname (or Flux rank) mapping. So the remaining mapping needed appeared to be rank:localid. And I was hoping the implicit mapping would be possible if there were consistency between PID creation order and MPI rank order on each flux rank.

, regardless of whether the job is being debugged or not.

Seems reasonable.

Perhaps as part of this exercise, we could define a KVS schema for this mapping, and require the MPIR system to use it to map an ordinal task id on any rank back to its MPI rank?

If such a mapping would be used not only by MPIR but also by others, I'm all for it.

grondo commented 5 years ago

Will the execution system fork/exec the process for rank 0 first, then 7, 15, and 23? If this consistency cannot be maintained, implicit mapping isn't possible.

Whether the processes are forked in order or not, the PIDs would be reported in order. At least at first, the processes will be forked in order, but I wouldn't say that will be the case for all future job shells. Perhaps for very fat nodes the job shell would want to use a crew of fork helpers or something to speed up the creation of children.

dongahn commented 5 years ago

Whether the processes are forked in order or not, the PIDs would be reported in order. At least at first, the processes will be forked in order, but I wouldn't say that will be the case for all future job shells. Perhaps for very fat nodes the job shell would want to use a crew of fork helpers or something to speed up the creation of children.

This makes sense. I think either way, there is a good chance that PIDs per node will be highly compressible. But it seems there is no good scalable solution for the MPI rank to local PID mapping.

dongahn commented 5 years ago

Maybe something like what you have in https://github.com/flux-framework/rfc/issues/187#issuecomment-497014758 would be the best. Hopefully, on a given flux rank, both the MPI ranks and the PIDs can be well condensed. One can either use PID order as the reference order (probably the best) or MPI rank as the reference order, to keep both orders consistent. I also wonder if, for the non-reference order, delta encoding would produce a better result than range encoding. Just a thought.
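As a toy example of that last point (Python, hypothetical values, not a proposed format): with PID order as the reference order, the MPI ranks on a node under a cyclic-style distribution are strided rather than consecutive, so delta encoding still collapses nicely while range encoding has nothing to collapse.

```python
mpi_ranks = [0, 8, 16, 24]   # MPI ranks on one node, in PID (reference) order

# Delta encoding: first value plus successive differences.
deltas = [mpi_ranks[0]] + [b - a for a, b in zip(mpi_ranks, mpi_ranks[1:])]
print(deltas)     # [0, 8, 8, 8]  -> a repeated stride compresses well

# Range encoding: only consecutive values collapse into "lo-hi" spans,
# so a strided sequence gains nothing.
print(",".join(str(r) for r in mpi_ranks))   # "0,8,16,24"
```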