flux-framework / flux-pmix

flux shell plugin to bootstrap openmpi v5+
GNU Lesser General Public License v3.0

negative ranks are reported in the pmix shell trace #98

Open garlick opened 6 months ago

garlick commented 6 months ago

Problem: pmix defines some special numerical ranks, and they show up as puzzling negative numbers in the trace output:

0.104s: flux-shell[1]: TRACE: pmix: pmix server fence_upcall
    {"procs":[{"nspace":"ƒGnMtS4hm","rank":-2}],"info":

From the spec[1]:

The pmix_rank_t structure is a uint32_t type for rank values.

typedef uint32_t pmix_rank_t;

The following constants can be used to set a variable of the type pmix_rank_t. All definitions were introduced in version 1 of the standard unless otherwise marked. Valid rank values start at zero.

PMIX_RANK_UNDEF A value to request job-level data where the information itself is not associated with any specific rank, or when passing a pmix_proc_t identifier to an operation that only references the namespace field of that structure.

PMIX_RANK_WILDCARD A value to indicate that the user wants the data for the given key from every rank that posted that key.

PMIX_RANK_LOCAL_NODE Special rank value used to define groups of ranks. This constant defines the group of all ranks on a local node.

[1] section 3.2.3 of https://pmix.github.io/uploads/2021/10/pmix-standard-v4.1.pdf

and in pmix.h we have

#define PMIX_RANK_UNDEF     UINT32_MAX
#define PMIX_RANK_WILDCARD  UINT32_MAX-1
#define PMIX_RANK_LOCAL_NODE    UINT32_MAX-2        // all ranks on local node

Kind of meaningless to say that valid ranks start at zero when the type is an unsigned integer. But anyway.
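For reference, here is a rough sketch of why the trace shows -2. It assumes the rank ends up encoded as a signed 32-bit integer somewhere along the way. The RANK_* macros are local, parenthesized copies of the values quoted above, and rank_as_signed() is a hypothetical helper, not something from pmix or flux-pmix.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local copies of the values quoted from pmix.h above (hypothetical names). */
#define RANK_UNDEF      (UINT32_MAX)
#define RANK_WILDCARD   (UINT32_MAX - 1)
#define RANK_LOCAL_NODE (UINT32_MAX - 2)

/* Hypothetical helper: reinterpret the 32 bits of a rank as a signed value,
 * which is what a signed 32-bit encoding of the rank would display.
 * memcpy avoids relying on implementation-defined integer conversion. */
int32_t rank_as_signed(uint32_t rank)
{
    int32_t s;

    assert(sizeof s == sizeof rank);
    memcpy(&s, &rank, sizeof s);
    return s;
}
```

With this, rank_as_signed(RANK_WILDCARD) comes out as -2, which would make the rank in the fence_upcall trace above PMIX_RANK_WILDCARD. RANK_UNDEF and RANK_LOCAL_NODE come out as -1 and -3.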

The trace is a raw JSON dump of the interthread message used in the server upcall, so one option is to encode ranks as I (64-bit integer) rather than i (int). The special values would then show up as large positive numbers instead of negative ones.
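A minimal sketch of what the wider encoding buys, assuming the message is built with jansson-style pack formats where i takes a C int and I takes a 64-bit json_int_t. The helper below is hypothetical and only uses snprintf, so it mimics the resulting JSON rather than calling the real packing code.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: render a rank the way a 64-bit JSON integer would
 * appear. Widening uint32_t to int64_t preserves the value, so the special
 * ranks stay large and positive instead of wrapping to negative numbers. */
int format_rank_wide(uint32_t rank, char *buf, size_t len)
{
    int64_t wide = (int64_t)rank;

    assert(buf != NULL);
    return snprintf(buf, len, "{\"rank\":%" PRId64 "}", wide);
}
```

Passing UINT32_MAX-1 gives {"rank":4294967294} rather than {"rank":-2}. Still not self-explanatory, but at least it matches the unsigned type in the spec.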

We could also encode them as a string and use the above names as the encoding for the special values, but that might be going too far for a debug trace.
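If we did go the string route, it could look something like the sketch below. rank_to_string() and the RANK_* macros are hypothetical names for illustration; the macro values are local, parenthesized copies of the pmix.h definitions quoted above.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Local copies of the values quoted from pmix.h above (hypothetical names). */
#define RANK_UNDEF      (UINT32_MAX)
#define RANK_WILDCARD   (UINT32_MAX - 1)
#define RANK_LOCAL_NODE (UINT32_MAX - 2)

/* Hypothetical helper: write a readable form of rank into buf and return buf.
 * Special values are spelled with their spec names, ordinary ranks in decimal. */
const char *rank_to_string(uint32_t rank, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    switch (rank) {
    case RANK_UNDEF:
        snprintf(buf, len, "PMIX_RANK_UNDEF");
        break;
    case RANK_WILDCARD:
        snprintf(buf, len, "PMIX_RANK_WILDCARD");
        break;
    case RANK_LOCAL_NODE:
        snprintf(buf, len, "PMIX_RANK_LOCAL_NODE");
        break;
    default:
        snprintf(buf, len, "%" PRIu32, rank);
        break;
    }
    return buf;
}
```

The trace would then carry "rank":"PMIX_RANK_WILDCARD" for the fence above, at the cost of the rank field no longer being a JSON number.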