flux-framework / rfc

Flux RFC project
https://flux-framework.readthedocs.io/projects/flux-rfc/

Canonical representation of resources #237

Open dongahn opened 4 years ago

dongahn commented 4 years ago

This was discussed at https://github.com/flux-framework/flux-core/issues/2908#issuecomment-619153974 where a need for unifying static config, R and other sources (hwloc and vendor specific resource discovery services) into a canonical resource representation like JSON Graph Format (JGF) was expressed.

From @grondo: General worries about attempting to design "do everything" formats there.

It typically ends up making things more difficult for the simple cases, and inevitably impossible for some complex case you didn't at first consider. The idea of parsing global XML to translate it to JGF just to read "You have 4 cores" seems like a lot of churn for a high throughput case as an example.

But the goal is lofty and he is willing.

From @dongahn:

As I see where we are headed at the high end, more complex cases will come our way much quicker than you would think (e.g., multi-tiered storage support). Ways to statically configure a system will also have to change (towards higher complexity), and there will likely be multiple ways. Also very likely, we will have to deal with different ways to populate R (now hwloc, but later vendor-specific external services to discover global storage resources...). Yet we have to advance not only flux-core but also other components to keep abreast of these changes. It seemed this was too much complexity to deal with in an ad-hoc fashion. Now, having the canonical jobspec was very helpful for making progress at different paces between flux-core and -sched, and it feels like we can benefit from a similar arrangement: having a full-blown target representation first and slowly building up partial implementations. Also, we have lots of experience with JGF, with multiple efforts around it. It felt like it makes sense to leverage them as well.

grondo commented 4 years ago

As a start, let's propose a very simple JGF version of R and do a straw man integration with something like simple-sched. Since R would no longer be human readable, we'd need to develop some tools to display and operate on R as well...

dongahn commented 4 years ago

My initial proposal would be just to use the format under scheduling key of https://github.com/flux-framework/rfc/blob/master/spec_20.rst as our canonical format.

I will post a simple example that contains cluster, node and cores as currently used by flux-core. This limitation can serve as the V1 of this canonical format.

flux-sched also has the code to parse it so we may borrow from that code for the straw man.

dongahn commented 4 years ago

We discussed this a bit at today's meeting:

Is the canonical representation of R good enough to be targeted by various writers/generators? A writer/generator will produce an R and expect that the rest of the system will more or less just work:

Then, other services will be serviced off of the R. Many will use manipulation libraries though.

A comment was made that JGF is likely to work, but a harder part will be how to map execution targets to R.

grondo commented 4 years ago

@dongahn, are there instructions for having flux-sched generate R for jobs which contain scheduler key and JGF representation of resources? I'd like to generate samples of that format for study.

dongahn commented 4 years ago

@grondo:

It would have been a bit nicer since it has some fixes, but please look at: https://github.com/flux-framework/flux-sched/blob/2c3b9ec75139f408f75ac3963b77c087598c27d6/t/t1006-recovery-full.t#L28

Load options (match-format=rv1) should allow fluxion-resource to generate the full rv1 instead of rv1_nosched, which omits the JGF key.

I was planning to spend some time for this as well next week. So this is great timing.

dongahn commented 4 years ago

You should also be able to change the match emit format through resource's rc1 script:

FLUXION_RESOURCE_OPTIONS="match-format=rv1 load-whitelist=node,core,gpu"

dongahn commented 4 years ago

If you want to look at this for more advanced graph representations, please consider using resource-query as well. It has the same emit formats available as a CLI option.

    -F, --match-format=<simple|pretty_simple|jgf|rlite|rv1|rv1_nosched>
            Specify the emit format of the matched resource set.
            (default=simple).

Example GRUG files including things like multi-tiered storage configurations:

https://github.com/flux-framework/flux-sched/blob/master/t/t3020-resource-mtl2.t#L9

grondo commented 4 years ago

Thanks! I was able to do:

$ flux module reload resource match-format=rv1

For my own benefit, here's an example rv1 for a 2-core allocation in a docker container

ƒ(s=1,d=0) fluxuser@428d6d454f60:~$ flux job info 646258360320 R | jq
{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "0",
        "node": "428d6d454f60",
        "children": {
          "core": "2-3"
        }
      }
    ],
    "starttime": 1589816042,
    "expiration": 1590420842
  },
  "scheduling": {
    "graph": {
      "nodes": [
        {
          "id": "7",
          "metadata": {
            "type": "core",
            "basename": "core",
            "name": "core2",
            "id": 2,
            "uniq_id": 7,
            "rank": 0,
            "exclusive": true,
            "unit": "",
            "size": 1,
            "paths": {
              "containment": "/cluster0/428d6d454f60/socket0/core2"
            }
          }
        },
        {
          "id": "9",
          "metadata": {
            "type": "core",
            "basename": "core",
            "name": "core3",
            "id": 3,
            "uniq_id": 9,
            "rank": 0,
            "exclusive": true,
            "unit": "",
            "size": 1,
            "paths": {
              "containment": "/cluster0/428d6d454f60/socket0/core3"
            }
          }
        },
        {
          "id": "2",
          "metadata": {
            "type": "socket",
            "basename": "socket",
            "name": "socket0",
            "id": 0,
            "uniq_id": 2,
            "rank": 0,
            "exclusive": false,
            "unit": "",
            "size": 1,
            "paths": {
              "containment": "/cluster0/428d6d454f60/socket0"
            }
          }
        },
        {
          "id": "1",
          "metadata": {
            "type": "node",
            "basename": "428d6d454f60",
            "name": "428d6d454f60",
            "id": -1,
            "uniq_id": 1,
            "rank": 0,
            "exclusive": false,
            "unit": "",
            "size": 1,
            "paths": {
              "containment": "/cluster0/428d6d454f60"
            }
          }
        },
        {
          "id": "0",
          "metadata": {
            "type": "cluster",
            "basename": "cluster",
            "name": "cluster0",
            "id": 0,
            "uniq_id": 0,
            "rank": -1,
            "exclusive": false,
            "unit": "",
            "size": 1,
            "paths": {
              "containment": "/cluster0"
            }
          }
        }
      ],
      "edges": [
        {
          "source": "2",
          "target": "7",
          "metadata": {
            "name": {
              "containment": "contains"
            }
          }
        },
        {
          "source": "2",
          "target": "9",
          "metadata": {
            "name": {
              "containment": "contains"
            }
          }
        },
        {
          "source": "1",
          "target": "2",
          "metadata": {
            "name": {
              "containment": "contains"
            }
          }
        },
        {
          "source": "0",
          "target": "1",
          "metadata": {
            "name": {
              "containment": "contains"
            }
          }
        }
      ]
    }
  }
}
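
To make the relationship between the two keys concrete, here is a minimal Python sketch that collapses the `scheduling` graph back into an R_lite-like "cores per rank" summary. The helper name `cores_by_rank` is hypothetical, the input is trimmed to the metadata fields the sketch reads, and this is not how flux-core or flux-sched actually do it.

```python
from collections import defaultdict

def cores_by_rank(graph):
    """Group the local ids of 'core' vertices by broker rank."""
    cores = defaultdict(list)
    for vertex in graph["nodes"]:
        meta = vertex["metadata"]
        if meta["type"] == "core":
            cores[meta["rank"]].append(meta["id"])
    return {rank: sorted(ids) for rank, ids in cores.items()}

# Trimmed copy of the vertices from the rv1 example above.
graph = {
    "nodes": [
        {"id": "7", "metadata": {"type": "core", "id": 2, "rank": 0}},
        {"id": "9", "metadata": {"type": "core", "id": 3, "rank": 0}},
        {"id": "2", "metadata": {"type": "socket", "id": 0, "rank": 0}},
        {"id": "1", "metadata": {"type": "node", "id": -1, "rank": 0}},
        {"id": "0", "metadata": {"type": "cluster", "id": 0, "rank": -1}},
    ]
}

summary = cores_by_rank(graph)
print(summary)
```

The result matches the `"core": "2-3"` entry for rank 0 in `R_lite`, which suggests the JGF carries at least the information a simple consumer needs.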

dongahn commented 4 years ago

Great!

Note that the scheduling key has much more detailed information than R_lite key even for two-core allocation. So things like high throughput case, I still want to specialize the scheduler behavior and omit JGF. In general, though, compared to how hwloc represents its resources (i.e., xml exportable), this would be lighter though.

As a start, let's propose a very simple JGF version of R and do a straw man integration with something like simple-sched.

How does sched-simple use hwloc data? Would it be straightforward to create an interface such that it can turn this form into what sched-simple requires?

grondo commented 4 years ago

How does sched-simple use hwloc data? Would it be straightforward to create an interface such that it can turn this form into what sched-simple requires?

sched-simple does not use hwloc data directly, but instead reads the aggregated information from resource.hwloc.by_rank, which is a flattened and very condensed list of resources (especially when all ranks are the same)

ƒ(s=64,d=0) fluxuser@16ea7ed726d5:~$ flux kvs get resource.hwloc.by_rank
{"[0-63]": {"Package": 1, "Core": 4, "PU": 4, "cpuset": "0-3"}}
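
As a rough illustration of how a consumer might unpack this condensed form, the following Python sketch expands each idset key into one entry per rank. `expand_idset` is a simplified, hypothetical parser written for this example; real code would presumably use Flux's own idset library.

```python
import json

def expand_idset(idset):
    """Expand a string such as "[0-2,5]" or "0-3" into a list of ints."""
    ids = []
    for part in idset.strip("[]").split(","):
        lo, _, hi = part.partition("-")
        ids.extend(range(int(lo), int(hi or lo) + 1))
    return ids

by_rank = json.loads(
    '{"[0-63]": {"Package": 1, "Core": 4, "PU": 4, "cpuset": "0-3"}}'
)

# One summary per broker rank; identical ranks share the same summary object.
per_rank = {
    rank: summary
    for idset, summary in by_rank.items()
    for rank in expand_idset(idset)
}
print(len(per_rank), per_rank[63]["Core"])
```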

Of course JGF has more than enough information in it to be used by the simple scheduler.

So things like high throughput case, I still want to specialize the scheduler behavior and omit JGF.

I thought we were proposing an Rv2 where the format was JGF?

dongahn commented 4 years ago

FYI --

JGF reader code in flux-sched is https://github.com/flux-framework/flux-sched/blob/master/resource/readers/resource_reader_jgf.hpp, which reads this and updates the graph data store. It not only updates the spatial schema of vertices and edges but also scheduler metadata, though.

@milroy has algorithms and code that can also grow the graph data store using a new JGF, which is the current topic for our cluster submission.

The emitted JGF can be fed into resource-query and used for further scheduling as well.

Taking the JGF portion from your example and storing it in ./resource.json:

ahn1@49674596c035:/usr/src/resource/utilities$ flux mini run --dry-run -n 1 hostname > jobspec.json
ahn1@49674596c035:/usr/src/resource/utilities$ ./resource-query -L resource.json -f jgf -F pretty_simple
INFO: Loading a matcher: CA
resource-query> match allocate jobspec.json
      ---cluster0[1:shared]
      ------428d6d454f60[1:shared]
      ---------socket0[1:shared]
      ------------core3[1:exclusive]
INFO: =============================
INFO: JOBID=1
INFO: RESOURCES=ALLOCATED
INFO: SCHEDULED AT=Now
INFO: =============================

dongahn commented 4 years ago

ƒ(s=64,d=0) fluxuser@16ea7ed726d5:~$ flux kvs get resource.hwloc.by_rank
{"[0-63]": {"Package": 1, "Core": 4, "PU": 4, "cpuset": "0-3"}}

I see.

Is Package used by sched-simple though? Is cpuset the idset of core IDs or PU IDs?

What does this look like when each rank's resource set is different? We may not have this case yet though.

dongahn commented 4 years ago

I thought we were proposing an Rv2 where the format was JGF?

Where will Rv2 be used? Will the resource module use the resource set of its instance from it to satisfy its query? If that's the case, does the R of each job include the full JGF or only those jobs that will spawn a new flux instance need it?

grondo commented 4 years ago

Is Package used by sched-simple though?

No, but by_rank is the aggregate of hwloc data and other hwloc objects are summarized for informational purposes. I don't think this format was ever meant to be used long-term though.

What does this look like when each rank's resource set is different? We may not have this case yet though.

There is an idset entry for each set of rank or ranks that have different summary information.

grondo commented 4 years ago

Where will Rv2 be used? Will the resource module use the resource set of its instance from it to satisfy its query? If that's the case, does the R of each job include the full JGF or only those jobs that will spawn a new flux instance need it?

I thought Rv2 was going to be our next step towards a "canonical" resource set representation. I think you summarized it well in the comment above.

As canonical representation, R would be the common resource set serialization used by all Flux components that transmit and share resource sets

Sorry if the above is obvious...

grondo commented 4 years ago

One simple idea would be to allow something between R_lite and full JGF by allowing JGF "nodes" to represent multiple identical resources. E.g. something along the lines of:

{
    "nodes": [
        {
          "id": "0",
          "metadata": {
            "basename": "fluke",
            "exclusive": false,
            "ids": "60-63",
            "ranks": "[60-63]",
            "type": "node"
          }
        },
        {
          "id": "1",
          "metadata": {
            "basename": "core",
            "exclusive": true,
            "ids": "[0-3]",
            "size": 4,
            "type": "core"
          }
        }],
    "edges": [
        {
          "metadata": {
            "name": {
              "containment": "contains"
            }
          },
          "source": "0",
          "target": "1"
        }]
}

Would something like this be feasible I wonder?

Edit: I left out "cluster" and "socket" resources just to make the example readable. Also, removed the "paths" in the serialization because it seems like these can be computed when unpacking serialized graph, so is it really necessary to duplicate in the serialization format?
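
To see what unpacking such a condensed graph could look like, here is a minimal Python sketch. It assumes every "contains" edge is multiplicative (each concrete parent gets its own copy of every child) and that a resource name is `basename` plus its local id. `expand_idset` and `expand_paths` are hypothetical helpers for illustration, not part of any Flux library, and the input keeps only the fields the sketch reads.

```python
def expand_idset(idset):
    """Expand "[0-3]" or "60-63" into a list of ints (simplified parser)."""
    ids = []
    for part in idset.strip("[]").split(","):
        lo, _, hi = part.partition("-")
        ids.extend(range(int(lo), int(hi or lo) + 1))
    return ids

def expand_paths(condensed):
    """Return one containment path per concrete resource."""
    meta = {v["id"]: v["metadata"] for v in condensed["nodes"]}
    children = {}
    for edge in condensed["edges"]:
        children.setdefault(edge["source"], []).append(edge["target"])
    targets = {edge["target"] for edge in condensed["edges"]}
    paths = []

    def walk(vid, prefix):
        for i in expand_idset(meta[vid]["ids"]):
            path = f"{prefix}/{meta[vid]['basename']}{i}"
            paths.append(path)
            for child in children.get(vid, []):
                walk(child, path)

    # Start from condensed vertices that are nobody's child.
    for vid in meta:
        if vid not in targets:
            walk(vid, "")
    return paths

condensed = {
    "nodes": [
        {"id": "0", "metadata": {"basename": "fluke", "ids": "60-63", "type": "node"}},
        {"id": "1", "metadata": {"basename": "core", "ids": "[0-3]", "type": "core"}},
    ],
    "edges": [{"source": "0", "target": "1"}],
}

paths = expand_paths(condensed)
print(len(paths), paths[:3])
```

Two condensed vertices expand to 4 nodes and 16 cores, and the paths fall out of the traversal, which is consistent with the idea that "paths" need not be serialized.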

dongahn commented 4 years ago

Yes, that's one example of compression schemes: https://github.com/flux-framework/flux-sched/issues/526

One thing that I don't know is whether applying an ad hoc compression to the representation itself would condense the RV2 better, or whether applying a general compression to the 'canonical' representation would give better results.

grondo commented 4 years ago

I would say both would probably be best. Even if you use general compression (by this I assume you mean something like gzip), there is some benefit to ad-hoc compression by decreasing the size of the JSON object ingested into the JSON parser...

garlick commented 4 years ago

This kind of "modified JGF" sounds pretty appealing to me.

We do already lz4 compress KVS data on the back end, so at least the KVS growth would be mitigated somewhat.

dongahn commented 4 years ago

I like this direction. Just as food for thought:

In terms of serialization and deserialization needs:

1) RV2  <--> JGF (un-condensed JGF) <--> Graph-like object to query and modify on 

Depending on the implementation, the uncondensed JGF can be omitted, which could very well be what resource would do.

2) RV2  <--> Graph-like object to query and modify on

I think the trade-off space is

Serialization/deserialization costs would probably be small, so IMHO this would be KVS storage and communication payloads vs. software complexity.

Maybe we should play with 1) and 2) a bit to make more progress. I have to think that for simple cases this transformation would be straightforward (as I already did something similar for R_lite), but I don't know whether it will be straightforward for more complex cases.

In terms of loss of information, compressed RV2 won't give uncondensed JGF unique vertex/edge IDs. I don't know if that's detrimental or not. Need to think some more about whether there could be some critical information that cannot be captured with the condensed form.

grondo commented 4 years ago

In terms of loss of information, compressed RV2 won't give uncondensed JGF unique vertex/edge IDs. I don't know if that's detrimental or not. Need to think some more about whether there could be some critical information that cannot be captured with the condensed form.

I had thought about this, but since JGF node "id" is also a string, could this be replaced with an idset as well?

Depending on the implementation, the uncondensed JGF can be omitted, which could very well be what resource would do.

I think I'm still a bit lost. If the implementation of RV2 is JGF, I'm not sure what you mean by omitting it. Are you considering one option is that JGF remains an optional part of R?

dongahn commented 4 years ago

I had thought about this, but since JGF node "id" is also a string, could this be replaced with an idset as well?

Yes, we can do this. But because id sequence will not be same as resource id sequence (e.g., core[0-35]), idset will not be well compressed.

grondo commented 4 years ago

Yes, we can do this. But because id sequence will not be same as resource id sequence (e.g., core[0-35]), idset will not be well compressed.

Oh yeah, and this would only work for resources at the highest level in the tree, for nodes [0-15] sharing child sockets [0-1] there are actually 32 unique socket resources, not just 2.

grondo commented 4 years ago

Could the containment "path" be used as a stand-in for a unique identifier for all resources? This could be computed after a compressed JGF is expanded.

One of the benefits of the unique identifier is so that an R used in a sub-instance several levels deep within the Flux instance hierarchy can relate its resources directly to any of its parents, including the original system instance. At first we had assigned uuids to each resource to enable this, but it seems like the containment path like /cluster0/node8/socket0/core1 uniquely identifies resources, as long as interior resource nodes are never pruned when creating R for jobs.
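
A small Python sketch of this idea: derive each vertex's containment path from the "contains" edges, then relate a job's R to its parent by path rather than by vertex id. It assumes the containment subsystem is a tree (one parent per vertex) and that interior vertices are kept in the job's R. The helper names and the renumbered child graph are made up for illustration.

```python
def containment_paths(graph):
    """Map each vertex id to its containment path, derived from the edges."""
    names = {v["id"]: v["metadata"]["name"] for v in graph["nodes"]}
    parent = {e["target"]: e["source"] for e in graph["edges"]}
    paths = {}
    for vid in names:
        parts, cur = [], vid
        while cur is not None:
            parts.append(names[cur])
            cur = parent.get(cur)
        paths[vid] = "/" + "/".join(reversed(parts))
    return paths

def vertex(vid, name):
    return {"id": vid, "metadata": {"name": name}}

def edge(source, target):
    return {"source": source, "target": target}

# Parent graph: vertex ids and names as in the rv1 example earlier in the thread.
parent_graph = {
    "nodes": [vertex("0", "cluster0"), vertex("1", "428d6d454f60"),
              vertex("2", "socket0"), vertex("7", "core2"), vertex("9", "core3")],
    "edges": [edge("0", "1"), edge("1", "2"), edge("2", "7"), edge("2", "9")],
}
# Hypothetical job R holding only core3, with vertex ids renumbered.
child_graph = {
    "nodes": [vertex("0", "cluster0"), vertex("1", "428d6d454f60"),
              vertex("2", "socket0"), vertex("3", "core3")],
    "edges": [edge("0", "1"), edge("1", "2"), edge("2", "3")],
}

parent_paths = containment_paths(parent_graph)
child_paths = containment_paths(child_graph)
# Vertex ids differ ("9" vs "3"), but the paths still line up.
print(child_paths["3"], child_paths["3"] == parent_paths["9"])
```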

dongahn commented 4 years ago

Oh yeah, and this would only work for resources at the highest level in the tree, for nodes [0-15] sharing child sockets [0-1] there are actually 32 unique socket resources, not just 2.

I think, in general, you can choose only a single compression criteria (like local resource's local id core[0-35]) at each level of resource hierarchy and if a resource has a per-resource field that cannot be compressed with that same criteria (e.g., uniq_id, uuid, properties whatever), you can't include them in the condensed JGF (or make the condensed node more fine-grained).

So we have to think about the loss of information and see if that's okay or not...

dongahn commented 4 years ago

Could the containment "path" be used as a stand-in for a unique identifier for all resources? This could be computed after a compressed JGF is expanded.

Oh yeah, this should be possible!

One of the benefits of the unique identifier is so that an R used in a sub-instance several levels deep within the Flux instance hierarchy can relate its resources directly to any of its parents, including the original system instance. At first we had assigned uuids to each resource to enable this, but it seems like the containment path like /cluster0/node8/socket0/core1 uniquely identifies resources, as long as interior resource nodes are never pruned when creating R for jobs.

Agreed.

dongahn commented 4 years ago

(or make the condensed node more fine-grained)

One example where this makes sense is like Corona that will have two different types of nodes (one with 4 GPUs vs. 8 GPUs).

dongahn commented 4 years ago

I think I'm still a bit lost. If the implementation of RV2 is JGF, I'm not sure what you mean by omitting it. Are you considering one option is that JGF remains an optional part of R?

I am talking about a phase where the proposed condensed JGF will be translated into the original JGF and vice versa.

For Fluxion, that may be the first step I may want to take.

Another example could be creating RV1 from an external source like Cray end points.

You may first want to collect the individual resource info from the external source and dump it into uncondensed JGF and then process it to become the proposed "condensed" RV2.

dongahn commented 4 years ago

I think, in general, you can choose only a single compression criteria (like local resource's local id core[0-35]) at each level of resource hierarchy and if a resource has a per-resource field that cannot be compressed with that same criteria (e.g., uniq_id, uuid, properties whatever), you can't include them in the condensed JGF (or make the condensed node more fine-grained).

Similarly,

{
  "id": "0",
  "metadata": {
    "basename": "fluke",
    "exclusive": false,
    "ids": "60-63",
    "ranks": "[60-63]",
    "type": "node"
  }
},

I think ids and ranks in general cannot be condensed cleanly this way?

grondo commented 4 years ago

I think ids and ranks in general cannot be condensed cleanly this way?

If there are the same number of values for each key, then you can condense I would assume, though perhaps not cleanly. You would have to "condense" on a primary key, say "ids", then have some standard way of generating the other condensed keys based on either the index or the value of primary key.

For your example above, the idset for ids and ranks would be required to have the same size, and during expansion as you "pop" each id you would pop its rank from the ranks set.

That reminds me that idsets can't actually be used here since we'd need a list.
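
The lockstep expansion described above might be sketched as follows in Python. The dependent key is treated as an ordered list rather than a set, so an ordering like "3,2,1,0" survives. `expand_list` and `expand_vertex` are hypothetical names, and the non-monotonic ranks are an invented example.

```python
def expand_list(spec):
    """Expand "0-3" or "3,2,1,0" into an ordered list, keeping order and duplicates."""
    values = []
    for part in spec.strip("[]").split(","):
        lo, _, hi = part.partition("-")
        values.extend(range(int(lo), int(hi or lo) + 1))
    return values

def expand_vertex(meta, primary="ids", dependent=("ranks",)):
    """Expand one condensed vertex, pairing dependent values with the primary key by index."""
    ids = expand_list(meta[primary])
    columns = {key: expand_list(meta[key]) for key in dependent}
    for key, values in columns.items():
        if len(values) != len(ids):
            raise ValueError(f"{key} has {len(values)} values, expected {len(ids)}")
    fixed = {k: v for k, v in meta.items() if k != primary and k not in dependent}
    # Each expanded vertex keeps the condensed key names but holds a single value.
    return [
        {**fixed, primary: ident, **{key: columns[key][i] for key in dependent}}
        for i, ident in enumerate(ids)
    ]

expanded = expand_vertex({"basename": "fluke", "ids": "60-63", "ranks": "3,2,1,0"})
print(expanded[0], expanded[-1])
```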

dongahn commented 4 years ago

I would say both would probably be best. Even if you use general compression (by this I assume you mean something like gzip), there is some benefit to ad-hoc compression by decreasing the size of the JSON object ingested into the JSON parser...

Just for fun, I used xz to compare the size of the individualized JGF vs. the proposed condensed JGF in a comparable form (remove some of the fields that were not used in the condensed JGF).

-rw-r--r--   1 ahn1 ahn1   1587 May 18 20:34 jtest.json
-rw-r--r--   1 ahn1 ahn1    692 May 18 20:15 prop.json

-rw-r--r--   1 ahn1 ahn1    316 May 18 20:34 jtest.json.xz
-rw-r--r--   1 ahn1 ahn1    284 May 18 20:15 prop.json.xz

While the condensed JGF reduces the data size by 2.29x, when compressed its impact isn't as dramatic: 1.11x. (wonders of compression tools...)
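
For reference, the two ratios come straight from the file sizes in the listing above; a short Python check of the arithmetic:

```python
# File sizes in bytes, copied from the directory listing above.
sizes = {
    "jtest.json": 1587,
    "prop.json": 692,
    "jtest.json.xz": 316,
    "prop.json.xz": 284,
}

raw_ratio = sizes["jtest.json"] / sizes["prop.json"]
xz_ratio = sizes["jtest.json.xz"] / sizes["prop.json.xz"]
print(f"{raw_ratio:.2f}x raw, {xz_ratio:.2f}x after xz")
```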

This impact would be much bigger at larger scale R though.

We may want to continue to test our proposed scheme to check the gains.

jtest.json.txt prop.json.txt

dongahn commented 4 years ago

If there are the same number of values for each key, then you can condense I would assume, though perhaps not cleanly. You would have to "condense" on a primary key, say "ids", then have some standard way of generating the other condensed keys based on either the index or the value of primary key.

Exactly.

FWIW, when I gave some thoughts to it (https://github.com/flux-framework/flux-sched/issues/526#issuecomment-538664189), an insight I got was -- it would be best if other keys can be expressed as some regular function of the primary key...

grondo commented 4 years ago

In the mpir proctable encoding of the shell I had a similar ad-hoc scheme for doing a mixed range+delta JSON encoding of the proctable values.

A proctable entry is a JSON array with the form [hostname:s, app:s, rank:i, pid:i]. For demonstration, the shell mpir implementation encodes the following set of arrays:

["foo0","myapp",0,1234]
["foo0","myapp",1,1235]
["foo0","myapp",2,1236]
["foo0","myapp",3,1237]
["foo1","myapp",4,4589]
["foo1","myapp",5,4590]
["foo1","myapp",6,4591]
["foo1","myapp",7,4592]

Into the final proctable object

{
  "hosts":[["foo",[[0,-3],[1,-3]]]],
  "executables":[["myapp",[[-1,-7]]]],
  "ids":[[0,7]],
  "pids":[[1234,3],[3352,3]]
}

This works because each entry in the final object is required to be an encoded "array" of the same number of elements...
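
A minimal Python decoder for the object above, to make the encoding easier to follow. The decoding rule is inferred from this one example rather than taken from the flux-core source, so treat it as an assumption: each `[delta, n]` pair starts at the previous value plus `delta`, then either counts up `n` more times (`n >= 0`) or repeats `-n` more times (`n < 0`), and a negative suffix means "no suffix".

```python
def decode(entries):
    """Decode a list of [delta, n] pairs into a flat list of ints (inferred rule)."""
    values, prev = [], 0
    for delta, n in entries:
        start = prev + delta
        if n >= 0:
            run = [start + i for i in range(n + 1)]  # start, start+1, ...
        else:
            run = [start] * (1 - n)                  # start repeated -n more times
        values.extend(run)
        prev = run[-1]
    return values

def decode_names(groups):
    """Decode [prefix, entries] groups into strings such as "foo0"."""
    names = []
    for prefix, entries in groups:
        for suffix in decode(entries):
            names.append(prefix if suffix < 0 else f"{prefix}{suffix}")
    return names

proctable = {
    "hosts": [["foo", [[0, -3], [1, -3]]]],
    "executables": [["myapp", [[-1, -7]]]],
    "ids": [[0, 7]],
    "pids": [[1234, 3], [3352, 3]],
}

rows = [
    list(row)
    for row in zip(
        decode_names(proctable["hosts"]),
        decode_names(proctable["executables"]),
        decode(proctable["ids"]),
        decode(proctable["pids"]),
    )
]
for row in rows:
    print(row)
```

Under that inferred rule the decoder reproduces the eight arrays listed above, including the jump from pid 1237 to 4589 (1237 + 3352).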

grondo commented 4 years ago

FWIW, when I gave some thoughts to it (flux-framework/flux-sched#526 (comment)), an insight I got was -- it would be best if other keys can be expressed as some regular function of the primary key...

Oh, that would be ideal!

dongahn commented 4 years ago

One thing I'm unclear on after this discussion today, though:

Should our canonical resource set representation would be the original uncondensed JGF and the various condensing optimization should be a raw storage or data layout or the condensed representation itself should our canonical representation...

garlick commented 4 years ago

Just a reminder that flux-core already depends on liblz4. I'm not sure it's clear it will be a win to trade computation/extra complexity for message size, but if we do go that way, I prefer we not take on another compression library dependency. lz4 does pretty well anyway:

$ lz4c jtest.json.txt
Compressed filename will be : jtest.json.txt.lz4 
Compressed 1587 bytes into 427 bytes ==> 26.91%                                
$ lz4c prop.json.txt
Compressed filename will be : prop.json.txt.lz4 
Compressed 692 bytes into 341 bytes ==> 49.28%   

(3.71x and 2.02x respectively; with lz4c -9, I get 4.42x and 2.26x)

garlick commented 4 years ago

Should our canonical resource set representation would be the original uncondensed JGF and the various condensing optimization should be a raw storage or data layout or the condensed representation itself should our canonical representation...

@dongahn could you recompile this sentence with different optimization please? :-)

grondo commented 4 years ago

Should our canonical resource set representation would be the original uncondensed JGF and the various condensing optimization should be a raw storage or data layout or the condensed representation itself should our canonical representation...

Here's my initial thought, though I don't claim to have the right answer. Fully specified JGF should be the default canonical representation. However, a simplified, condensed version (also valid JGF) should be allowed where there is no information loss. (Something simple and obvious like above).

dongahn commented 4 years ago

Should our canonical resource set representation would be the original uncondensed JGF and the various condensing optimization should be a raw storage or data layout or the condensed representation itself should our canonical representation...

@dongahn could you recompile this sentence with different optimization please? :-)

Sorry. I guess the point I was trying to make: what should be our canonical representation -- the condensed form or un-condensed form. A compiler analogy: they have the canonical intermediate representation (IR), which then gets compiled down to machine code (actual storage format).

dongahn commented 4 years ago

Just a reminder that flux-core already depends on liblz4. I'm not sure it's clear it will be a win to trade computation/extra complexity for message size, but if we do go that way, I prefer we not take on another compression library dependency. lz4 does pretty well anyway:

That's fine. The reason for the testing was just to see the relative advantages of two forms when "compressed".

Using your example:

The condensed form is 2.29x better in raw size (1587/692). But when compressed with lz4c, the condensed form is only 1.25x better (427/341). Since we will likely keep the object compressed, I wasn't sure if this 25% was worth the extra complexity. But like I said, the relative advantages would change at larger scale, hence my comment:

We may want to continue to test our proposed scheme to check the gains.

Hope this makes better sense.

dongahn commented 4 years ago

Here's my initial thought, though I don't claim to have the right answer. Fully specified JGF should be the default canonical representation. However, a simplified, condensed version (also valid JGF) should be allowed where there is no information loss. (Something simple and obvious like above).

I don't have the right answer here either. BTW, don't get me wrong though. I'm asking all these questions to think this through. Hopefully we can settle on something really cool in the end :-).

dongahn commented 4 years ago

Here's my initial thought, though I don't claim to have the right answer. Fully specified JGF should be the default canonical representation. However, a simplified, condensed version (also valid JGF) should be allowed where there is no information loss. (Something simple and obvious like above).

@grondo: Just to confirm, I like the hybrid approach like I said in the last meeting. In a compiler world, there is a difference between canonical vs. non-canonical representations, but we don't have to be too pedantic here.

In particular at the system instance, it should be straightforward to emit the "condensed" form either from a resource configuration spec (or other external sources).

At this point, I am unclear how easy or difficult it would be for Fluxion to emit the condensed form instead of the fully concretized JGF. But since compression and such is a task we need to do anyway, having a target should be helpful. By making fully specified JGF the default representation, we will be able to take a phased approach to learn how to do this properly.

Two things:

  1. We may need to specify edge types. The containment edge in the example here (https://github.com/flux-framework/rfc/issues/237#issuecomment-630340046) is a multiplicative edge, which means each specified parent vertex has an edge to each specified child vertex. But there are cases where we need an associated edge: via such an edge, one specified vertex is associated with another specified vertex.

The proposed form is very similar to GRUG (https://github.com/flux-framework/flux-sched/blob/master/resource/utilities/README.md#recipe-graph-definition). I used that format to specify a recipe to generate a fully concretized JGF. In fact, the first way to support RV2 from the system instance would be to use the new format as another generation recipe.

  2. This issue isn't specific to RV2. But did you think about how to remap RV1 to the execution targets in a nested instance name space?

grondo commented 4 years ago

This issue isn't specific to RV2. But did you think about how to remap RV1 to the execution targets in a nested instance name space?

My only idea here is to have the exec system emit an R that is annotated with the assigned task slots. A child instance can reasonably assume that task slot ids directly map to broker ranks. (Actually writing that, maybe it is the job shell that would need to annotate R?)

dongahn commented 4 years ago

My only idea here is to have the exec system emit an R that is annotated with the assigned task slots. A child instance can reasonably assume that task slot ids directly map to broker ranks. (Actually writing that, maybe it is the job shell that would need to annotate R?)

Great idea.

If we were to go this route, I think we should consider explicitly formalizing the relationship between the task slot id space of the parent instance and the execution target ID space of a nested instance. (Augmenting some RFC).

The other idea I was thinking about was for the nested instance to go through a "remap" step by comparing its overall RV2 with per execution target hwloc info. This would be similar to what you might do at the system instance.

But if the relationship between the task slot id of the parent and the execution target ID space of a nested instance can become explicit and formalized, that would lead to a much more efficient implementation, I think.
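
Purely as a thought experiment, the explicit relationship could be as simple as "task slot id in the parent equals broker rank in the child". The Python sketch below assumes a hypothetical `slots` annotation added to each R_lite entry; no such key exists in Rv1, and the layout is invented for illustration.

```python
def child_rank_map(annotated_r_lite):
    """Map child broker rank -> parent rank and cores, assuming slot id == child rank."""
    mapping = {}
    for entry in annotated_r_lite:
        for slot in entry["slots"]:  # "slots" is a hypothetical annotation
            mapping[slot["id"]] = {
                "parent_rank": int(entry["rank"]),
                "core": slot["core"],
            }
    return mapping

# Hypothetical R_lite for a 4-slot job on two parent ranks.
annotated = [
    {"rank": "3", "children": {"core": "0-3"},
     "slots": [{"id": 0, "core": "0-1"}, {"id": 1, "core": "2-3"}]},
    {"rank": "4", "children": {"core": "0-3"},
     "slots": [{"id": 2, "core": "0-1"}, {"id": 3, "core": "2-3"}]},
]

mapping = child_rank_map(annotated)
print(mapping[2])
```

With the mapping made explicit like this, a nested instance would not need a separate discovery or remap pass to learn which resources back each of its ranks.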

dongahn commented 4 years ago

How easy or difficult would it be to do this annotation directly in the condensed format?

dongahn commented 4 years ago

@grondo:

It feels like we have some good verbal exchanges so far, and maybe we can start a (simplified) strawman RFC for RV2 and do some prototyping to test its viability.

My takeaways so far:

  1. The fully concretized JGF is our default canonical resource set and we extend it to support a condensed form (like you proposed up there) as well.
  2. Investigate ways to emit hwloc info into RV2: our scheduler already knows how to do this in the fully concretized JGF, so we only need to check the feasibility of this for the condensed form.
  3. Investigate ways for Fluxion to emit a condensed format (this would require multiple steps)
  4. Investigate ways to rewrite an RV2 object with slot ids for nested instance support
  5. @SteVwonder may want to test whether we can emit externally gathered multi-tiered storage enabled system configurations into the condensed RV2 (I have some ideas about how to formulate some good tests).

grondo commented 4 years ago

The other idea I was thinking about was for the nested instance to go through a "remap" step by comparing its overall RV2 with per execution target hwloc info. This would be similar to what you might do at the system instance.

This might be best, but I wasn't sure if it was a tractable problem! If we have a way to do it, then like you said the core resource module could annotate execution targets at instance startup in either the case of a system instance or child instance.

TBH, I'm not sure exactly the best way to add the execution target annotation to R yet. Would it be best to add a property to an existing resource (vertex), or would it make more sense to treat execution target as a "grouping" vertex (i.e. non-resource vertex).

grondo commented 4 years ago

It feels like we have some good verbal exchanges so far, and maybe we can start a (simplified) strawman RFC for RV2 and do some prototyping to test its viability.

Great. Now that the most recent sched-simple PR is in I'm going to try to make some progress on Rv2.

In flux-core, we have a lot of users of the R_lite format that will need to transition to Rv2. My idea is to prototype a C API reader of some form of Rv2, and then add a function that can convert to R_lite as a transition tool.

Then we can begin to add functionality required by flux-core components (resource, job-exec, job-info and sched-simple modules, as well as the job shell), and transition these components to the new library, allowing underlying R format to change or be updated without breaking core.

Once this is working we can then update resource module to use Rv2 in the acquire protocol, which would allow us to break our dependence on all ranks being "online" before the acquire first response.

grondo commented 4 years ago

@dongahn, do you have any suggestions on how to do a task slot/execution target annotation to a resource set? It seems like you have to have some way to group resources, so either a tag on every resource in the slot, or would it be better to allow some kind of virtual resource group vertex (similar to how slot is specified in jobspec)?

dongahn commented 4 years ago

This might be best, but I wasn't sure if it was a tractable problem!

Functionality-wise this is tractable. Scalability-wise, we may need more cleverness, I think.

We have a functionality proof of concept in our old scheduler (version 0.7). I called it link since we link a rankless RDL-generated resource object to a rank by matching the resource signature between RDL and hwloc objects. (I used a simple match criteria but this can be improved.) But considering the nested system, this should be called map or remap operation.

This isn't that scalable because only one process does this operation.
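
A toy Python version of such a link step, to show the shape of the problem. It is not the code from the old scheduler: the "signature" here is just a sorted tuple of resource counts, and the node names, ranks, and counts are invented.

```python
def signature(resources):
    """A deliberately simple resource signature: sorted (type, count) pairs."""
    return tuple(sorted(resources.items()))

def link(rankless, hwloc_by_rank):
    """Assign each rankless resource object the first unused rank with a matching signature."""
    unassigned = dict(hwloc_by_rank)
    linked = {}
    for name, resources in rankless.items():
        want = signature(resources)
        rank = next(
            (r for r, have in sorted(unassigned.items()) if signature(have) == want),
            None,
        )
        if rank is None:
            raise LookupError(f"no rank matches {name}")
        linked[name] = rank
        del unassigned[rank]
    return linked

# Invented inputs: two rankless node descriptions and per-rank hwloc summaries.
rankless = {"node0": {"core": 4, "gpu": 0}, "node1": {"core": 4, "gpu": 4}}
hwloc_by_rank = {0: {"core": 4, "gpu": 4}, 1: {"core": 4, "gpu": 0}}

linked = link(rankless, hwloc_by_rank)
print(linked)
```

Even in this toy form the matching is a single sequential loop over all resources, which illustrates the scalability concern.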