dongahn opened this issue 4 years ago
As a start, let's propose a very simple JGF version of R and do a straw man integration with something like simple-sched. Since R would no longer be human readable, we'd need to develop some tools to display and operate on R as well...
My initial proposal would be to just use the format under the `scheduling` key of https://github.com/flux-framework/rfc/blob/master/spec_20.rst as our canonical format.
I will post a simple example that contains cluster, node, and core as currently used by flux-core. This limitation can serve as V1 of this canonical format.
flux-sched also has code to parse it, so we may borrow from that code for the straw man.
We discussed this a bit at today's meeting:
Is the canonical representation of R good enough to be targeted by various writers/generators? A writer/generator will produce an R and expect that the rest of the system will more or less just work:
Other services will then be served off of the R. Many will use manipulation libraries, though.
A comment was made that JGF is likely to work, but a harder part will be how to map execution targets to R.
@dongahn, are there instructions for having flux-sched generate an R containing the `scheduling` key and a JGF representation of resources? I'd like to generate samples of that format for study.
@grondo:
It would have been a bit nicer since it has some fixes, but please look at: https://github.com/flux-framework/flux-sched/blob/2c3b9ec75139f408f75ac3963b77c087598c27d6/t/t1006-recovery-full.t#L28
The load option (`match-format=rv1`) should allow `fluxion-resource` to generate the full rv1 instead of rv1_nosched, which omits the JGF key.
I was planning to spend some time on this next week as well, so this is great timing.
You should also be able to change the match emit format through resource's rc1 script:
FLUXION_RESOURCE_OPTIONS="match-format=rv1 load-whitelist=node,core,gpu"
If you want to look at this for more advanced graph representations, please consider using `resource-query` as well. It has the same emit options as a CLI option:
-F, --match-format=<simple|pretty_simple|jgf|rlite|rv1|rv1_nosched>
Specify the emit format of the matched resource set.
(default=simple).
Example GRUG files including things like multi-tiered storage configurations:
https://github.com/flux-framework/flux-sched/blob/master/t/t3020-resource-mtl2.t#L9
Thanks! I was able to do:
$ flux module reload resource match-format=rv1
For my own benefit, here's an example rv1 for a 2-core allocation in a docker container
ƒ(s=1,d=0) fluxuser@428d6d454f60:~$ flux job info 646258360320 R | jq
{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "0",
        "node": "428d6d454f60",
        "children": {
          "core": "2-3"
        }
      }
    ],
    "starttime": 1589816042,
    "expiration": 1590420842
  },
  "scheduling": {
    "graph": {
      "nodes": [
        {
          "id": "7",
          "metadata": {
            "type": "core",
            "basename": "core",
            "name": "core2",
            "id": 2,
            "uniq_id": 7,
            "rank": 0,
            "exclusive": true,
            "unit": "",
            "size": 1,
            "paths": {
              "containment": "/cluster0/428d6d454f60/socket0/core2"
            }
          }
        },
        {
          "id": "9",
          "metadata": {
            "type": "core",
            "basename": "core",
            "name": "core3",
            "id": 3,
            "uniq_id": 9,
            "rank": 0,
            "exclusive": true,
            "unit": "",
            "size": 1,
            "paths": {
              "containment": "/cluster0/428d6d454f60/socket0/core3"
            }
          }
        },
        {
          "id": "2",
          "metadata": {
            "type": "socket",
            "basename": "socket",
            "name": "socket0",
            "id": 0,
            "uniq_id": 2,
            "rank": 0,
            "exclusive": false,
            "unit": "",
            "size": 1,
            "paths": {
              "containment": "/cluster0/428d6d454f60/socket0"
            }
          }
        },
        {
          "id": "1",
          "metadata": {
            "type": "node",
            "basename": "428d6d454f60",
            "name": "428d6d454f60",
            "id": -1,
            "uniq_id": 1,
            "rank": 0,
            "exclusive": false,
            "unit": "",
            "size": 1,
            "paths": {
              "containment": "/cluster0/428d6d454f60"
            }
          }
        },
        {
          "id": "0",
          "metadata": {
            "type": "cluster",
            "basename": "cluster",
            "name": "cluster0",
            "id": 0,
            "uniq_id": 0,
            "rank": -1,
            "exclusive": false,
            "unit": "",
            "size": 1,
            "paths": {
              "containment": "/cluster0"
            }
          }
        }
      ],
      "edges": [
        {
          "source": "2",
          "target": "7",
          "metadata": {
            "name": {
              "containment": "contains"
            }
          }
        },
        {
          "source": "2",
          "target": "9",
          "metadata": {
            "name": {
              "containment": "contains"
            }
          }
        },
        {
          "source": "1",
          "target": "2",
          "metadata": {
            "name": {
              "containment": "contains"
            }
          }
        },
        {
          "source": "0",
          "target": "1",
          "metadata": {
            "name": {
              "containment": "contains"
            }
          }
        }
      ]
    }
  }
}
Great!
Note that the `scheduling` key has much more detailed information than the `R_lite` key, even for a two-core allocation. So for things like the high-throughput case, I still want to specialize the scheduler behavior and omit JGF. In general, though, compared to how hwloc represents its resources (i.e., exportable XML), this would be lighter.
As a start, let's propose a very simple JGF version of R and do a straw man integration with something like simple-sched.
How does sched-simple use hwloc data? Would it be straightforward to create an interface such that it can turn this form into what sched-simple requires?
How does sched-simple use hwloc data? Would it be straightforward to create an interface such that it can turn this form into what sched-simple requires?
sched-simple does not use hwloc data directly, but instead reads the aggregated information from `resource.hwloc.by_rank`, which is a flattened and very condensed list of resources (especially when all ranks are the same):
ƒ(s=64,d=0) fluxuser@16ea7ed726d5:~$ flux kvs get resource.hwloc.by_rank
{"[0-63]": {"Package": 1, "Core": 4, "PU": 4, "cpuset": "0-3"}}
Of course JGF has more than enough information in it to be used by the simple scheduler.
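For illustration, here's a minimal Python sketch (my own, not flux-core code) of how a by_rank object like the one above expands into a per-rank map; the idset parsing handles only the simple forms shown here:

```python
import json

def expand_idset(s):
    """Expand a simple idset string like "[0-63]" or "0-3,7" into ints."""
    ids = []
    for part in s.strip("[]").split(","):
        lo, _, hi = part.partition("-")
        ids.extend(range(int(lo), int(hi or lo) + 1))
    return ids

def by_rank_to_dict(by_rank_json):
    """Map each broker rank to its condensed resource summary."""
    per_rank = {}
    for idset, summary in json.loads(by_rank_json).items():
        for rank in expand_idset(idset):
            per_rank[rank] = summary
    return per_rank

# Using the output above:
by_rank = '{"[0-63]": {"Package": 1, "Core": 4, "PU": 4, "cpuset": "0-3"}}'
print(by_rank_to_dict(by_rank)[0])
# {'Package': 1, 'Core': 4, 'PU': 4, 'cpuset': '0-3'}
```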
So for things like the high-throughput case, I still want to specialize the scheduler behavior and omit JGF.
I thought we were proposing an Rv2 where the format was JGF?
FYI -- the JGF reader code in flux-sched is https://github.com/flux-framework/flux-sched/blob/master/resource/readers/resource_reader_jgf.hpp, which reads this and updates the graph data store. It updates not only the spatial schema of vertices and edges but also the scheduler metadata, though.
@milroy has algorithms and code that can also grow the graph data store using a new JGF, which is the current topic for our cluster submission.
The emitted JGF can be fed into `resource-query` and used for further scheduling as well. Taking the JGF portion from your example and storing it into ./resource.json:
ahn1@49674596c035:/usr/src/resource/utilities$ flux mini run --dry-run -n 1 hostname > jobspec.json
ahn1@49674596c035:/usr/src/resource/utilities$ ./resource-query -L resource.json -f jgf -F pretty_simple
INFO: Loading a matcher: CA
resource-query> match allocate jobspec.json
---cluster0[1:shared]
------428d6d454f60[1:shared]
---------socket0[1:shared]
------------core3[1:exclusive]
INFO: =============================
INFO: JOBID=1
INFO: RESOURCES=ALLOCATED
INFO: SCHEDULED AT=Now
INFO: =============================
ƒ(s=64,d=0) fluxuser@16ea7ed726d5:~$ flux kvs get resource.hwloc.by_rank
{"[0-63]": {"Package": 1, "Core": 4, "PU": 4, "cpuset": "0-3"}}
I see.
Is Package used by sched-simple though?
Is cpuset the idset of core IDs or PU IDs?
What does this look like when each rank's resource set is different? We may not have this case yet though.
I thought we were proposing an Rv2 where the format was JGF?
Where will Rv2 be used? Will the resource module use the resource set of its instance from it to satisfy its query? If that's the case, does the R of each job include the full JGF, or do only those jobs that will spawn a new Flux instance need it?
Is Package used by sched-simple though?
No, but `by_rank` is the aggregate of hwloc data, and other hwloc objects are summarized for informational purposes. I don't think this format was ever meant to be used long term, though.
What does this look like when each rank's resource set is different? We may not have this case yet though.
There is an idset entry for each rank or set of ranks that has different summary information.
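For example, a made-up by_rank (an illustration, not real output) for a cluster where rank 63 has a different node type might look like:

```json
{
  "[0-62]": {"Package": 1, "Core": 4, "PU": 4, "cpuset": "0-3"},
  "63": {"Package": 2, "Core": 8, "PU": 8, "cpuset": "0-7"}
}
```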
Where will Rv2 be used? Will the resource module use the resource set of its instance from it to satisfy its query? If that's the case, does the R of each job include the full JGF, or do only those jobs that will spawn a new Flux instance need it?
I thought Rv2 was going to be our next step towards a "canonical" resource set representation. I think you summarized it well in the comment above.
As the canonical representation, R would be the common resource set serialization used by all Flux components that transmit and share resource sets.
The resource module would use R during "discovery", e.g. fetch R from the parent or use hwloc data to generate an R. Sorry if the above is obvious...
One simple idea would be to allow something between R_lite and full JGF by allowing JGF "nodes" to represent multiple identical resources. E.g. something along the lines of:
{
  "nodes": [
    {
      "id": "0",
      "metadata": {
        "basename": "fluke",
        "exclusive": false,
        "ids": "60-63",
        "ranks": "[60-63]",
        "type": "node"
      }
    },
    {
      "id": "1",
      "metadata": {
        "basename": "core",
        "exclusive": true,
        "ids": "[0-3]",
        "size": 4,
        "type": "core"
      }
    }
  ],
  "edges": [
    {
      "metadata": {
        "name": {
          "containment": "contains"
        }
      },
      "source": "0",
      "target": "1"
    }
  ]
}
Would something like this be feasible, I wonder?
Edit: I left out "cluster" and "socket" resources just to make the example readable. Also, I removed the "paths" in the serialization because it seems like these can be computed when unpacking the serialized graph, so is it really necessary to duplicate them in the serialization format?
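To sanity check feasibility, here's a rough Python sketch of the expansion under my assumptions (one vertex per id in the `ids` idset, every expanded child replicated under every expanded parent, and containment paths recomputed from basename + id as suggested above); all helper names are made up:

```python
def expand_idset(s):
    """Expand "60-63" or "[0-3]" into a list of ints."""
    out = []
    for part in s.strip("[]").split(","):
        lo, _, hi = part.partition("-")
        out.extend(range(int(lo), int(hi or lo) + 1))
    return out

def expand_condensed(graph, root="0", prefix="/cluster0"):
    """Expand a condensed JGF into individual vertices with paths."""
    meta = {n["id"]: n["metadata"] for n in graph["nodes"]}
    kids = {}
    for e in graph["edges"]:
        kids.setdefault(e["source"], []).append(e["target"])
    vertices = []

    def walk(nid, parent_path):
        m = meta[nid]
        for i in expand_idset(m["ids"]):
            name = "%s%d" % (m["basename"], i)
            path = "%s/%s" % (parent_path, name)
            vertices.append({"type": m["type"], "name": name, "path": path})
            for child in kids.get(nid, []):
                walk(child, path)

    walk(root, prefix)
    return vertices

# With the example above: 4 node vertices and 16 core vertices, e.g.
# {'type': 'core', 'name': 'core2', 'path': '/cluster0/fluke61/core2'}
```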
Yes, that's one example of compression schemes: https://github.com/flux-framework/flux-sched/issues/526
One thing that I don't know is whether applying an ad hoc compression to the representation itself would condense RV2 better, or whether applying a general compression to the 'canonical' representation would yield better results.
I would say both would probably be best. Even if you use general compression (by this I assume you mean something like gzip), there is some benefit to ad-hoc compression by decreasing the size of the JSON object ingested into the JSON parser...
This kind of "modified JGF" sounds pretty appealing to me.
We do already lz4 compress KVS data on the back end, so at least the KVS growth would be mitigated somewhat.
I like this direction. Just as food for thought:
In terms of serialization and deserialization needs:
1) RV2 <--> JGF (un-condensed JGF) <--> Graph-like object to query and modify on
Depending on the implementation, the uncondensed JGF can be omitted, which could very well be what resource would do.
2) RV2 <--> Graph-like object to query and modify on
I think the trade-off space is: serialization/deserialization costs would probably be small, so IMHO this comes down to KVS storage and communication payloads vs. software complexity.
Maybe we should play with 1) and 2) a bit to make more progress. I have to think that for simple cases this transformation would be straightforward (as I already did something similar for R_lite), but I don't know whether it will be straightforward for more complex cases.
In terms of loss of information, compressed RV2 won't give the uncondensed JGF unique vertex/edge IDs. I don't know if that's detrimental or not. I need to think some more about whether there could be some critical information that cannot be captured with the condensed form.
In terms of loss of information, compressed RV2 won't give the uncondensed JGF unique vertex/edge IDs. I don't know if that's detrimental or not. I need to think some more about whether there could be some critical information that cannot be captured with the condensed form.
I had thought about this, but since JGF node "id" is also a string, could this be replaced with an idset as well?
Depending on the implementation, the uncondensed JGF can be omitted, which could very well be what resource would do.
I think I'm still a bit lost. If the implementation of RV2 is JGF, I'm not sure what you mean by omitting it. Are you considering an option where JGF remains an optional part of R?
I had thought about this, but since JGF node "id" is also a string, could this be replaced with an idset as well?
Yes, we can do this. But because the id sequence will not be the same as the resource id sequence (e.g., core[0-35]), the idset will not compress well.
Yes, we can do this. But because the id sequence will not be the same as the resource id sequence (e.g., core[0-35]), the idset will not compress well.
Oh yeah, and this would only work for resources at the highest level in the tree; for nodes [0-15] sharing child sockets [0-1], there are actually 32 unique socket resources, not just 2.
Could the containment "path" be used as a stand-in for a unique identifier for all resources? This could be computed after a compressed JGF is expanded.
One of the benefits of the unique identifier is so that an R used in a sub-instance several levels deep within the Flux instance hierarchy can relate its resources directly to any of its parents, including the original system instance. At first we had assigned uuids to each resource to enable this, but it seems like the containment path like /cluster0/node8/socket0/core1 uniquely identifies resources, as long as interior resource nodes are never pruned when creating R for jobs.
Oh yeah, and this would only work for resources at the highest level in the tree; for nodes [0-15] sharing child sockets [0-1], there are actually 32 unique socket resources, not just 2.
I think, in general, you can choose only a single compression criterion (like a local resource's local id, core[0-35]) at each level of the resource hierarchy, and if a resource has a per-resource field that cannot be compressed with that same criterion (e.g., uniq_id, uuid, properties, whatever), you can't include it in the condensed JGF (or make the condensed node more fine-grained).
So we have to think about the loss of information and see if that's okay or not...
Could the containment "path" be used as a stand-in for a unique identifier for all resources? This could be computed after a compressed JGF is expanded.
Oh yeah, this should be possible!
One of the benefits of the unique identifier is so that an R used in a sub-instance several levels deep within the Flux instance hierarchy can relate its resources directly to any of its parents, including the original system instance. At first we had assigned uuids to each resource to enable this, but it seems like the containment path like /cluster0/node8/socket0/core1 uniquely identifies resources, as long as interior resource nodes are never pruned when creating R for jobs.
Agreed.
(or make the condensed node more fine-grained)
One example where this makes sense is Corona, which will have two different types of nodes (nodes with 4 GPUs vs. 8 GPUs).
I think I'm still a bit lost. If the implementation of RV2 is JGF, I'm not sure what you mean by omitting it. Are you considering an option where JGF remains an optional part of R?
I am talking about a phase where the proposed condensed JGF will be translated into the original JGF and vice versa.
For Fluxion, that may be the first step I want to take.
Another example could be creating RV1 from an external source like Cray endpoints.
You may first want to collect the individual resource info from the external source, dump it into uncondensed JGF, and then process it to become the proposed "condensed" RV2.
I think, in general, you can choose only a single compression criterion (like a local resource's local id, core[0-35]) at each level of the resource hierarchy, and if a resource has a per-resource field that cannot be compressed with that same criterion (e.g., uniq_id, uuid, properties, whatever), you can't include it in the condensed JGF (or make the condensed node more fine-grained).
Similarly,
{ "id": "0", "metadata": { "basename": "fluke", "exclusive": false, "ids": "60-63", "ranks": "[60-63]", "type": "node" } }
I think ids and ranks in general cannot be condensed cleanly this way?
I think ids and ranks in general cannot be condensed cleanly this way?
If there are the same number of values for each key, then you can condense I would assume, though perhaps not cleanly. You would have to "condense" on a primary key, say "ids", then have some standard way of generating the other condensed keys based on either the index or the value of primary key.
For your example above, the idsets for `ids` and `ranks` would be required to have the same size, and during expansion, as you "pop" each id you would pop its rank from the `ranks` set.
That reminds me that idsets can't actually be used here since we'd need a list.
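A tiny sketch of that lockstep expansion (made-up field shapes; plain lists rather than idsets, per the point just above):

```python
def expand_parallel(vertex):
    """Expand a condensed vertex whose metadata holds parallel lists,
    popping one value per secondary key for each primary-key id."""
    ids, ranks = vertex["ids"], vertex["ranks"]   # ids is the primary key
    assert len(ids) == len(ranks), "parallel keys must have equal length"
    for i, r in zip(ids, ranks):
        yield {"type": vertex["type"], "id": i, "rank": r}

node = {"type": "node", "ids": [60, 61, 62, 63], "ranks": [60, 61, 62, 63]}
print(list(expand_parallel(node)))
# [{'type': 'node', 'id': 60, 'rank': 60}, ...]
```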
I would say both would probably be best. Even if you use general compression (by this I assume you mean something like gzip), there is some benefit to ad-hoc compression by decreasing the size of the JSON object ingested into the JSON parser...
Just for fun, I used `xz` to compare the size of the individualized JGF vs. the proposed condensed JGF in a comparable form (removing some of the fields that were not used in the condensed JGF):
-rw-r--r-- 1 ahn1 ahn1 1587 May 18 20:34 jtest.json
-rw-r--r-- 1 ahn1 ahn1 692 May 18 20:15 prop.json
-rw-r--r-- 1 ahn1 ahn1 316 May 18 20:34 jtest.json.xz
-rw-r--r-- 1 ahn1 ahn1 284 May 18 20:15 prop.json.xz
While the condensed JGF reduces the raw data size by 2.29x, when compressed its impact isn't as dramatic: 1.11x. (Wonders of compression tools...)
This impact would be much bigger for larger-scale R, though.
We may want to continue to test our proposed scheme to check the gains.
If there are the same number of values for each key, then you can condense I would assume, though perhaps not cleanly. You would have to "condense" on a primary key, say "ids", then have some standard way of generating the other condensed keys based on either the index or the value of primary key.
Exactly.
FWIW, when I gave some thought to it (https://github.com/flux-framework/flux-sched/issues/526#issuecomment-538664189), an insight I got was -- it would be best if the other keys can be expressed as some regular function of the primary key...
In the mpir proctable encoding of the shell I had a similar ad-hoc scheme for doing a mixed range+delta JSON encoding of the proctable values.
A proctable entry is a JSON array with the form `[hostname:s, app:s, rank:i, pid:i]`. For demonstration, the shell mpir implementation encodes the following set of arrays:
["foo0","myapp",0,1234]
["foo0","myapp",1,1235]
["foo0","myapp",2,1236]
["foo0","myapp",3,1237]
["foo1","myapp",4,4589]
["foo1","myapp",5,4590]
["foo1","myapp",6,4591]
["foo1","myapp",7,4592]
into the final proctable object:
{
  "hosts": [["foo", [[0, -3], [1, -3]]]],
  "executables": [["myapp", [[-1, -7]]]],
  "ids": [[0, 7]],
  "pids": [[1234, 3], [3352, 3]]
}
This works because each entry in the final object is required to be an encoded "array" of the same number of elements...
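For the record, here is a small Python decoder that matches my description above. This is just my reading of the encoding (value = delta from the previously emitted value, absolute for the first pair; positive count = that many consecutive +1 steps; negative count = that many repeats), not the shell's actual code:

```python
def decode_rangelist(entries):
    """Decode a mixed range+delta list of [value, count] pairs."""
    out = []
    for value, count in entries:
        base = value if not out else out[-1] + value  # delta after first pair
        if count >= 0:
            out.extend(base + j for j in range(count + 1))  # +1 run
        else:
            out.extend([base] * (1 - count))                # repeat run
    return out

print(decode_rangelist([[0, 7]]))                # ids: [0, 1, ..., 7]
print(decode_rangelist([[1234, 3], [3352, 3]]))  # pids: 1234-1237, 4589-4592
# hosts pair a basename with a suffix rangelist (-1 meaning "no suffix"):
suffixes = decode_rangelist([[0, -3], [1, -3]])  # [0, 0, 0, 0, 1, 1, 1, 1]
print(["foo%d" % s for s in suffixes])           # foo0 x4, foo1 x4
```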
FWIW, when I gave some thought to it (flux-framework/flux-sched#526 (comment)), an insight I got was -- it would be best if the other keys can be expressed as some regular function of the primary key...
Oh, that would be ideal!
One thing I'm clear on after this discussion today though,
Should our canonical resource set representation would be the original uncondensed JGF and the various condensing optimization should be a raw storage or data layout or the condensed representation itself should our canonical representation...
Just a reminder that flux-core already depends on liblz4. I'm not sure it's clear it will be a win to trade computation/extra complexity for message size, but if we do go that way, I prefer we not take on another compression library dependency. lz4 does pretty well anyway:
$ lz4c jtest.json.txt
Compressed filename will be : jtest.json.txt.lz4
Compressed 1587 bytes into 427 bytes ==> 26.91%
$ lz4c prop.json.txt
Compressed filename will be : prop.json.txt.lz4
Compressed 692 bytes into 341 bytes ==> 49.28%
(3.71x and 2.02x respectively; with lz4c -9
, I get 4.42x and 2.26x)
Should our canonical resource set representation would be the original uncondensed JGF and the various condensing optimization should be a raw storage or data layout or the condensed representation itself should our canonical representation...
@dongahn could you recompile this sentence with different optimization please? :-)
Should our canonical resource set representation would be the original uncondensed JGF and the various condensing optimization should be a raw storage or data layout or the condensed representation itself should our canonical representation...
Here's my initial thought, though I don't claim to have the right answer. Fully specified JGF should be the default canonical representation. However, a simplified, condensed version (also valid JGF) should be allowed where there is no information loss. (Something simple and obvious like above).
Should our canonical resource set representation would be the original uncondensed JGF and the various condensing optimization should be a raw storage or data layout or the condensed representation itself should our canonical representation...
@dongahn could you recompile this sentence with different optimization please? :-)
Sorry. I guess the point I was trying to make: what should be our canonical representation -- the condensed form or un-condensed form. A compiler analogy: they have the canonical intermediate representation (IR), which then gets compiled down to machine code (actual storage format).
Just a reminder that flux-core already depends on liblz4. I'm not sure it's clear it will be a win to trade computation/extra complexity for message size, but if we do go that way, I prefer we not take on another compression library dependency. lz4 does pretty well anyway:
That's fine. The reason for the testing was just to see the relative advantages of two forms when "compressed".
Using your example:
The condensed form is 2.29x better in raw size (1587/692). But when compressed with lz4c, the condensed form is only 1.25x better (427/341). Since we will likely keep the object compressed, I wasn't sure if this 25% was worth the extra complexity. But like I said, the relative advantages would change at larger scale, hence my comment:
We may want to continue to test our proposed scheme to check the gains.
Hope this makes better sense.
Here's my initial thought, though I don't claim to have the right answer. Fully specified JGF should be the default canonical representation. However, a simplified, condensed version (also valid JGF) should be allowed where there is no information loss. (Something simple and obvious like above).
I don't have the right answer here either. BTW, don't get me wrong though. I'm asking all these questions to think this through. Hopefully we can settle on something really cool in the end :-).
Here's my initial thought, though I don't claim to have the right answer. Fully specified JGF should be the default canonical representation. However, a simplified, condensed version (also valid JGF) should be allowed where there is no information loss. (Something simple and obvious like above).
@grondo: Just to confirm, I like the hybrid approach like I said in the last meeting. In a compiler world, there is a difference between canonical vs. non-canonical representations, but we don't have to be too pedantic here.
In particular, at the system instance it should be straightforward to emit the "condensed" form either from a resource configuration spec or from other external sources.
At this point, I am unclear how easy or difficult it would be for Fluxion to emit the condensed form instead of the fully concretized JGF. But since compression and such was a task we needed to do anyway, having a target should be helpful. By making fully specified JGF the default representation, we will be able to take a phased approach and learn how to do this properly.
Two things:
1) The proposed form is very similar to GRUG (https://github.com/flux-framework/flux-sched/blob/master/resource/utilities/README.md#recipe-graph-definition). I used that format to specify a recipe to generate a fully concretized JGF. In fact, the first way to support RV2 from the system instance would be to use the new format as another generation recipe.
2) This issue isn't specific to RV2, but did you think about how to remap RV1 to the execution targets in a nested instance namespace?
My only idea here is to have the exec system emit an R that is annotated with the assigned task slots. A child instance can reasonably assume that task slot ids directly map to broker ranks. (Actually writing that, maybe it is the job shell that would need to annotate R?)
My only idea here is to have the exec system emit an R that is annotated with the assigned task slots. A child instance can reasonably assume that task slot ids directly map to broker ranks. (Actually writing that, maybe it is the job shell that would need to annotate R?)
Great idea.
If we were to go this route, I think we should consider explicitly formalizing the relationship between the task slot ID space of the parent instance and the execution target ID space of a nested instance (augmenting some RFC).
The other idea I was thinking about was for the nested instance to go through a "remap" step by comparing its overall RV2 with per execution target hwloc info. This would be similar to what you might do at the system instance.
But if the relationship between the task slot ID space of the parent and the execution target ID space of a nested instance can be made explicit and formalized, that would lead to a much more efficient implementation, I think.
How easy or difficult would it be to do this annotation directly in the condensed format?
@grondo:
It feels like we have had some good verbal exchanges so far; maybe we can start a (simplified) strawman RFC for RV2 and do some prototyping to test its viability.
My takeaway so far:
The other idea I was thinking about was for the nested instance to go through a "remap" step by comparing its overall RV2 with per execution target hwloc info. This would be similar to what you might do at the system instance.
This might be best, but I wasn't sure if it was a tractable problem! If we have a way to do it, then like you said the core resource module could annotate execution targets at instance startup in either the case of a system instance or child instance.
TBH, I'm not sure exactly what the best way is to add the execution target annotation to R yet. Would it be best to add a property to an existing resource (vertex), or would it make more sense to treat the execution target as a "grouping" vertex (i.e., a non-resource vertex)?
It feels like we have had some good verbal exchanges so far; maybe we can start a (simplified) strawman RFC for RV2 and do some prototyping to test its viability.
Great. Now that the most recent sched-simple PR is in I'm going to try to make some progress on Rv2.
In flux-core, we have a lot of users of the R_lite format that will need to transition to Rv2. My idea is to prototype a C API reader of some form of Rv2, and then add a function that can convert to R_lite as a transition tool.
Then we can begin to add functionality required by flux-core components (the `resource`, `job-exec`, `job-info`, and `sched-simple` modules, as well as the job shell), and transition these components to the new library, allowing the underlying R format to change or be updated without breaking core.
Once this is working we can then update the `resource` module to use Rv2 in the acquire protocol, which would allow us to break our dependence on all ranks being "online" before the first acquire response.
@dongahn, do you have any suggestions on how to do a task slot/execution target annotation to a resource set? It seems like you have to have some way to group resources, so either a tag on every resource in the slot, or would it be better to allow some kind of virtual resource group vertex (similar to how slot is specified in jobspec)?
This might be best, but I wasn't sure if it was a tractable problem!
Functionality-wise this is tractable. Scalability-wise, we may need more cleverness, I think.
We have a functionality proof of concept in our old scheduler (version 0.7). I called it `link`, since we link a rankless RDL-generated resource object to a rank by matching the resource signature between the RDL and hwloc objects. (I used simple match criteria, but this can be improved.) But considering the nested system, this should be called a `map` or `remap` operation.
This isn't that scalable because only one process does this operation.
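Roughly, the link operation looked like the following toy sketch (made-up structures and deliberately simple match criteria; the real criteria can be richer):

```python
def signature(info):
    """A crude per-node resource signature; real criteria could also
    include sockets, memory, GPU types, etc."""
    return (info.get("cores", 0), info.get("gpus", 0))

def link_ranks(rdl_nodes, hwloc_by_rank):
    """First-fit: assign each rankless RDL node a broker rank whose
    hwloc signature matches, consuming each rank once."""
    free = {rank: signature(info) for rank, info in hwloc_by_rank.items()}
    mapping = {}
    for node in rdl_nodes:
        rank = next(r for r, sig in free.items() if sig == signature(node))
        mapping[node["name"]] = rank
        del free[rank]
    return mapping

rdl = [{"name": "fluke60", "cores": 4}, {"name": "fluke61", "cores": 4}]
hwloc = {0: {"cores": 4}, 1: {"cores": 4}}
print(link_ranks(rdl, hwloc))  # {'fluke60': 0, 'fluke61': 1}
```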
This was discussed at https://github.com/flux-framework/flux-core/issues/2908#issuecomment-619153974, where a need was expressed for unifying static config, R, and other sources (hwloc and vendor-specific resource discovery services) into a canonical resource representation like the JSON Graph Format (JGF).
From @grondo: general worries about attempting to design "do everything" formats there, but the goal is lofty and he is willing.
From @dongahn: