Closed. grondo closed this issue 5 years ago.
Nice start, @grondo. Just wanted to add some distinctions... The R will contain node-resident resources that will become Rlocal
On the other hand, for resources for which we will have controls (e.g., license managers), these resources would become part of R but would only be relevant to specific agents like license managers, and not the brokers (unless we have a broker devoted to controlling a license manager).
Power and network bandwidth could fall into either of these two cases depending on whether a throttle was available to limit power or bandwidth serving the allocated resources.
Thanks, @lipari! You bring up some good points.
On the other hand, for resources for which we will have controls (e.g., license managers), these resources would become part of R but would only be relevant to specific agents like license managers, and not the brokers (unless we have a broker devoted to controlling a license manager).
These generic "global" job resources like licenses, burst buffer storage space, bandwidth, etc., will have to be passed in to some sort of containment management, in case there is some action required to give access to the licenses, or reserve space, etc. My thought is that these would be included in Rlocal, and then the container plugin specific to that resource type would be able to decide how to contain or make available that specific resource. For example, a simple approach would be to have the plugins on the first node of a job operate on these resources.
What remains to be decided is how these resources do get included in Rn for each IMP n. We may have to put some tag on these kinds of global resources so they are automatically included in any Rn.
I would also argue that Rlocal for any IMP should include not only the resource vertex(es) on which the IMP will be run, but also all the parents of the vertex up to the root (in the hierarchical resource tree). This will give the IMP containment plugins a bit more information about where they are running in the global hierarchy, which could be useful, and also allows us to keep global resources discussed above in their proper place in any hierarchy.
E.g. instead of a simple Rlocal like socket[0]->core[0-3]
for an IMP managing a single socket, you might instead have llnl->cluster[5](name=hype)->node[113](name=hype113)->socket[0]->core[0-3]
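As an illustration only (the `->` path syntax simply mirrors the example above; nothing here is a real Flux format), a sketch of what the extra ancestry buys a containment plugin: it can separate its own resources from the path that locates them in the global hierarchy.

```python
# Illustrative only: split an anchored Rlocal containment path into
# ancestry and local resources. The "->" syntax mirrors the example;
# it is not an actual Flux format.

def split_levels(rlocal_path):
    """Return the hierarchy levels of an Rlocal containment path."""
    return rlocal_path.split("->")

anchored = ("llnl->cluster[5](name=hype)->node[113](name=hype113)"
            "->socket[0]->core[0-3]")

levels = split_levels(anchored)
# The plugin's own resources are the tail; everything before is ancestry.
ancestry, local = levels[:-2], levels[-2:]
print(ancestry)  # ['llnl', 'cluster[5](name=hype)', 'node[113](name=hype113)']
print(local)     # ['socket[0]', 'core[0-3]']
```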
My thought is that these would be included in Rlocal, and then the container plugin specific to that resource type would be able to decide how to contain or make available that specific resource.
I would also argue that Rlocal for any IMP should include not only the resource vertex(es) on which the IMP will be run, but also all the parents of the vertex up to the root (in the hierarchical resource tree).
I'm worried that these new parts of the plan are undermining the original need/goal for Rlocal. If the Rlocal contains resource types that the IMP doesn't directly control, then we are back to the original situation we had with a single large R. The IMP now needs to parse out and identify only its locally relevant resource types from a larger tree of information. Rather than going that route, I wonder if it wouldn't make more sense to just go back to having a single complete R document that is sent everywhere including the IMP (although this time around we are deciding that the complete R no longer needs to be signed).
In other words, if the IMP needs to parse out potentially extraneous information from Rlocal and find the point where the information begins to align with its local resources, then it could do that just as easily from a large R. What is the value then of Rlocal?
In other words, if the IMP needs to parse out potentially extraneous information from Rlocal and find the point where the information begins to align with its local resources, then it could do that just as easily from a large R. What is the value then of Rlocal?
Rlocal allows the instance that is starting the IMP to control the shape of the container under which the IMP will execute the job shell, instead of relying on the IMP to make that decision, when it doesn't have or need the necessary data to make the correct decision about what goes in the local "container".
I guess where you might differ in opinion is whether the parents of a resource are part of that conceptual container. I tend to think a container that is just "cpu0" doesn't make any sense; you need at least node->socket[0]->cpu[0] to resolve the container. Taking that idea a bit further, in our resource model, node0 is not a valid container either; you need llnl->hype->node[0]. The IMP won't have any containment plugins that try to operate on resource types "datacenter", "node", or "switch", etc., so the extra resources will be safely ignored. However, if a containment plugin happens to need this information, it may at least be able to get it. (The location of off-node resources is the main use case I'm considering now.)
Another benefit of Rlocal is potentially eliminating dependence on flux-sched or flux-core provided resource query language that might be required to perform the intersection between local resources and global R (though even if you had this support for the IMP, I'm not convinced the IMP alone could make the right decision here). To realize this particular goal, Rlocal will need to be simple enough that the IMP or its plugins could parse it easily themselves.
A specific case where Rlocal might be required is if an instance, for testing or other good reason, would like to start more than one IMP per broker. To do this, the instance would break up local resources into multiple Rlocal and pass to each IMP. I don't see how it would be possible if the instance passed the global R to each IMP.
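A hypothetical sketch of that multi-IMP case (the field names follow no agreed-upon specification; this is just to make the idea concrete): the instance splits one node's resources round-robin into several Rlocal documents, one per IMP.

```python
# Hypothetical sketch: an instance splitting one node's local resources
# into several Rlocal documents, one per IMP. The Rlocal layout here is
# invented for illustration.

def split_rlocal(local, n_imps):
    """Divide a flat local cpu set round-robin across n_imps IMPs."""
    parts = [{"cpu": {"list": []}} for _ in range(n_imps)]
    for i, cpu in enumerate(local["cpu"]["list"]):
        parts[i % n_imps]["cpu"]["list"].append(cpu)
    for part in parts:
        part["cpu"]["count"] = len(part["cpu"]["list"])
    return parts

rlocal = {"cpu": {"list": [0, 1, 2, 3], "count": 4}}
print(split_rlocal(rlocal, 2))
# [{'cpu': {'list': [0, 2], 'count': 2}}, {'cpu': {'list': [1, 3], 'count': 2}}]
```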
BTW, I was mainly taking a long-term view on inclusion of parent resources in Rlocal, and as long as it is possible to add that support in at a future date, I'm ok with leaving it out for now. I think off-node resources like burst-buffer space and licenses could be handled by including enough metadata in those resources included in Rlocal such that a plugin could know exactly which licenses it was operating on, or which burst buffers it was reserving space in... etc.
A specific case where Rlocal might be required is if an instance, for testing or other good reason, would like to start more than one IMP per broker.
Ah. I thought we agreed that we weren't doing that, that instead the IMP always controls all resources on the node and leaves further resource masking to the jobshell. But if we are reversing that I suppose that is fine.
I think off-node resources like burst-buffer space and licenses could be handled by including enough metadata in those resources included in Rlocal such that a plugin could know exactly which licenses it was operating on, or which burst buffers it was reserving space in... etc.
If we think that the IMP is going to need to know about global resources, then again I think I'm back to thinking we should just send the full R. We can always add annotations to that if we ever want to run multiple IMPs per node. Adding global resources to Rlocal makes the name a bit of a misnomer. :)
Another benefit of Rlocal is potentially eliminating dependence on flux-sched or flux-core provided resource query language that might be required to perform the intersection between local resources and global R
The IMP reading a global R does not necessarily imply that the IMP must use a "resource query language". The IMP just needs a parser. The parser for R and Rlocal, if Rlocal is allowed to contain global resources, will be very nearly identical I think. We can choose to implement the parser twice or cut-and-paste it into the IMP if we want to keep it separate.
If we think that the IMP is going to need to know about global resources, then again I think I'm back to thinking we should just send the full R. We can always add annotations to that if we ever want to run multiple IMPs per node. Adding global resources to Rlocal makes the name a bit of a misnomer. :)
Yeah, I completely understand your sentiment. I'm fine with leaving Rlocal with only "local" resources (whatever "local" may mean), but then we have no proposed method to handle off-node resources (since IMP can only run at most within a node).
I thought we had talked about that. I think that the "execution management" module, or whatever we are calling it now, would handle off-compute-node resource setup before launching remote execution and the IMPs. There would be plugins into that module that can instantiate the various resources that people come up with.
The IMP reading a global R does not necessarily imply that the IMP must use a "resource query language". The IMP just needs a parser. The parser for R and Rlocal, if Rlocal is allowed to contain global resources, will be very nearly identical I think. We can choose to implement the parser twice or cut-and-paste it into the IMP if we want to keep it separate.
The IMP will need to parse R, but then how does it complete the intersection between locally available resources and R? It would need to generate an R' from hwloc or some other local HW query code, then take the intersection of R' and R.
If only Rlocal is sent to the IMP, it doesn't need to read the local HW configuration, it doesn't need to generate a second R' from that information, and it doesn't need to do the work of the intersection. So that feels like quite a bit of code saved from a security-significant piece of software.
I thought we had talked about that. I think that the "execution management" module, or whatever we are calling it now, would handle off-compute-node resource setup before launching remote execution and the IMPs. There would be plugins into that module that can instantiate the various resources that people come up with.
That could work, but the instance doesn't have any privilege except through the IMP. Is it a requirement that all off-node resources not require privilege to access? (This is possible, I just didn't think of it that way before.)
The IMP will need to parse R, but then how does it complete the intersection between locally available resources and R? It would need to generate an R' from hwloc or some other local HW query code, then take the intersection of R' and R.
I think it is actually a lot simpler than that if there is just a single IMP per node. It just walks the tree of data in R, and looks up each resource it sees in its internal table "oh! that belongs to me, I'll note that", "nope that doesn't belong to me, skip it". There is no complicated intersection needed, really.
There is no complicated intersection needed, really.
Ok, I guess I couldn't visualize how to make it quite that simple.
Whereas with Rlocal, the IMP would walk each type of resource for which it has a containment plugin and hand the list of those resources in Rlocal to the plugin (or alternately, each plugin could generate the list itself). No comparisons needed at all. Since there won't be containment plugins for "node", "switch", "datacenter", and other such resources, those would be safely ignored if they were there at all.
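An illustrative sketch of that dispatch (the plugin interface and Rlocal layout are assumptions, not an existing API): the IMP walks the resource types in Rlocal and hands each list to the matching containment plugin, silently skipping types with no plugin.

```python
# Illustrative sketch of the dispatch described above: walk resource
# types in Rlocal, hand each to its containment plugin, and silently
# skip types ("node", "switch", ...) for which no plugin exists.
# Plugin shapes and the Rlocal layout are invented for this sketch.

def contain(rlocal, plugins):
    """Apply each available containment plugin to its resource type."""
    handled = {}
    for rtype, resources in rlocal.items():
        plugin = plugins.get(rtype)
        if plugin is None:
            continue  # no containment plugin for this type: safely ignored
        handled[rtype] = plugin(resources)
    return handled

plugins = {"cpu": lambda r: sorted(r["list"])}
rlocal = {"node": {"list": [0]}, "cpu": {"list": [3, 1, 0, 2]}}
print(contain(rlocal, plugins))  # {'cpu': [0, 1, 2, 3]}
```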
That could work, but the instance doesn't have any privilege except through the IMP. Is it a requirement that all off-node resources not require privilege to access? (This is possible, I just didn't think of it that way before.)
It would be preferable when possible. But it can be handled on a case-by-case basis.
Doing this through the IMP could potentially introduce a fair bit of complexity. We might be back to needing a way to track authority back through multiple levels of flux instances. We were able to avoid that when the IMP was constrained to dealing with resources inside of its local node's container.
Whereas with Rlocal, the IMP would walk each type of resource for which it has a containment plugin and hand the list of those resources in Rlocal to the plugin
Yeah, actually it is slightly more complicated than I stated, but not much. In each case where it finds a resource it owns, it needs to remember that AND all of the resources under it in the tree. But that is still pretty straightforward, I think.
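A rough sketch of the walk described above (the tree shape and the "name" field are invented for illustration): scan the global R tree, and keep each vertex this node owns plus everything under it, while also retaining the ancestry path down to it.

```python
# Rough sketch of the walk described above: keep each vertex this node
# owns AND all of the resources under it in the tree. The tree shape
# and the "name" field are invented for illustration only.

def prune_to_local(vertex, hostname, inside=False):
    """Return the subtree of R relevant to hostname, or None."""
    mine = inside or vertex.get("name") == hostname
    children = [prune_to_local(c, hostname, mine)
                for c in vertex.get("children", [])]
    children = [c for c in children if c is not None]
    if mine or children:
        return dict(vertex, children=children)
    return None  # nothing local anywhere under this vertex

R = {"name": "cluster0", "children": [
        {"name": "node0", "children": [{"name": "core0", "children": []}]},
        {"name": "node1", "children": [{"name": "core1", "children": []}]}]}
local = prune_to_local(R, "node0")
# local keeps cluster0 -> node0 -> core0 and drops the node1 subtree
```

Note this variant also preserves the parent path (cluster0), matching the earlier point about keeping ancestry information available.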
I don't think the IMP can only look at types, even for an Rlocal. It needs to look at either names or counts too. Flux allows two approaches to something like a "socket": we can either represent all of the sockets on a node as a single resource vertex and use the count within that vertex to represent all of the sockets, or we can have a resource vertex with a name/id/uuid/whatever for each of the sockets.
In the latter case, with individual resource vertices, the scheduler picks the exact resources, and the IMP just needs to carry out the instructions. In the former, with counted resources, the IMP needs to be more aware of what is happening with allocations on the node (for instance, if nodes are shared). But actually, I'm not sure that the IMP can read the scheduler's mind well enough to always make the same selection pattern... and that could lead to resources (sockets/cores) being shared that the scheduler could have avoided. So actually, for things like sockets and cores I suspect that we will always use separate resource vertices.
Since there won't be containment plugins for "node" "switch" "datacenter", and other resources, those would be safely ignored if they were there at all.
I think it is reasonable for the IMP to know what node it is on. And once it knows that, it will be fairly easy to pick out its own resources from the global R.
Doing this through the IMP could potentially introduce a fair bit of complexity. We might be back to needing a way to track authority back through multiple levels of flux instances. We were able to avoid that when the IMP was constrained to dealing with resources inside of its local node's container.
I'm not sure that is required, but either way you need to verify ownership of the resource whether it is through an IMP plugin or a plugin in the execution system.
One example I can think of is if you restricted access to a license server through some sort of iptables rules. For jobs granted access to the license server you would have to allow iptables rules to be modified in the network namespace of the job. I can envision how this could be done with an IMP plugin, but not at all what kind of system you'd need if you tried to do it through a plugin in the unprivileged execution modules. Besides, the containers can't exist until the IMP runs so there is an ordering problem there.
For this case would you consider licenses a "local" resource, or perhaps rename them network access tokens or something? (If that is the case then I could see that this scheme could work)
I think it is reasonable for the IMP to know what node it is on. And once it knows that, it will be fairly easy to pick out its own resources from the global R.
Ok, but why? I guess it is immaterial if we pass R or Rlocal since the IMP always filters R to Rlocal anyway (it will just be a noop in the second case). The question I keep struggling with is why you'd want to do that work in your privileged process if you don't have to?
I don't think the IMP can only look at types even for an Rlocal. It needs to look at either names or counts too.
Yes, that is what I meant. Each containment plugin will need to know the list and count (especially for RAM) of each of the resource types it knows how to deal with.
Here's my proposal for a simplified Rlocal to satisfy near-term milestones, if it is acceptable that R and Rlocal have different specifications.
The simplified Rlocal as input to the IMP will be a JSON document with a list of resource types for which the IMP should create a container. The IMP will support a list of plugins that operate on one or more of these types and access Rlocal directly to determine the parameters of the containers they can create. E.g., a memory and cpu "cgroups" container would read the "socket", "cpu", and "memory" fields of the Rlocal dictionary, add cpus and mems to a cpuset cgroup, and constrain memory with a memory cgroup.
The format of Rlocal might look (very roughly), something like:
{
  "cpu": { "list": [0, 1, 2, 3], "count": 4 },
  "socket": { "list": [0], "count": 1 },
  "memory": { "count": 1024, "units": "MB" }
}
This is just off the cuff so there may be missing fields, but is meant to give a general idea.
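To make the plugin side of this concrete, here is a minimal sketch (assuming the simplified Rlocal dictionary above; the cgroup parameter names are the classic cgroup-v1 file names, used here purely for illustration). It only computes the values such a plugin would write; no actual cgroup interaction is attempted.

```python
# Minimal sketch of the "cgroups" plugin behavior described above,
# assuming the simplified Rlocal dictionary format proposed here.
# It only computes the values a real plugin would write into
# cpuset/memory cgroup files; no cgroup interaction is attempted.

def cgroup_params(rlocal):
    """Map a simplified Rlocal onto cpuset/memory cgroup settings."""
    assert rlocal["memory"].get("units") == "MB"  # only MB in this sketch
    return {
        "cpuset.cpus": ",".join(str(c) for c in rlocal["cpu"]["list"]),
        "cpuset.mems": ",".join(str(s) for s in rlocal["socket"]["list"]),
        "memory.limit_in_bytes": rlocal["memory"]["count"] * 1024 * 1024,
    }

rlocal = {
    "cpu": {"list": [0, 1, 2, 3], "count": 4},
    "socket": {"list": [0], "count": 1},
    "memory": {"count": 1024, "units": "MB"},
}
print(cgroup_params(rlocal))
```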
If the IMP must take the full R as input, then I'd suggest a plugin to the IMP, provided by the instance, that would generate this format as input to the IMP containment plugin infrastructure. That would further require that the IMP operate in privilege-separation mode so that the plugin operating on R runs with the permissions of the instance owner. This would avoid copy-and-paste parsing code between flux-framework projects, and allow a single, system-installed version of flux-security to work with multiple versions of other Flux projects that may generate R with different formats or capabilities.
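A sketch of that privilege-separation pattern (the R field names "execution"/"cpus" are invented here, and the actual setuid() to the instance owner is only indicated by a comment): the notionally privileged parent never parses the untrusted, complex R itself; an unprivileged child reduces R to the small simplified form, and the parent defensively validates only that form.

```python
# Sketch of the privilege-separation idea above: an unprivileged child
# parses the untrusted R and emits the tiny simplified form; the parent
# validates only that form. Field names are invented for this sketch.

import json
import subprocess
import sys

CHILD = r"""
import json, sys
# A real IMP would setuid() to the instance owner here before parsing.
R = json.load(sys.stdin)                            # untrusted input
print(json.dumps({"cpu": R["execution"]["cpus"]}))  # simplified "Rlocal"
"""

def parse_r_unprivileged(r_text):
    out = subprocess.run([sys.executable, "-c", CHILD], input=r_text,
                         capture_output=True, text=True, check=True).stdout
    simplified = json.loads(out)
    # Parent-side defensive validation of the simple format only:
    if set(simplified) != {"cpu"} or \
            not all(isinstance(c, int) for c in simplified["cpu"]):
        raise ValueError("child produced malformed simplified R")
    return simplified

r = json.dumps({"execution": {"cpus": [0, 1]}})
print(parse_r_unprivileged(r))  # {'cpu': [0, 1]}
```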
The question I keep struggling with is why you'd want to do that work in your privileged process if you don't have to?
Me too! Which is why adding non-local things to Rlocal that need to be skipped seems like exactly the same kind of processing in a privileged process that we were trying to avoid by having Rlocal in the first place. I don't see much difference between skipping 3 things or skipping 1000. The parsing code needs to be rock solid in either case, so the implementation work seems the same. But if they are the same, then why put the extra effort into implementing Rlocal in the first place?
One example I can think of is if you restricted access to a license server through some sort of iptables rules.
That is interesting...is that how it usually works? I was thinking more that some number of floating licenses would be allocated to an entire job, and they could choose to use them where they like. But I guess it depends on the sophistication of the particular license server.
That is interesting...is that how it usually works? I was thinking more that some number of floating licenses would be allocated to an entire job, and they could choose to use them where they like. But I guess it depends on the sophistication of the particular license server.
I actually don't know, but it was a real proposal from somewhere (but not sure if it was ever implemented).
Me too! Which is why adding non-local things to Rlocal that need to be skipped seems like exactly the same kind of processing in a privileged process that we were trying to avoid by having Rlocal in the first place. I don't see much difference between skipping 3 things or skipping 1000. The parsing code needs to be rock solid in either case, so the implementation work seems the same. But if they are the same, then why put the extra effort into implementing Rlocal in the first place?
You do make good points. I guess you are implementing Rlocal in either case; it is just a question of where it is generated. I say you have my consensus that non-local things should not go into Rlocal, and let's leave it at that for now. However, also take a look at my proposal above for possibly using instance-provided code to parse R from within the IMP by running it in an unprivileged child.
Either way, I don't think we're closing the door on including either more or less resources in the R in future implementations, so I'm thinking we can move on for now...
However, also take a look at my proposal above for possibly using instance-provided code to parse R from within the IMP by running it in an unprivileged child.
The devil is in the details there. At some point the IMP needs to ingest resource data in some form from an untrusted source. It could either be by parsing a resource document itself, or from parsing the output of something that parses the resource document. But at some point it always needs to be defensive and validate its input. I would need to know more to evaluate it, I think.
I'm not really clear why parsing R seems more scary than parsing Rlocal, or why it would necessarily need to be handled separately through plugins and/or privilege separation. I think R could be handled by an internal parser in the IMP exactly the same way that Rlocal is being proposed to be handled.
I'm not really clear why parsing R seems more scary than parsing Rlocal, or why it would necessarily need to be handled separately through plugins and/or privilege separation. I think R could be handled by an internal parser in the IMP exactly the same way that Rlocal is being proposed to be handled.
Ok, again I understand your point.
I guess I'm arguing that Rlocal as used internally by the IMP is a different, much simpler format than R (sorry, maybe it shouldn't be called Rlocal anymore?). The amount of code being used would therefore be smaller, and therefore there would be provably fewer bugs.
The Rlocal format could evolve much more slowly than R, though I admit it hasn't been proven that R will change at any kind of pace that would require frequent updates to the flux-security project, so perhaps this is a weaker argument.
Also it just kind of seems to make sense to send less data to the IMP, even though this doesn't have a security argument. For a job with 1000 cores on 1000 nodes, R is potentially 1000x the size of Rlocal....
Also it just kind of seems to make sense to send less data to the IMP, even though this doesn't have a security argument. For a job with 1000 cores on 1000 nodes, R is potentially 1000x the size of Rlocal....
Totally agree.
To summarize, I think consensus here is that Rlocal should contain only node local resources, but that it is still useful to send only a subset of R, perhaps in a simpler format, as input to the IMP.
As far as the topic of this issue, which is the specification of R, I don't think that changes much. We still need to be able to generate some Rlocal from R, and therefore we'll need some kind of format of R that allows this within an instance.
To summarize, I think consensus here is that Rlocal should contain only node local resources, but that it is still useful to send only a subset of R, perhaps in a simpler format, as input to the IMP.
IMHO this approach is sound.
As I already discussed with @grondo, extracting Rlocal should be fully distributed so that a centralized component doesn't become a scalability bottleneck.
As for the many execution service modules needing to fetch R to extract Rlocal, I believe this should be scalable, as it would essentially have the performance complexity of a broadcast... At some point we may want to measure this, though.
Sorry if this was too obvious.
I would like to have a bit more discussion on the main topic of this issue: the R format. As @grondo nicely captured at the beginning, R will serve as the input and output of a range of components. For example, it will be the input to the resource service (the resource-selection part of scheduling) in a nested instance, as well as to the remote execution service and the job shell. Similarly, it will be the output of the resource service, and also of other related services, utilities, and even manual effort (e.g., the resource service; resrc).
It seems an important decision we can all benefit from at this point would be whether we want to spec out a common R format, or to go with an opaque approach with just common abstractions on it agreed upon.
W/o looking at this too closely, if we go with a graph format with an optional ability to annotate extra information on resource vertices and edges (i.e., the concepts described in RFC4), we should be able to describe the format captured in all of the above use cases, and I can contribute to that effort based on my current resource-query experience.
But a part of me is also wondering whether this kind of rigor on the format is necessary at this point.
Another approach could be to require, for each of the format variants, a library that exposes a set of common operations, including a "reader" and a "writer."
The former would be a bit more rigorous, but at the same time it could be a bit more time consuming. But maybe it is something that we ought to do anyway.
There is also a third possibility, which is to start with the opaque approach above; as we reach agreement on the common abstractions, we will come to know the requirements on the format better, and at that point we can formalize it. Since it seems we will have to write libraries that expose those abstractions around R anyway, this is not a bad idea IMHO either.
Thoughts?
Thanks @dongahn. Some very good thoughts above.
Another element we need to keep in mind is how the various components of Flux would manage the dependencies involved in interpreting and managing R. Ideally, perhaps, the R format would be supported directly by flux-core, so that the execution system, which depends on it, can be tested stand-alone. However, this approach might lead to a lowest-common-denominator format, which may not support the needs of advanced resource services and/or schedulers.
An argument might also be made that the R format is solely the domain of a resource service, and the R interpreter and generator should therefore be offered by that service, though that would leave the execution service dependent on resource services being installed, which might not be what we want.
Another approach would be to keep the R spec opaque, as you said, with each type supplying a corresponding API that satisfies the requirements of all use cases outside of resource service internals. Somehow the required implementation would be encoded in R itself, and the correct implementation loaded at runtime.
One more idea would be to have a very basic R specification, but allow a section for "extensions" which might be ignored by most components, but used for any extra information needed by the resource service itself. The base R spec might not even need an API if it was simple enough, thereby removing the pain of deciding where the dependent libraries might live.
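One way to read that "extensions" idea, as a sketch (every key here is invented for illustration; none of this is an agreed-upon R layout): components that only understand the base specification use the base keys and carry the extensions blob through untouched for the resource service.

```python
# Sketch of the base-spec-plus-extensions idea above: a component that
# only understands the base R keys uses those and treats the
# "extensions" section as an opaque pass-through. All keys invented.

BASE_KEYS = {"version", "resources"}

def read_base(r):
    """Split R into the base section and the opaque extensions."""
    base = {k: v for k, v in r.items() if k in BASE_KEYS}
    extensions = r.get("extensions", {})  # ignored by most components
    return base, extensions

R = {"version": 1,
     "resources": [{"type": "node", "count": 4}],
     "extensions": {"sched": {"policy": "hypothetical"}}}
base, ext = read_base(R)
```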
Of the four possibilities, the last two seem most attractive from my perspective. A bit on the 4th option:
One more idea would be to have a very basic R specification, but allow a section for "extensions" which might be ignored by most components, but used for any extra information needed by the resource service itself. The base R spec might not even need an API if it was simple enough, thereby removing the pain of deciding where the dependent libraries might live.
This is a very interesting idea from my perspective, @grondo. I consider the resource representation needed by flux-core elements a "proper" subset of the representation needed by resource, so this can work out nicely if we can reasonably separate out the baseline from the extension. I don't know if "section" is the right construct, but I got the idea.
Why don't I put up a few examples of the graph representations I plan to use in resource, and see what belongs to the baseline and what belongs to the core elements, and whether these are easily separable. I will use GraphML, but any other markup language capable of describing a graph would do as well.
sounds good @dongahn! Thanks!
Here is a GraphML example that describes an R with 1 cluster containing 1 rack with 1 node with 2 sockets, each with 2 cores, 1 GPU, and memory. This was actually emitted from my resource-query utility. I will explain parts of this a bit in a separate posting as relevant to this ticket.
The general description of GraphML itself can be found in something like this. The reason I used GraphML is that our resource model is essentially a graph, and as such I didn't think I needed to reinvent the wheel with other markup languages. Plus, there are already plenty of GraphML readers and writers out there, including the Boost Graph Library.
<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
<key id="basename" for="node" attr.name="basename" attr.type="string" />
<key id="id" for="node" attr.name="id" attr.type="long" />
<key id="subsystems" for="node" attr.name="member_of" attr.type="string" />
<key id="esubsystems" for="edge" attr.name="member_of" attr.type="string" />
<key id="name" for="node" attr.name="name" attr.type="string" />
<key id="paths" for="node" attr.name="paths" attr.type="string" />
<key id="props" for="node" attr.name="props" attr.type="string" />
<key id="size" for="node" attr.name="size" attr.type="int" />
<key id="type" for="node" attr.name="type" attr.type="string" />
<key id="unit" for="node" attr.name="unit" attr.type="string" />
<graph id="G" edgedefault="directed" parse.nodeids="canonical" parse.edgeids="canonical" parse.order="nodesfirst">
<node id="n0">
<data key="basename">small</data>
<data key="id">0</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">small0</data>
<data key="paths">{containment: "/small0"}</data>
<data key="props"></data>
<data key="size">1</data>
<data key="type">cluster</data>
<data key="unit"></data>
</node>
<node id="n1">
<data key="basename">rack</data>
<data key="id">0</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">rack0</data>
<data key="paths">{containment: "/small0/rack0"}</data>
<data key="props"></data>
<data key="size">1</data>
<data key="type">rack</data>
<data key="unit"></data>
</node>
<node id="n2">
<data key="basename">node</data>
<data key="id">0</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">node0</data>
<data key="paths">{containment: "/small0/rack0/node0"}</data>
<data key="props"></data>
<data key="size">1</data>
<data key="type">node</data>
<data key="unit"></data>
</node>
<node id="n3">
<data key="basename">socket</data>
<data key="id">0</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">socket0</data>
<data key="paths">{containment: "/small0/rack0/node0/socket0"}</data>
<data key="props"></data>
<data key="size">1</data>
<data key="type">socket</data>
<data key="unit"></data>
</node>
<node id="n4">
<data key="basename">socket</data>
<data key="id">1</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">socket1</data>
<data key="paths">{containment: "/small0/rack0/node0/socket1"}</data>
<data key="props"></data>
<data key="size">1</data>
<data key="type">socket</data>
<data key="unit"></data>
</node>
<node id="n5">
<data key="basename">core</data>
<data key="id">0</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">core0</data>
<data key="paths">{containment: "/small0/rack0/node0/socket0/core0"}</data>
<data key="props"></data>
<data key="size">1</data>
<data key="type">core</data>
<data key="unit"></data>
</node>
<node id="n6">
<data key="basename">core</data>
<data key="id">1</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">core1</data>
<data key="paths">{containment: "/small0/rack0/node0/socket0/core1"}</data>
<data key="props"></data>
<data key="size">1</data>
<data key="type">core</data>
<data key="unit"></data>
</node>
<node id="n7">
<data key="basename">core</data>
<data key="id">2</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">core2</data>
<data key="paths">{containment: "/small0/rack0/node0/socket1/core2"}</data>
<data key="props"></data>
<data key="size">1</data>
<data key="type">core</data>
<data key="unit"></data>
</node>
<node id="n8">
<data key="basename">core</data>
<data key="id">3</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">core3</data>
<data key="paths">{containment: "/small0/rack0/node0/socket1/core3"}</data>
<data key="props"></data>
<data key="size">1</data>
<data key="type">core</data>
<data key="unit"></data>
</node>
<node id="n9">
<data key="basename">gpu</data>
<data key="id">0</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">gpu0</data>
<data key="paths">{containment: "/small0/rack0/node0/socket0/gpu0"}</data>
<data key="props"></data>
<data key="size">1</data>
<data key="type">gpu</data>
<data key="unit"></data>
</node>
<node id="n10">
<data key="basename">gpu</data>
<data key="id">1</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">gpu1</data>
<data key="paths">{containment: "/small0/rack0/node0/socket1/gpu1"}</data>
<data key="props"></data>
<data key="size">1</data>
<data key="type">gpu</data>
<data key="unit"></data>
</node>
<node id="n11">
<data key="basename">memory</data>
<data key="id">0</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">memory0</data>
<data key="paths">{containment: "/small0/rack0/node0/socket0/memory0"}</data>
<data key="props"></data>
<data key="size">4</data>
<data key="type">memory</data>
<data key="unit"></data>
</node>
<node id="n12">
<data key="basename">memory</data>
<data key="id">0</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">memory0</data>
<data key="paths">{containment: "/small0/rack0/node0/socket1/memory0"}</data>
<data key="props"></data>
<data key="size">4</data>
<data key="type">memory</data>
<data key="unit"></data>
</node>
<edge id="e0" source="n0" target="n1">
<data key="esubsystems">{containment: "contains"}</data>
</edge>
<edge id="e1" source="n1" target="n0">
<data key="esubsystems">{containment: "in"}</data>
</edge>
<edge id="e2" source="n1" target="n2">
<data key="esubsystems">{containment: "contains"}</data>
</edge>
<edge id="e3" source="n2" target="n1">
<data key="esubsystems">{containment: "in"}</data>
</edge>
<edge id="e4" source="n2" target="n3">
<data key="esubsystems">{containment: "contains"}</data>
</edge>
<edge id="e5" source="n2" target="n4">
<data key="esubsystems">{containment: "contains"}</data>
</edge>
<edge id="e6" source="n3" target="n2">
<data key="esubsystems">{containment: "in"}</data>
</edge>
<edge id="e7" source="n3" target="n5">
<data key="esubsystems">{containment: "contains"}</data>
</edge>
<edge id="e8" source="n3" target="n6">
<data key="esubsystems">{containment: "contains"}</data>
</edge>
<edge id="e9" source="n3" target="n9">
<data key="esubsystems">{containment: "contains"}</data>
</edge>
<edge id="e10" source="n3" target="n11">
<data key="esubsystems">{containment: "contains"}</data>
</edge>
<edge id="e11" source="n4" target="n2">
<data key="esubsystems">{containment: "in"}</data>
</edge>
<edge id="e12" source="n4" target="n7">
<data key="esubsystems">{containment: "contains"}</data>
</edge>
<edge id="e13" source="n4" target="n8">
<data key="esubsystems">{containment: "contains"}</data>
</edge>
<edge id="e14" source="n4" target="n10">
<data key="esubsystems">{containment: "contains"}</data>
</edge>
<edge id="e15" source="n4" target="n12">
<data key="esubsystems">{containment: "contains"}</data>
</edge>
<edge id="e16" source="n5" target="n3">
<data key="esubsystems">{containment: "in"}</data>
</edge>
<edge id="e17" source="n6" target="n3">
<data key="esubsystems">{containment: "in"}</data>
</edge>
<edge id="e18" source="n7" target="n4">
<data key="esubsystems">{containment: "in"}</data>
</edge>
<edge id="e19" source="n8" target="n4">
<data key="esubsystems">{containment: "in"}</data>
</edge>
<edge id="e20" source="n9" target="n3">
<data key="esubsystems">{containment: "in"}</data>
</edge>
<edge id="e21" source="n10" target="n4">
<data key="esubsystems">{containment: "in"}</data>
</edge>
<edge id="e22" source="n11" target="n3">
<data key="esubsystems">{containment: "in"}</data>
</edge>
<edge id="e23" source="n12" target="n4">
<data key="esubsystems">{containment: "in"}</data>
</edge>
</graph>
</graphml>
As you can see from:
<key id="basename" for="node" attr.name="basename" attr.type="string" />
<key id="id" for="node" attr.name="id" attr.type="long" />
<key id="subsystems" for="node" attr.name="member_of" attr.type="string" />
<key id="esubsystems" for="edge" attr.name="member_of" attr.type="string" />
<key id="name" for="node" attr.name="name" attr.type="string" />
<key id="paths" for="node" attr.name="paths" attr.type="string" />
<key id="props" for="node" attr.name="props" attr.type="string" />
<key id="size" for="node" attr.name="size" attr.type="int" />
<key id="type" for="node" attr.name="type" attr.type="string" />
<key id="unit" for="node" attr.name="unit" attr.type="string" />
a resource pool vertex contains 9-10 base fields (uuid omitted for now):
basename
id
name
paths (JSON dictionary string)
properties (JSON dictionary string)
size
type
unit
subsystems (JSON dictionary whose keys are the subsystems this vertex belongs to)
For example, the following resource vertex construct describes a socket resource (socket0) contained within node0. Note that &quot; is the XML encoding of the double-quote character ("):
<node id="n3">
<data key="basename">socket</data>
<data key="id">0</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">socket0</data>
<data key="paths">{containment: "/small0/rack0/node0/socket0"}</data>
<data key="props"></data>
<data key="size">1</data>
<data key="type">socket</data>
<data key="unit"></data>
</node>
The following vertex construct describes a core resource (core0) contained within socket0:
<node id="n5">
<data key="basename">core</data>
<data key="id">0</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">core0</data>
<data key="paths">{containment: "/small0/rack0/node0/socket0/core0"}</data>
<data key="props"></data>
<data key="size">1</data>
<data key="type">core</data>
<data key="unit"></data>
</node>
Each edge construct describes a directional relationship between two resource vertices and has one data field. For example, the following describes a relational edge from the socket0 vertex (whose vertex id is n3) to the core0 vertex (n5).
<edge id="e7" source="n3" target="n5">
<data key="esubsystems">{containment: "contains"}</data>
</edge>
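To make the discussion concrete, here is one way a consumer could pull these node and edge constructs into plain data structures using only a standard XML parser (Python's xml.etree shown; the fragment below is abbreviated from the graph above, and the dict layout is just an illustrative choice, not a specified API):

```python
import xml.etree.ElementTree as ET

# Abbreviated fragment of the GraphML above (namespace declarations omitted).
fragment = """
<graph>
  <node id="n3">
    <data key="basename">socket</data>
    <data key="name">socket0</data>
    <data key="paths">{containment: "/small0/rack0/node0/socket0"}</data>
    <data key="type">socket</data>
  </node>
  <node id="n5">
    <data key="basename">core</data>
    <data key="name">core0</data>
    <data key="paths">{containment: "/small0/rack0/node0/socket0/core0"}</data>
    <data key="type">core</data>
  </node>
  <edge id="e7" source="n3" target="n5">
    <data key="esubsystems">{containment: "contains"}</data>
  </edge>
</graph>
"""

root = ET.fromstring(fragment)

# Collect each vertex's data fields into a plain dict keyed by vertex id.
vertices = {
    node.get("id"): {d.get("key"): (d.text or "") for d in node.findall("data")}
    for node in root.findall("node")
}

# Collect edges as (source, target) pairs.
edges = [(e.get("source"), e.get("target")) for e in root.findall("edge")]

print(vertices["n5"]["name"])  # core0
print(edges)                   # [('n3', 'n5')]
```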
Now, my guess is that anything but the subsystems and esubsystems fields would be the baseline R needed by execution systems and other generators. But we should discuss.
In general, I believe an R capable of describing a graph would be the most expressive way to support a wide range of producers and consumers. For example, a nested resource service can simply pass the R from the guest KVS to read_graphml to populate its graph data store before starting the nested scheduling service.
Other generators, like an hwloc adapter, can also easily emit their information in this format. hwloc has no concept of subsystems, however, so the subsystems and esubsystems fields would be empty.
One thing that bothers me a little is that this encodes graph data somewhat redundantly. But that is because the preexisting GraphML writer emits the graph in this manner; it could be compressed to a good degree if we decide to enhance our own writers.
While GraphML makes sense as a resource representation to the scheduler, I think the resources passed to the execution engine on each node (still called Rlocal
Thanks @lipari. I believe that the stated plan is to do containment via a plugin to the IMP and binding at the job-shell level. Both will need adaptors to extract the info they require from R.
@grondo's proposal for Rlocal is up there. How the job shell will access R was discussed yesterday, and there is a PR #114 as a result of that.
BTW, as I remember, hwloc also supports reading from and writing to XML. My guess is that translating the entire R to this XML format should be pretty straightforward.
Sorry, @dongahn. I read the PR 114 discussion of mods to RFC 16 as describing how the GraphML formatted R is represented in the KVS. It is unclear to me where we have specified the work of translating a GraphML R to the Rlocal
To clarify... I think it is asking a lot of the IMP to break a GraphML-formatted R down to Rlocal
We are open to alternatives, of course.
If the node-local resource discovery speed is an issue, I suppose we can add some index information so that node-local information can be found more expediently...
What is your alternative, @lipari?
... translating a GraphML R to the Rlocal form and where the Rlocal's format is defined. I thought I read all the discussion, but I probably missed it.
https://github.com/flux-framework/rfc/issues/109#issuecomment-334306854 would be closest, I think.
Sorry, just catching up on this.
@lipari, some clarification: the execution subsystem creates Rlocal, but it is part of the enclosing instance and therefore can use instance services to create this representation. The exec service doesn't do binding, that is completed by the IMP, and therefore Rlocal will be in a format defined by the IMP, likely something very simple, not GraphML.
109 (comment) would be closest, I think.
Ah, thank you for the reminder.
What is your alternative, @lipari?
@dongahn, you are proposing GraphML to be the form of R. And I agree that is a sound proposal for the reasons you stated. But to be confident that GraphML is the best choice, it behooves us to consider how easily GraphML can be parsed to generate the node-focused Rlocal
This will facilitate the binding the execution engine needs to do, leveraging the power of the hwloc library: https://www.open-mpi.org/projects/hwloc/doc/v1.2/group__hwlocality__cpubinding.php.
The job shell might use hwloc for binding, but the IMP will need to create a container from Rlocal and I don't think the hwloc topology format nor the library itself will be helpful there.
And I was attempting to minimize the complexity of the IMP, if it needs to generate Rlocal.
I think we concluded that the most efficient approach is that Rlocal is input to the IMP, not generated by the IMP.
If it is later decided that the input to the IMP should be R, from which the IMP would generate its own Rlocal, then I would expect the IMP would utilize a library from the enclosing instance to operate on R.
In either case the IMP will always treat R as opaque and should not need to know its specific format.
And I was attempting to minimize the complexity of the IMP, if it needs to generate Rlocal. If this turns out to be no problem for the IMP, then I'm fine with using GraphML to represent R.
I thought you had a good point in that node-local resource discovery from R should be efficient. I am sure we will revisit this as an "optimization" topic as we get down to the implementation...
To summarize the discussion points so far:
From my perspective, gathering a bit more feedback on the GraphML proposal will help.
Thank you for the summary, @dongahn! It seems to me this discussion has bearing on whether to make GraphML be the format of R.
Thanks @dongahn.
It seems to me this discussion has bearing on whether to make GraphML be the format of R.
Yeah, the question is whether GraphML can be the baseline format of R
- The job shell will get R to do binding
The job shell, as well as any sub-instance started by the shell may require R in such a format that it can be parsed and used for binding or as configuration input without a special library from the parent, which may be running as a different user. This makes me think that a simpler "baseline format" may be required. Instance implementations may still use GraphML internally, but they would need a converter to and from the baseline format, with extra information stored in the extensions. At this point I'm not sure otherwise how the job shell and other instances running potentially different versions of Flux could reliably and safely make use of R directly.
Just for argument's sake,
The job shell, as well as any sub-instance started by the shell may require R in such a format that it can be parsed
GraphML is XML, and it can be parsed easily.
and used for binding
The parsed objects can be easily traversed and used for binding.
as configuration input without a special library from the parent, which may be running as a different user.
Again, this is just XML, and its dependency is an XML parser, not a special library from the parent instance.
This makes me think that a simpler "baseline format" may be required.
What do you have in mind?
This makes me think that a simpler "baseline format" may be required. Instance implementations may still use GraphML internally, but they would need a converter to and from the baseline format, with extra information stored in the extensions.
Is the worry having a baseline section and an extension section, such that one can only parse the baseline section for baseline operation?
At this point I'm not sure otherwise how the job shell and other instances running potentially different versions of Flux could reliably and safely make use of R directly.
Again, GraphML is XML, so you don't need a special library from the parent as long as the format is formally specified.
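Just to illustrate that point, here is a sketch of walking the containment relation with nothing beyond the standard-library XML parser (the fragment abbreviates the node0 subtree above; the traversal strategy is an assumption for illustration, not part of any spec):

```python
import xml.etree.ElementTree as ET

# Abbreviated fragment mirroring the node0 subtree from the graph above.
fragment = """
<graph>
  <node id="n2"><data key="name">node0</data></node>
  <node id="n3"><data key="name">socket0</data></node>
  <node id="n5"><data key="name">core0</data></node>
  <node id="n6"><data key="name">core1</data></node>
  <edge source="n2" target="n3"><data key="esubsystems">{containment: "contains"}</data></edge>
  <edge source="n3" target="n5"><data key="esubsystems">{containment: "contains"}</data></edge>
  <edge source="n3" target="n6"><data key="esubsystems">{containment: "contains"}</data></edge>
  <edge source="n3" target="n2"><data key="esubsystems">{containment: "in"}</data></edge>
</graph>
"""

root = ET.fromstring(fragment)
names = {n.get("id"): n.find("data").text for n in root.findall("node")}

# Keep only the downward ("contains") edges so we can walk the tree top-down.
children = {}
for e in root.findall("edge"):
    if '"contains"' in (e.find("data").text or ""):
        children.setdefault(e.get("source"), []).append(e.get("target"))

def descendants(vid):
    """Depth-first enumeration of all vertices contained under vid."""
    out = []
    for child in children.get(vid, []):
        out.append(names[child])
        out.extend(descendants(child))
    return out

print(descendants("n2"))  # ['socket0', 'core0', 'core1']
```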
GraphML is XML, and it can be parsed easily.
Ok, I was worried that the boost::graph library or a special API from the parent instance would be required.
What do you have in mind?
I don't have a good idea. But a simple baseline format could be a YAML representation similar to that used in jobspec. This would cut the library requirements down from XML+YAML to just YAML. This is just one idea.
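Purely as a sketch of what such a YAML baseline might look like (every key name below is invented for illustration, not a proposed format), vertices and edges could be listed explicitly:

```yaml
# Hypothetical YAML encoding of part of the example graph; all field
# names are illustrative assumptions, not a specified format.
vertices:
  - id: n3
    type: socket
    name: socket0
    path: /small0/rack0/node0/socket0
  - id: n5
    type: core
    name: core0
    path: /small0/rack0/node0/socket0/core0
edges:
  - {source: n3, target: n5, relation: contains}
```

The point is only that edges would need first-class constructs that jobspec's YAML does not currently have.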
Is the worry having a baseline section and an extension section, such that one can only parse the baseline section for baseline operation?
Not so much a worry as a way to have a baseline format that can be documented in an RFC and which all flux instances and utilities can understand. Then, for advanced schedulers, etc, the extensions section could be filled with extra information needed for the specific implementation.
Again, GraphML is XML, so you don't need a special library from the parent as long as the format is formally specified.
Ok, understood. I was possibly confused about how parsing of GraphML works. It would still be nice to keep the number of different encodings down (we'll now have json, yaml, and xml), but for now I'm fine with whatever is most efficient for you.
Ok, I was worried that the boost::graph library or a special API from the parent instance would be required.
There should be a number of libraries beyond boost that can do this. Probably good idea to document those. I will do that.
I don't have a good idea. But a simple baseline format could be a YAML representation similar to that used in jobspec. This would cut the library requirements down from XML+YAML to just YAML. This is just one idea.
YAML probably works as well. But we would need to extend the constructs to support edges, etc. I'm not saying this cannot be done, but like I said, if the graph constructs are already there in GraphML -- tested and hardened with existing libraries -- my rationale was: why reinvent those? For jobspec, YAML makes sense because it needs to be more human readable. But R will likely be meant more for "machines" to process than for human users...
Ok, understood. I was possibly confused about how parsing of GraphML works. It would still be nice to keep the number of different encodings down (we'll now have json, yaml, and xml), but for now I'm fine with whatever is most efficient for you.
I wondered about introducing yet another encoding, XML. But then I realized we have already introduced it, since hwloc uses that encoding.
I wondered about introducing yet another encoding, XML. But then I realized we have already introduced it, since hwloc uses that encoding.
Yes, but there are no real users of that in flux as yet. (the xml topology is just passed to other hwloc commands)
YAML probably works as well. But we would need to extend the constructs to support edges, etc. I'm not saying this cannot be done, but like I said, if the graph constructs are already there in GraphML -- tested and hardened with existing libraries -- my rationale was: why reinvent those?
Thanks, good points. I don't want to stall your progress with my minor questions. However it has been useful for my understanding, so thanks!
There is also the benefit that there seem to be many graph viewers that take in GraphML (e.g., Gephi).
It might help me if we could work through how some simple use cases would work using the GraphML above... e.g., finding the correct layout of tasks across R for simple task slot shapes (e.g., 1 core, 1 socket, 1 socket, 1 core, etc.).
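As a toy version of this use case, the sketch below lays out task slots of shape "1 core" or "1 socket" across the cores of the example graph. A reduced shape sequence is used, since the full "1 core, 1 socket, 1 socket, 1 core" sequence would need more cores than the example R contains; the greedy matching policy is an assumption for illustration only:

```python
# Socket-to-core layout mirroring the example graph above.
sockets = {
    "socket0": ["core0", "core1"],
    "socket1": ["core2", "core3"],
}

def layout(slot_shapes):
    """Greedily assign whole sockets or single cores to each requested slot."""
    free = {s: list(cores) for s, cores in sockets.items()}
    slots = []
    for shape in slot_shapes:
        if shape == "socket":
            # Take the first socket with all of its cores still free.
            s = next(s for s, c in free.items() if len(c) == len(sockets[s]))
            slots.append(("socket", s, free.pop(s)))
        elif shape == "core":
            # Take one core from the first socket that still has one.
            s = next(s for s, c in free.items() if c)
            slots.append(("core", s, [free[s].pop(0)]))
    return slots

print(layout(["core", "socket", "core"]))
```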
This issue is being opened to start a discussion on the use cases, API, and/or specification for R as in RFC 15. R is the serialized version of any resource set; it is presumably produced by the serializer described in RFC 4, consumed by the resource service in an instance as configuration, and used by the IMP and job shell to determine the shape of containment and local resource slots.
In essence, the R format will be the way composite resource and resource configuration information will be transmitted to and from instances of Flux.
Ideally, the purpose of this issue is to determine the format of R such that a new RFC could be drafted.
To get the discussion started, here are some high level requirements and use cases for R:
R should act as resource configuration input to an instance, therefore it may be that configuration of even the system instance is written in R spec, or the configuration language (RDL?) generates R. (in fact, one use case might be to directly generate R from hwloc data)
Execution service in an instance needs to be able to generate Rlocal from R for each rank. So given a rank or even generic "resource vertex", there should be a function to generate an Rn from R, where Rn is a hierarchical subset of R.
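A sketch of one possible path-based rule for this, following the earlier suggestion that Rn include the target vertex's ancestors up to the root (the rule and helper function are illustrative proposals, not a specified algorithm; the paths come from the example graph):

```python
# Containment paths drawn from the example graph above.
paths = [
    "/small0",
    "/small0/rack0",
    "/small0/rack0/node0",
    "/small0/rack0/node0/socket0",
    "/small0/rack0/node0/socket0/core0",
    "/small0/rack0/node0/socket1/core2",
]

def rn_for(target_path, all_paths):
    """Keep the target vertex, everything below it, and its ancestors."""
    keep = set()
    for p in all_paths:
        if p == target_path or p.startswith(target_path + "/"):
            keep.add(p)  # the target itself and its hierarchical subset
        elif target_path.startswith(p + "/"):
            keep.add(p)  # an ancestor of the target, up to the root
    return sorted(keep)

print(rn_for("/small0/rack0/node0/socket0", paths))
```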
The containment plugins in the IMP will need to query Rlocal for the list of local resources of given type or types on which the containment plugins operate. For instance, a memory plugin will need to determine the amount and location of RAM contained in Rlocal in order to set up memcg limits. Similarly a Socket/CPU plugin would need to iterate over or query the list of local sockets/cores in Rlocal to add these to the cgroup.
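For illustration, the kind of query a containment plugin might run against an already-parsed Rlocal could look like the following (the in-memory representation and the GiB unit are assumptions for the sketch; only the type and size values come from the example graph, which leaves units unspecified):

```python
# Hypothetical parsed form of an Rlocal covering socket0 of the example graph.
rlocal = [
    {"type": "core",   "name": "core0",   "size": 1},
    {"type": "core",   "name": "core1",   "size": 1},
    {"type": "memory", "name": "memory0", "size": 4},  # per-socket memory pool
]

def total_by_type(vertices, rtype):
    """Total 'size' over all vertices of the requested resource type."""
    return sum(v["size"] for v in vertices if v["type"] == rtype)

# A memory plugin would derive its memcg limit from the memory vertices,
# while a CPU plugin would iterate the cores to populate a cpuset.
mem_limit_gib = total_by_type(rlocal, "memory")
cores = [v["name"] for v in rlocal if v["type"] == "core"]
print(mem_limit_gib, cores)  # 4 ['core0', 'core1']
```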
The job shell will use jobspec+R to determine the local 'task slots' that map to commands in the 'tasks' section.
Dependency management here might get challenging. The IMP is a user of Rn, but we want to ideally eliminate dependencies in the flux-security project on other flux-framework projects. Possible approaches here might include: