Closed. grondo closed this issue 5 years ago.
Nice start, @grondo. Just wanted to add some distinctions... The R will contain node-resident resources that will become Rlocal
On the other hand, for resources for which we will have controls (e.g., license managers), these resources would become part of R but would only be relevant to specific agents like license managers, and not the brokers (unless we have a broker devoted to controlling a license manager).
Power and network bandwidth could fall into either of these two cases depending on whether a throttle was available to limit power or bandwidth serving the allocated resources.
Thanks, @lipari! You bring up some good points.
On the other hand, for resources for which we will have controls (e.g., license managers), these resources would become part of R but would only be relevant to specific agents like license managers, and not the brokers (unless we have a broker devoted to controlling a license manager).
These generic "global" job resources like licenses, burst buffer storage space, bandwidth, etc., will have to be passed in to some sort of containment management, in case there is some action required to give access to the licenses, or reserve space, etc. My thought is that these would be included in Rlocal, and then the container plugin specific to that resource type would be able to decide how to contain or make available that specific resource. For example, a simple approach would be to have the plugins on the first node of a job operate on these resources.
What remains to be decided is how these resources do get included in Rn for each IMP n. We may have to put some tag on these kinds of global resources so they are automatically included in any Rn.
I would also argue that Rlocal for any IMP should include not only the resource vertex(es) on which the IMP will be run, but also all the parents of the vertex up to the root (in the hierarchical resource tree). This will give the IMP containment plugins a bit more information about where they are running in the global hierarchy, which could be useful, and also allows us to keep global resources discussed above in their proper place in any hierarchy.
E.g. instead of a simple Rlocal like socket[0]->core[0-3]
for an IMP managing a single socket, you might instead have llnl->cluster[5](name=hype)->node[113](name=hype113)->socket[0]->core[0-3]
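As an illustration only (the `->` path syntax simply mirrors the example above; nothing here is a real Flux format), a sketch of what the extra ancestry buys a containment plugin: it can separate its own resources from the path that locates them in the global hierarchy.

```python
# Illustrative only: split an anchored Rlocal containment path into
# ancestry and local resources. The "->" syntax mirrors the example;
# it is not an actual Flux format.

def split_levels(rlocal_path):
    """Return the hierarchy levels of an Rlocal containment path."""
    return rlocal_path.split("->")

anchored = ("llnl->cluster[5](name=hype)->node[113](name=hype113)"
            "->socket[0]->core[0-3]")

levels = split_levels(anchored)
# The plugin's own resources are the tail; everything before is ancestry.
ancestry, local = levels[:-2], levels[-2:]
print(ancestry)  # ['llnl', 'cluster[5](name=hype)', 'node[113](name=hype113)']
print(local)     # ['socket[0]', 'core[0-3]']
```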
My thought is that these would be included in Rlocal, and then the container plugin specific to that resource type would be able to decide how to contain or make available that specific resource.
I would also argue that Rlocal for any IMP should include not only the resource vertex(es) on which the IMP will be run, but also all the parents of the vertex up to the root (in the hierarchical resource tree).
I'm worried that these new parts of the plan are undermining the original need/goal for Rlocal. If the Rlocal contains resource types that the IMP doesn't directly control, then we are back to the original situation we had with a single large R. The IMP now needs to parse out and identify only its locally relevant resource types from a larger tree of information. Rather than going that route, I wonder if it wouldn't make more sense to just go back to having a single complete R document that is sent everywhere including the IMP (although this time around we are deciding that the complete R no longer needs to be signed).
In other words, if the IMP needs to parse out potentially extraneous information from Rlocal and find the point where the information begins to align with its local resources, then it could do that just as easily from a large R. What is the value then of Rlocal?
In other words, if the IMP needs to parse out potentially extraneous information from Rlocal and find the point where the information begins to align with its local resources, then it could do that just as easily from a large R. What is the value then of Rlocal?
Rlocal allows the instance that is starting the IMP to control the shape of the container under which the IMP will execute the job shell, instead of relying on the IMP to make that decision, when it doesn't have or need the necessary data to make the correct decision about what goes in the local "container".
I guess where you might differ in opinion is whether the parents of a resource are part of that conceptual container. I tend to think a container that is just "cpu0" doesn't make any sense; you need at least node->socket[0]->cpu[0] to resolve the container. Taking that idea a bit further, in our resource model, node0 is not a valid container either; you need llnl->hype->node[0]. The IMP won't have any containment plugins that try to operate on resource types "datacenter", "node", or "switch", etc., so the extra resources will be safely ignored. However, if a containment plugin happens to need this information, it may at least be able to get it. (The location of off-node resources is the main use case I'm considering now.)
Another benefit of Rlocal is potentially eliminating dependence on flux-sched or flux-core provided resource query language that might be required to perform the intersection between local resources and global R (though even if you had this support for the IMP, I'm not convinced the IMP alone could make the right decision here). To realize this particular goal, Rlocal will need to be simple enough that the IMP or its plugins could parse it easily themselves.
A specific case where Rlocal might be required is if an instance, for testing or other good reason, would like to start more than one IMP per broker. To do this, the instance would break up local resources into multiple Rlocal and pass to each IMP. I don't see how it would be possible if the instance passed the global R to each IMP.
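A hypothetical sketch of that multi-IMP case (the field names follow no agreed-upon specification; this is just to make the idea concrete): the instance splits one node's resources round-robin into several Rlocal documents, one per IMP.

```python
# Hypothetical sketch: an instance splitting one node's local resources
# into several Rlocal documents, one per IMP. The Rlocal layout here is
# invented for illustration.

def split_rlocal(local, n_imps):
    """Divide a flat local cpu set round-robin across n_imps IMPs."""
    parts = [{"cpu": {"list": []}} for _ in range(n_imps)]
    for i, cpu in enumerate(local["cpu"]["list"]):
        parts[i % n_imps]["cpu"]["list"].append(cpu)
    for part in parts:
        part["cpu"]["count"] = len(part["cpu"]["list"])
    return parts

rlocal = {"cpu": {"list": [0, 1, 2, 3], "count": 4}}
print(split_rlocal(rlocal, 2))
# [{'cpu': {'list': [0, 2], 'count': 2}}, {'cpu': {'list': [1, 3], 'count': 2}}]
```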
BTW, I was mainly taking a long-term view on inclusion of parent resources in Rlocal, and as long as it is possible to add that support in at a future date, I'm ok with leaving it out for now. I think off-node resources like burst-buffer space and licenses could be handled by including enough metadata in those resources included in Rlocal such that a plugin could know exactly which licenses it was operating on, or which burst buffers it was reserving space in... etc.
A specific case where Rlocal might be required is if an instance, for testing or other good reason, would like to start more than one IMP per broker.
Ah. I thought we agreed that we weren't doing that, that instead the IMP always controls all resources on the node and leaves further resource masking to the jobshell. But if we are reversing that I suppose that is fine.
I think off-node resources like burst-buffer space and licenses could be handled by including enough metadata in those resources included in Rlocal such that a plugin could know exactly which licenses it was operating on, or which burst buffers it was reserving space in... etc.
If we think that the IMP is going to need to know about global resources, then again I think I'm back to thinking we should just send the full R. We can always add annotations to that if we ever want to run multiple IMPs per node. Adding global resources to Rlocal makes the name a bit of a misnomer. :)
Another benefit of Rlocal is potentially eliminating dependence on flux-sched or flux-core provided resource query language that might be required to perform the intersection between local resources and global R
The IMP reading a global R does not necessarily imply that the IMP must use a "resource query language". The IMP just needs a parser. The parser for R and Rlocal, if Rlocal is allowed to contain global resources, will be very nearly identical I think. We can choose to implement the parser twice or cut-and-paste it into the IMP if we want to keep it separate.
If we think that the IMP is going to need to know about global resources, then again I think I'm back to thinking we should just send the full R. We can always add annotations to that if we ever want to run multiple IMPs per node. Adding global resources to Rlocal makes the name a bit of a misnomer. :)
Yeah, I completely understand your sentiment. I'm fine with leaving Rlocal with only "local" resources (whatever "local" may mean), but then we have no proposed method to handle off-node resources (since IMP can only run at most within a node).
I thought we had talked about that. I think that the "execution management" module, or whatever we are calling it now, would handle off-compute-node resource setup before launching remote execution and the IMPs. There would be plugins into that module that can instantiate the various resources that people come up with.
The IMP reading a global R does not necessarily imply that the IMP must use a "resource query language". The IMP just needs a parser. The parser for R and Rlocal, if Rlocal is allowed to contain global resources, will be very nearly identical I think. We can choose to implement the parser twice or cut-and-paste it into the IMP if we want to keep it separate.
The IMP will need to parse R, but then how does it complete the intersection between locally available resources and R? It would need to generate an R' from hwloc or some other local HW query code, then take the intersection of R' and R.
If only Rlocal is sent to the IMP, it doesn't need to read the local HW configuration, it doesn't need to generate a second R' from that information, and it doesn't need to do the work of the intersection. So that feels like quite a bit of code saved from a security-significant piece of software.
I thought we had talked about that. I think that the "execution management" module, or whatever we are calling it now, would handle off-compute-node resource setup before launching remote execution and the IMPs. There would be plugins into that module that can instantiate the various resources that people come up with.
That could work, but the instance doesn't have any privilege except through the IMP. Is it a requirement that all off-node resources not require privilege to access? (This is possible, I just didn't think of it that way before.)
The IMP will need to parse R, but then how does it complete the intersection between locally available resources and R? It would need to generate an R' from hwloc or some other local HW query code, then take the intersection of R' and R.
I think it is actually a lot simpler than that if there is just a single IMP per node. It just walks the tree of data in R, and looks up each resource it sees in its internal table "oh! that belongs to me, I'll note that", "nope that doesn't belong to me, skip it". There is no complicated intersection needed, really.
There is no complicated intersection needed, really.
Ok, I guess I couldn't visualize how to make it quite that simple.
Whereas with Rlocal, the IMP would walk each type of resource for which it has a containment plugin and hand the list of those resources in Rlocal to the plugin (or alternately, each plugin could generate the list itself). No comparisons needed at all. Since there won't be containment plugins for "node", "switch", "datacenter", and other such resources, those would be safely ignored if they were there at all.
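An illustrative sketch of that dispatch (the plugin interface and Rlocal layout are assumptions, not an existing API): the IMP walks the resource types in Rlocal and hands each list to the matching containment plugin, silently skipping types with no plugin.

```python
# Illustrative sketch of the dispatch described above: walk resource
# types in Rlocal, hand each to its containment plugin, and silently
# skip types ("node", "switch", ...) for which no plugin exists.
# Plugin shapes and the Rlocal layout are invented for this sketch.

def contain(rlocal, plugins):
    """Apply each available containment plugin to its resource type."""
    handled = {}
    for rtype, resources in rlocal.items():
        plugin = plugins.get(rtype)
        if plugin is None:
            continue  # no containment plugin for this type: safely ignored
        handled[rtype] = plugin(resources)
    return handled

plugins = {"cpu": lambda r: sorted(r["list"])}
rlocal = {"node": {"list": [0]}, "cpu": {"list": [3, 1, 0, 2]}}
print(contain(rlocal, plugins))  # {'cpu': [0, 1, 2, 3]}
```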
That could work, but the instance doesn't have any privilege except through the IMP. Is it a requirement that all off-node resources not require privilege to access? (This is possible, I just didn't think of it that way before.)
It would be preferable when possible. But it can be handled on a case-by-case basis.
Doing this through the IMP could potentially introduce a fair bit of complexity. We might be back to needing a way to track authority back through multiple levels of flux instances. We were able to avoid that when the IMP was constrained to dealing with resources inside of its local node's container.
Whereas with Rlocal, the IMP would walk each type of resource for which it has a containment plugin and hand the list of those resources in Rlocal to the plugin
Yeah, actually it is slightly more complicated than I stated, but not much. In each case where it finds a resource it owns, it needs to remember that AND all of the resources under it in the tree. But that is still pretty straightforward, I think.
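A rough sketch of the walk described above (the tree shape and the "name" field are invented for illustration): scan the global R tree, and keep each vertex this node owns plus everything under it, while also retaining the ancestry path down to it.

```python
# Rough sketch of the walk described above: keep each vertex this node
# owns AND all of the resources under it in the tree. The tree shape
# and the "name" field are invented for illustration only.

def prune_to_local(vertex, hostname, inside=False):
    """Return the subtree of R relevant to hostname, or None."""
    mine = inside or vertex.get("name") == hostname
    children = [prune_to_local(c, hostname, mine)
                for c in vertex.get("children", [])]
    children = [c for c in children if c is not None]
    if mine or children:
        return dict(vertex, children=children)
    return None  # nothing local anywhere under this vertex

R = {"name": "cluster0", "children": [
        {"name": "node0", "children": [{"name": "core0", "children": []}]},
        {"name": "node1", "children": [{"name": "core1", "children": []}]}]}
local = prune_to_local(R, "node0")
# local keeps cluster0 -> node0 -> core0 and drops the node1 subtree
```

Note this variant also preserves the parent path (cluster0), matching the earlier point about keeping ancestry information available.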
I don't think the IMP can only look at types, even for an Rlocal. It needs to look at either names or counts too. Flux allows two approaches to something like a "socket": we can either represent all of the sockets on a node as a single resource vertex and use the count within that vertex to represent all of the sockets, or we can have a resource vertex with a name/id/uuid/whatever for each of the sockets.
In the latter case, with individual resource vertices, the scheduler picks the exact resources, and the IMP just needs to carry out the instructions. In the former, with counted resources, the IMP needs to be more aware of what is happening with allocations on the node (for instance, if nodes are shared). But actually, I'm not sure that the IMP can read the scheduler's mind well enough to always make the same selection pattern... and that could lead to resources (sockets/cores) being shared that the scheduler could have avoided. So actually, for things like sockets and cores I suspect that we will always use separate resource vertices.
Since there won't be containment plugins for "node" "switch" "datacenter", and other resources, those would be safely ignored if they were there at all.
I think it is reasonable for the IMP to know what node it is on. And once it knows that, it will be fairly easy to pick out its own resources from the global R.
Doing this through the IMP could potentially introduce a fair bit of complexity. We might be back to needing a way to track authority back through multiple levels of flux instances. We were able to avoid that when the IMP was constrained to dealing with resources inside of its local node's container.
I'm not sure that is required, but either way you need to verify ownership of the resource whether it is through an IMP plugin or a plugin in the execution system.
One example I can think of is if you restricted access to a license server through some sort of iptables rules. For jobs granted access to the license server you would have to allow iptables rules to be modified in the network namespace of the job. I can envision how this could be done with an IMP plugin, but not at all what kind of system you'd need if you tried to do it through a plugin in the unprivileged execution modules. Besides, the containers can't exist until the IMP runs so there is an ordering problem there.
For this case would you consider licenses a "local" resource, or perhaps rename them network access tokens or something? (If that is the case then I could see that this scheme could work)
I think it is reasonable for the IMP to know what node it is on. And once it knows that, it will be fairly easy to pick out its own resources from the global R.
Ok, but why? I guess it is immaterial if we pass R or Rlocal since the IMP always filters R to Rlocal anyway (it will just be a noop in the second case). The question I keep struggling with is why you'd want to do that work in your privileged process if you don't have to?
I don't think the IMP can only look at types even for an Rlocal. It needs to look at either names or counts too.
Yes, that is what I meant. Each containment plugin will need to know the list and count (especially for RAM) of each of the resource types it knows how to deal with.
Here's my proposal for a simplified Rlocal to satisfy near-term milestones, if it is acceptable that R and Rlocal have different specifications.
The simplified Rlocal as input to the IMP will be a JSON document with a list of resource types for which the IMP should create a container. The IMP will support a list of plugins that operate on one or more of these types and access Rlocal directly to determine the parameters of the containers they can create. E.g., a memory and cpu "cgroups" container would read the "socket", "cpu", and "memory" fields of the Rlocal dictionary, add cpus and mems to a cpuset cgroup, and constrain memory with a memory cgroup.
The format of Rlocal might look (very roughly), something like:
{
  "cpu": { "list": [0, 1, 2, 3], "count": 4 },
  "socket": { "list": [0], "count": 1 },
  "memory": { "count": 1024, "units": "MB" }
}
This is just off the cuff so there may be missing fields, but is meant to give a general idea.
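To make the plugin side of this concrete, here is a minimal sketch (assuming the simplified Rlocal dictionary above; the cgroup parameter names are the classic cgroup-v1 file names, used here purely for illustration). It only computes the values such a plugin would write; no actual cgroup interaction is attempted.

```python
# Minimal sketch of the "cgroups" plugin behavior described above,
# assuming the simplified Rlocal dictionary format proposed here.
# It only computes the values a real plugin would write into
# cpuset/memory cgroup files; no cgroup interaction is attempted.

def cgroup_params(rlocal):
    """Map a simplified Rlocal onto cpuset/memory cgroup settings."""
    assert rlocal["memory"].get("units") == "MB"  # only MB in this sketch
    return {
        "cpuset.cpus": ",".join(str(c) for c in rlocal["cpu"]["list"]),
        "cpuset.mems": ",".join(str(s) for s in rlocal["socket"]["list"]),
        "memory.limit_in_bytes": rlocal["memory"]["count"] * 1024 * 1024,
    }

rlocal = {
    "cpu": {"list": [0, 1, 2, 3], "count": 4},
    "socket": {"list": [0], "count": 1},
    "memory": {"count": 1024, "units": "MB"},
}
print(cgroup_params(rlocal))
```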
If the IMP must take the full R as input, then I'd suggest a plugin to the IMP, provided by the instance, that would generate this format as input to the IMP containment plugin infrastructure. That would further require that the IMP operate in privilege-separation mode so that the plugin operating on R runs with the permissions of the instance owner. This would avoid copy-and-paste parsing code between flux-framework projects, and allow a single, system-installed version of flux-security to work with multiple versions of other Flux projects that may generate R with different formats or capabilities.
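A sketch of that privilege-separation pattern (the R field names "execution"/"cpus" are invented here, and the actual setuid() to the instance owner is only indicated by a comment): the notionally privileged parent never parses the untrusted, complex R itself; an unprivileged child reduces R to the small simplified form, and the parent defensively validates only that form.

```python
# Sketch of the privilege-separation idea above: an unprivileged child
# parses the untrusted R and emits the tiny simplified form; the parent
# validates only that form. Field names are invented for this sketch.

import json
import subprocess
import sys

CHILD = r"""
import json, sys
# A real IMP would setuid() to the instance owner here before parsing.
R = json.load(sys.stdin)                            # untrusted input
print(json.dumps({"cpu": R["execution"]["cpus"]}))  # simplified "Rlocal"
"""

def parse_r_unprivileged(r_text):
    out = subprocess.run([sys.executable, "-c", CHILD], input=r_text,
                         capture_output=True, text=True, check=True).stdout
    simplified = json.loads(out)
    # Parent-side defensive validation of the simple format only:
    if set(simplified) != {"cpu"} or \
            not all(isinstance(c, int) for c in simplified["cpu"]):
        raise ValueError("child produced malformed simplified R")
    return simplified

r = json.dumps({"execution": {"cpus": [0, 1]}})
print(parse_r_unprivileged(r))  # {'cpu': [0, 1]}
```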
The question I keep struggling with is why you'd want to do that work in your privileged process if you don't have to?
Me too! Which is why adding non-local things to Rlocal that need to be skipped seems like exactly the same kind of processing in a privileged process that we were trying to avoid by having Rlocal in the first place. I don't see much difference between skipping 3 things or skipping 1000. The parsing code needs to be rock solid in either case, so the implementation work seems the same. But if they are the same, then why put the extra effort into implementing Rlocal in the first place?
One example I can think of is if you restricted access to a license server through some sort of iptables rules.
That is interesting...is that how it usually works? I was thinking more that some number of floating licenses would be allocated to an entire job, and they could choose to use them where they like. But I guess it depends on the sophistication of the particular license server.
That is interesting...is that how it usually works? I was thinking more that some number of floating licenses would be allocated to an entire job, and they could choose to use them where they like. But I guess it depends on the sophistication of the particular license server.
I actually don't know, but it was a real proposal from somewhere (but not sure if it was ever implemented).
Me too! Which is why adding non-local things to Rlocal that need to be skipped seems like exactly the same kind of processing in a privileged process that we were trying to avoid by having Rlocal in the first place. I don't see much difference between skipping 3 things or skipping 1000. The parsing code needs to be rock solid in either case, so the implementation work seems the same. But if they are the same, then why put the extra effort into implementing Rlocal in the first place?
You do make good points. I guess you are implementing Rlocal in either case; it is just a question of where it is generated. I say you have my consensus that non-local things should not go into Rlocal, and let's leave it at that for now. However, also take a look at my proposal above for possibly using instance-provided code to parse R from within the IMP by running it in an unprivileged child.
Either way, I don't think we're closing the door on including either more or less resources in the R in future implementations, so I'm thinking we can move on for now...
However, also take a look at my proposal above for possibly using instance-provided code to parse R from within the IMP by running it in an unprivileged child.
The devil is in the details there. At some point the IMP needs to ingest resource data in some form from an untrusted source. It could either be by parsing a resource document itself, or from parsing the output of something that parses the resource document. But at some point it always needs to be defensive and validate its input. I would need to know more to evaluate it, I think.
I'm not really clear why parsing R seems more scary than parsing Rlocal, or why it would necessarily need to be handled separately through plugins and/or privilege separation. I think R could be handled by an internal parser in the IMP exactly the same way that Rlocal is being proposed to be handled.
I'm not really clear why parsing R seems more scary than parsing Rlocal, or why it would necessarily need to be handled separately through plugins and/or privilege separation. I think R could be handled by an internal parser in the IMP exactly the same way that Rlocal is being proposed to be handled.
Ok, again I understand your point.
I guess I'm arguing that Rlocal as used internally by the IMP is a different, much simpler format than R (sorry, maybe it shouldn't be called Rlocal anymore?). The amount of code being used would therefore be smaller, and therefore there would be provably fewer bugs.
The Rlocal format could evolve much more slowly than R, though I admit it hasn't been proven that R will change at any kind of pace that would require frequent updates to the flux-security project, so perhaps this is a weaker argument.
Also it just kind of seems to make sense to send less data to the IMP, even though this doesn't have a security argument. For a job with 1000 cores on 1000 nodes, R is potentially 1000x the size of Rlocal....
Also it just kind of seems to make sense to send less data to the IMP, even though this doesn't have a security argument. For a job with 1000 cores on 1000 nodes, R is potentially 1000x the size of Rlocal....
Totally agree.
To summarize, I think consensus here is that Rlocal should contain only node local resources, but that it is still useful to send only a subset of R, perhaps in a simpler format, as input to the IMP.
As far as the topic of this issue, which is the specification of R, I don't think that changes much. We still need to be able to generate some Rlocal from R, and therefore we'll need some kind of format of R that allows this within an instance.
To summarize, I think consensus here is that Rlocal should contain only node local resources, but that it is still useful to send only a subset of R, perhaps in a simpler format, as input to the IMP.
IMHO this approach is sound.
As I already discussed with @grondo, extracting Rlocal should be fully distributed so that a centralized component doesn't become a scalability bottleneck.
As for the many execution service modules needing to fetch R to extract Rlocal, I believe this should be scalable, as it would essentially have the performance complexity of a broadcast... At some point we may want to measure this, though.
Sorry if this was too obvious.
I would like to have a bit more discussion on the main topic of this issue: the R format. As @grondo nicely captured at the beginning, R will serve as the input and output of a range of components. For example, it will be the input to the resource service (the resource-selection part of scheduling) in a nested instance, as well as to the remote execution service and the job shell. Similarly, it will be the output of the resource service, and also of other related services, utilities, and even manual effort (e.g., the resource service; resrc).
It seems an important decision we can all benefit from at this point would be whether we want to spec out a common R format, or to go with an opaque approach with just common abstractions on it agreed upon.
W/o looking at this too closely, if we go with a graph format with an optional ability to annotate extra information on resource vertices and edges (i.e., the concepts described in RFC4), we should be able to describe the format captured in all of the above use cases, and I can contribute to that effort based on my current resource-query experience.
But a part of me is also wondering whether this kind of rigor on the format is necessary at this point.
Another approach could be to require, for each of the format variants, a library that exposes a set of common operations, including a "reader" and a "writer."
The former would be a bit more rigorous, but at the same time it could be a bit more time consuming. But maybe it is something that we ought to do anyway.
There is also a third possibility, which is to start with the opaque approach above; as we reach agreement on the common abstractions, we will come to know the requirements on the format better, and at that point we can formalize it. Since it seems we will have to write libraries that expose those abstractions around R anyway, this is not a bad idea IMHO either.
Thoughts?
Thanks @dongahn. Some very good thoughts above.
Another element we need to keep in mind is how the various components of Flux would manage the dependencies involved in interpreting and managing R. Ideally, perhaps, the R format would be supported directly by flux-core, so that the execution system, which depends on it, can be tested stand-alone. However, this approach might lead to a lowest-common-denominator format, which may not support the needs of advanced resource services and/or schedulers.
An argument might also be made that the R format is solely the domain of a resource service, and the R interpreter and generator should therefore be offered by that service, though that would leave the execution service dependent on resource services being installed, which might not be what we want.
Another approach would be to keep the R spec opaque, as you said, with each type supplying a corresponding API that satisfies the requirements of all use cases outside of resource service internals. Somehow the required implementation would be encoded in R itself, and the correct implementation loaded at runtime.
One more idea would be to have a very basic R specification, but allow a section for "extensions" which might be ignored by most components, but used for any extra information needed by the resource service itself. The base R spec might not even need an API if it was simple enough, thereby removing the pain of deciding where the dependent libraries might live.
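One way to read that "extensions" idea, as a sketch (every key here is invented for illustration; none of this is an agreed-upon R layout): components that only understand the base specification use the base keys and carry the extensions blob through untouched for the resource service.

```python
# Sketch of the base-spec-plus-extensions idea above: a component that
# only understands the base R keys uses those and treats the
# "extensions" section as an opaque pass-through. All keys invented.

BASE_KEYS = {"version", "resources"}

def read_base(r):
    """Split R into the base section and the opaque extensions."""
    base = {k: v for k, v in r.items() if k in BASE_KEYS}
    extensions = r.get("extensions", {})  # ignored by most components
    return base, extensions

R = {"version": 1,
     "resources": [{"type": "node", "count": 4}],
     "extensions": {"sched": {"policy": "hypothetical"}}}
base, ext = read_base(R)
```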
Of the four possibilities, the last two seem most attractive from my perspective. A bit on the 4th option:
One more idea would be to have a very basic R specification, but allow a section for "extensions" which might be ignored by most components, but used for any extra information needed by the resource service itself. The base R spec might not even need an API if it was simple enough, thereby removing the pain of deciding where the dependent libraries might live.
This is a very interesting idea from my perspective, @grondo. I consider the resource representation needed by flux-core elements a "proper" subset of the representation needed by resource, so this can work out nicely if we can reasonably separate out the baseline from the extension. I don't know if "section" is the right construct, but I got the idea.
Why don't I put up a few examples of the graph representations I plan to use in resource, and see what belongs to the baseline and what belongs to the core elements, and whether these are easily separable. I will use GraphML, but any other markup language capable of describing a graph would do as well.
sounds good @dongahn! Thanks!
Here is a GraphML example that describes an R with 1 cluster containing 1 rack with 1 node with 2 sockets, each with 2 cores, 1 GPU, and memory. This was actually emitted from my resource-query utility. I will explain parts of this a bit in a separate posting as relevant to this ticket.
The general description of GraphML itself can be found in something like this. The reason I used GraphML is that our resource model is essentially a graph, and as such I didn't think I needed to reinvent the wheel with other markup languages. Plus, there are already plenty of GraphML readers and writers out there, including the Boost Graph Library.
<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
<key id="basename" for="node" attr.name="basename" attr.type="string" />
<key id="id" for="node" attr.name="id" attr.type="long" />
<key id="subsystems" for="node" attr.name="member_of" attr.type="string" />
<key id="esubsystems" for="edge" attr.name="member_of" attr.type="string" />
<key id="name" for="node" attr.name="name" attr.type="string" />
<key id="paths" for="node" attr.name="paths" attr.type="string" />
<key id="props" for="node" attr.name="props" attr.type="string" />
<key id="size" for="node" attr.name="size" attr.type="int" />
<key id="type" for="node" attr.name="type" attr.type="string" />
<key id="unit" for="node" attr.name="unit" attr.type="string" />
<graph id="G" edgedefault="directed" parse.nodeids="canonical" parse.edgeids="canonical" parse.order="nodesfirst">
<node id="n0">
<data key="basename">small</data>
<data key="id">0</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">small0</data>
<data key="paths">{containment: "/small0"}</data>
<data key="props"></data>
<data key="size">1</data>
<data key="type">cluster</data>
<data key="unit"></data>
</node>
<node id="n1">
<data key="basename">rack</data>
<data key="id">0</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">rack0</data>
<data key="paths">{containment: "/small0/rack0"}</data>
<data key="props"></data>
<data key="size">1</data>
<data key="type">rack</data>
<data key="unit"></data>
</node>
<node id="n2">
<data key="basename">node</data>
<data key="id">0</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">node0</data>
<data key="paths">{containment: "/small0/rack0/node0"}</data>
<data key="props"></data>
<data key="size">1</data>
<data key="type">node</data>
<data key="unit"></data>
</node>
<node id="n3">
<data key="basename">socket</data>
<data key="id">0</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">socket0</data>
<data key="paths">{containment: "/small0/rack0/node0/socket0"}</data>
<data key="props"></data>
<data key="size">1</data>
<data key="type">socket</data>
<data key="unit"></data>
</node>
<node id="n4">
<data key="basename">socket</data>
<data key="id">1</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">socket1</data>
<data key="paths">{containment: "/small0/rack0/node0/socket1"}</data>
<data key="props"></data>
<data key="size">1</data>
<data key="type">socket</data>
<data key="unit"></data>
</node>
<node id="n5">
<data key="basename">core</data>
<data key="id">0</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">core0</data>
<data key="paths">{containment: "/small0/rack0/node0/socket0/core0"}</data>
<data key="props"></data>
<data key="size">1</data>
<data key="type">core</data>
<data key="unit"></data>
</node>
<node id="n6">
<data key="basename">core</data>
<data key="id">1</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">core1</data>
<data key="paths">{containment: "/small0/rack0/node0/socket0/core1"}</data>
<data key="props"></data>
<data key="size">1</data>
<data key="type">core</data>
<data key="unit"></data>
</node>
<node id="n7">
<data key="basename">core</data>
<data key="id">2</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">core2</data>
<data key="paths">{containment: "/small0/rack0/node0/socket1/core2"}</data>
<data key="props"></data>
<data key="size">1</data>
<data key="type">core</data>
<data key="unit"></data>
</node>
<node id="n8">
<data key="basename">core</data>
<data key="id">3</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">core3</data>
<data key="paths">{containment: "/small0/rack0/node0/socket1/core3"}</data>
<data key="props"></data>
<data key="size">1</data>
<data key="type">core</data>
<data key="unit"></data>
</node>
<node id="n9">
<data key="basename">gpu</data>
<data key="id">0</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">gpu0</data>
<data key="paths">{containment: "/small0/rack0/node0/socket0/gpu0"}</data>
<data key="props"></data>
<data key="size">1</data>
<data key="type">gpu</data>
<data key="unit"></data>
</node>
<node id="n10">
<data key="basename">gpu</data>
<data key="id">1</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">gpu1</data>
<data key="paths">{containment: "/small0/rack0/node0/socket1/gpu1"}</data>
<data key="props"></data>
<data key="size">1</data>
<data key="type">gpu</data>
<data key="unit"></data>
</node>
<node id="n11">
<data key="basename">memory</data>
<data key="id">0</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">memory0</data>
<data key="paths">{containment: "/small0/rack0/node0/socket0/memory0"}</data>
<data key="props"></data>
<data key="size">4</data>
<data key="type">memory</data>
<data key="unit"></data>
</node>
<node id="n12">
<data key="basename">memory</data>
<data key="id">0</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">memory0</data>
<data key="paths">{containment: "/small0/rack0/node0/socket1/memory0"}</data>
<data key="props"></data>
<data key="size">4</data>
<data key="type">memory</data>
<data key="unit"></data>
</node>
<edge id="e0" source="n0" target="n1">
<data key="esubsystems">{containment: "contains"}</data>
</edge>
<edge id="e1" source="n1" target="n0">
<data key="esubsystems">{containment: "in"}</data>
</edge>
<edge id="e2" source="n1" target="n2">
<data key="esubsystems">{containment: "contains"}</data>
</edge>
<edge id="e3" source="n2" target="n1">
<data key="esubsystems">{containment: "in"}</data>
</edge>
<edge id="e4" source="n2" target="n3">
<data key="esubsystems">{containment: "contains"}</data>
</edge>
<edge id="e5" source="n2" target="n4">
<data key="esubsystems">{containment: "contains"}</data>
</edge>
<edge id="e6" source="n3" target="n2">
<data key="esubsystems">{containment: "in"}</data>
</edge>
<edge id="e7" source="n3" target="n5">
<data key="esubsystems">{containment: "contains"}</data>
</edge>
<edge id="e8" source="n3" target="n6">
<data key="esubsystems">{containment: "contains"}</data>
</edge>
<edge id="e9" source="n3" target="n9">
<data key="esubsystems">{containment: "contains"}</data>
</edge>
<edge id="e10" source="n3" target="n11">
<data key="esubsystems">{containment: "contains"}</data>
</edge>
<edge id="e11" source="n4" target="n2">
<data key="esubsystems">{containment: "in"}</data>
</edge>
<edge id="e12" source="n4" target="n7">
<data key="esubsystems">{containment: "contains"}</data>
</edge>
<edge id="e13" source="n4" target="n8">
<data key="esubsystems">{containment: "contains"}</data>
</edge>
<edge id="e14" source="n4" target="n10">
<data key="esubsystems">{containment: "contains"}</data>
</edge>
<edge id="e15" source="n4" target="n12">
<data key="esubsystems">{containment: "contains"}</data>
</edge>
<edge id="e16" source="n5" target="n3">
<data key="esubsystems">{containment: "in"}</data>
</edge>
<edge id="e17" source="n6" target="n3">
<data key="esubsystems">{containment: "in"}</data>
</edge>
<edge id="e18" source="n7" target="n4">
<data key="esubsystems">{containment: "in"}</data>
</edge>
<edge id="e19" source="n8" target="n4">
<data key="esubsystems">{containment: "in"}</data>
</edge>
<edge id="e20" source="n9" target="n3">
<data key="esubsystems">{containment: "in"}</data>
</edge>
<edge id="e21" source="n10" target="n4">
<data key="esubsystems">{containment: "in"}</data>
</edge>
<edge id="e22" source="n11" target="n3">
<data key="esubsystems">{containment: "in"}</data>
</edge>
<edge id="e23" source="n12" target="n4">
<data key="esubsystems">{containment: "in"}</data>
</edge>
</graph>
</graphml>
As you can see from:
<key id="basename" for="node" attr.name="basename" attr.type="string" />
<key id="id" for="node" attr.name="id" attr.type="long" />
<key id="subsystems" for="node" attr.name="member_of" attr.type="string" />
<key id="esubsystems" for="edge" attr.name="member_of" attr.type="string" />
<key id="name" for="node" attr.name="name" attr.type="string" />
<key id="paths" for="node" attr.name="paths" attr.type="string" />
<key id="props" for="node" attr.name="props" attr.type="string" />
<key id="size" for="node" attr.name="size" attr.type="int" />
<key id="type" for="node" attr.name="type" attr.type="string" />
<key id="unit" for="node" attr.name="unit" attr.type="string" />
a resource pool vertex contains 9-10 base fields (uuid omitted for now):
basename
id
name
paths (JSON dictionary string)
properties (JSON dictionary string)
size
type
unit
subsystems (JSON dictionary whose keys are the subsystems this vertex belongs to)
For example, the following resource vertex construct describes a socket resource (socket0) contained within node0. Note that &quot; is the XML encoding of the double-quote character ("):
<node id="n3">
<data key="basename">socket</data>
<data key="id">0</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">socket0</data>
<data key="paths">{containment: "/small0/rack0/node0/socket0"}</data>
<data key="props"></data>
<data key="size">1</data>
<data key="type">socket</data>
<data key="unit"></data>
</node>
The following vertex construct describes a core resource (core0) contained within socket0:
<node id="n5">
<data key="basename">core</data>
<data key="id">0</data>
<data key="subsystems">{containment: "*"}</data>
<data key="name">core0</data>
<data key="paths">{containment: "/small0/rack0/node0/socket0/core0"}</data>
<data key="props"></data>
<data key="size">1</data>
<data key="type">core</data>
<data key="unit"></data>
</node>
Each edge construct describes a directional relationship between two resource vertices and has one data field. For example, the following describes a relational edge from the socket0 vertex (whose vertex id is n3) to the core0 vertex (n5).
<edge id="e7" source="n3" target="n5">
<data key="esubsystems">{containment: "contains"}</data>
</edge>
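To make the discussion concrete, here is one way a consumer could pull these node and edge constructs into plain data structures using only a standard XML parser (Python's xml.etree shown; the fragment below is abbreviated from the graph above, and the dict layout is just an illustrative choice, not a specified API):

```python
import xml.etree.ElementTree as ET

# Abbreviated fragment of the GraphML above (namespace declarations omitted).
fragment = """
<graph>
  <node id="n3">
    <data key="basename">socket</data>
    <data key="name">socket0</data>
    <data key="paths">{containment: "/small0/rack0/node0/socket0"}</data>
    <data key="type">socket</data>
  </node>
  <node id="n5">
    <data key="basename">core</data>
    <data key="name">core0</data>
    <data key="paths">{containment: "/small0/rack0/node0/socket0/core0"}</data>
    <data key="type">core</data>
  </node>
  <edge id="e7" source="n3" target="n5">
    <data key="esubsystems">{containment: "contains"}</data>
  </edge>
</graph>
"""

root = ET.fromstring(fragment)

# Collect each vertex's data fields into a plain dict keyed by vertex id.
vertices = {
    node.get("id"): {d.get("key"): (d.text or "") for d in node.findall("data")}
    for node in root.findall("node")
}

# Collect edges as (source, target) pairs.
edges = [(e.get("source"), e.get("target")) for e in root.findall("edge")]

print(vertices["n5"]["name"])  # core0
print(edges)                   # [('n3', 'n5')]
```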
Now, my guess is that anything but the subsystems and esubsystems fields would be the baseline R needed by execution systems and other generators. But we should discuss.
In general, I believe an R capable of describing a graph would be the most expressive way to support a wide range of producers and consumers. For example, a nested resource service can simply pass the R from the guest KVS to read_graphml to populate its graph data store before starting the nested scheduling service.
Other generators, like an hwloc adapter, can also easily emit their information in this format. hwloc has no concept of subsystems, however, so the subsystems and esubsystems fields would be empty.
One thing that bothers me a little is that this encodes graph data somewhat redundantly. But that is because the preexisting GraphML writer emits the graph in this manner; it could be compressed to a good degree if we decide to enhance our own writers.
While GraphML makes sense as a resource representation to the scheduler, I think the resources passed to the execution engine on each node (still called Rlocal
Thanks @lipari. I believe that the stated plan is to do containment via a plugin to the IMP and binding at the job-shell level. Both will need adaptors to extract the info they require from R.
@grondo's proposal for Rlocal is up there. How the job shell will access R was discussed yesterday, and there is a PR #114 as a result of that.
BTW, as I remember, hwloc also supports reading from and writing to XML. My guess is that translating the entire R to this XML format should be pretty straightforward.
Sorry, @dongahn. I read the PR 114 discussion of mods to RFC 16 as describing how the GraphML formatted R is represented in the KVS. It is unclear to me where we have specified the work of translating a GraphML R to the Rlocal
To clarify... I think it is asking a lot of the IMP to break a GraphML-formatted R down to Rlocal
We are open to alternatives, of course.
If the node-local resource discovery speed is an issue, I suppose we can add some index information so that node-local information can be found more expediently...
What is your alternative, @lipari?
... translating a GraphML R to the Rlocal form and where the Rlocal's format is defined. I thought I read all the discussion, but I probably missed it.
https://github.com/flux-framework/rfc/issues/109#issuecomment-334306854 would be closest, I think.
Sorry, just catching up on this.
@lipari, some clarification: the execution subsystem creates Rlocal, but it is part of the enclosing instance and therefore can use instance services to create this representation. The exec service doesn't do binding, that is completed by the IMP, and therefore Rlocal will be in a format defined by the IMP, likely something very simple, not GraphML.
109 (comment) would be closest, I think.
Ah, thank you for the reminder.
What is your alternative, @lipari?
@dongahn, you are proposing GraphML to be the form of R. And I agree that is a sound proposal for the reasons you stated. But to be confident that GraphML is the best choice, it behooves us to consider how easily GraphML can be parsed to generate the node-focused Rlocal
This will facilitate the binding the execution engine needs to do, leveraging the power of the hwloc library: https://www.open-mpi.org/projects/hwloc/doc/v1.2/group__hwlocality__cpubinding.php.
The job shell might use hwloc for binding, but the IMP will need to create a container from Rlocal and I don't think the hwloc topology format nor the library itself will be helpful there.
And I was attempting to minimize the complexity of the IMP, if it needs to generate Rlocal.
I think we concluded that the most efficient approach is that Rlocal is input to the IMP, not generated by the IMP.
If it is later decided that the input to the IMP should be R, from which the IMP would generate its own Rlocal, then I would expect the IMP would utilize a library from the enclosing instance to operate on R.
In either case the IMP will always treat R as opaque and should not need to know its specific format.
And I was attempting to minimize the complexity of the IMP, if it needs to generate Rlocal. If this turns out to be no problem for the IMP, then I'm fine with using GraphML to represent R.
I thought you had a good point in that node-local resource discovery from R should be efficient. I am sure we will revisit this as an "optimization" topic as we get down to the implementation...
To summarize the discussion points so far:
From my perspective, gathering a bit more feedback on the GraphML proposal will help.
Thank you for the summary, @dongahn! It seems to me this discussion has bearing on whether to make GraphML be the format of R.
Thanks @dongahn.
It seems to me this discussion has bearing on whether to make GraphML be the format of R.
Yeah, the question is whether GraphML can be the baseline format of R
- The job shell will get R to do binding
The job shell, as well as any sub-instance started by the shell may require R in such a format that it can be parsed and used for binding or as configuration input without a special library from the parent, which may be running as a different user. This makes me think that a simpler "baseline format" may be required. Instance implementations may still use GraphML internally, but they would need a converter to and from the baseline format, with extra information stored in the extensions. At this point I'm not sure otherwise how the job shell and other instances running potentially different versions of Flux could reliably and safely make use of R directly.
Just for argument's sake,
The job shell, as well as any sub-instance started by the shell may require R in such a format that it can be parsed
GraphML is XML, and it can be parsed easily.
and used for binding
The parsed objects can be easily traversed and used for binding.
as configuration input without a special library from the parent, which may be running as a different user.
Again, this is just XML, and its dependency is an XML parser, not a special library from the parent instance.
This makes me think that a simpler "baseline format" may be required.
What do you have in mind?
This makes me think that a simpler "baseline format" may be required. Instance implementations may still use GraphML internally, but they would need a converter to and from the baseline format, with extra information stored in the extensions.
Is the worry having a baseline section and an extension section, such that one can only parse the baseline section for baseline operation?
At this point I'm not sure otherwise how the job shell and other instances running potentially different versions of Flux could reliably and safely make use of R directly.
Again, GraphML is XML, so you don't need a special library from the parent as long as the format is formally specified.
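Just to illustrate that point, here is a sketch of walking the containment relation with nothing beyond the standard-library XML parser (the fragment abbreviates the node0 subtree above; the traversal strategy is an assumption for illustration, not part of any spec):

```python
import xml.etree.ElementTree as ET

# Abbreviated fragment mirroring the node0 subtree from the graph above.
fragment = """
<graph>
  <node id="n2"><data key="name">node0</data></node>
  <node id="n3"><data key="name">socket0</data></node>
  <node id="n5"><data key="name">core0</data></node>
  <node id="n6"><data key="name">core1</data></node>
  <edge source="n2" target="n3"><data key="esubsystems">{containment: "contains"}</data></edge>
  <edge source="n3" target="n5"><data key="esubsystems">{containment: "contains"}</data></edge>
  <edge source="n3" target="n6"><data key="esubsystems">{containment: "contains"}</data></edge>
  <edge source="n3" target="n2"><data key="esubsystems">{containment: "in"}</data></edge>
</graph>
"""

root = ET.fromstring(fragment)
names = {n.get("id"): n.find("data").text for n in root.findall("node")}

# Keep only the downward ("contains") edges so we can walk the tree top-down.
children = {}
for e in root.findall("edge"):
    if '"contains"' in (e.find("data").text or ""):
        children.setdefault(e.get("source"), []).append(e.get("target"))

def descendants(vid):
    """Depth-first enumeration of all vertices contained under vid."""
    out = []
    for child in children.get(vid, []):
        out.append(names[child])
        out.extend(descendants(child))
    return out

print(descendants("n2"))  # ['socket0', 'core0', 'core1']
```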
GraphML is XML, and it can be parsed easily.
Ok, I was worried that the boost::graph library or a special API from the parent instance would be required.
What do you have in mind?
I don't have a good idea. But a simple baseline format could be a YAML representation similar to that used in jobspec. This would cut the library requirements down from XML+YAML to just YAML. This is just one idea.
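Purely as a sketch of what such a YAML baseline might look like (every key name below is invented for illustration, not a proposed format), vertices and edges could be listed explicitly:

```yaml
# Hypothetical YAML encoding of part of the example graph; all field
# names are illustrative assumptions, not a specified format.
vertices:
  - id: n3
    type: socket
    name: socket0
    path: /small0/rack0/node0/socket0
  - id: n5
    type: core
    name: core0
    path: /small0/rack0/node0/socket0/core0
edges:
  - {source: n3, target: n5, relation: contains}
```

The point is only that edges would need first-class constructs that jobspec's YAML does not currently have.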
Is the worry having a baseline section and an extension section, such that one can only parse the baseline section for baseline operation?
Not so much a worry as a way to have a baseline format that can be documented in an RFC and which all flux instances and utilities can understand. Then, for advanced schedulers, etc, the extensions section could be filled with extra information needed for the specific implementation.
Again, GraphML is XML, so you don't need a special library from the parent as long as the format is formally specified.
Ok, understood. I was possibly confused about how parsing of GraphML works. It would still be nice to keep the number of different encodings down (we'll now have json, yaml, and xml), but for now I'm fine with whatever is most efficient for you.
Ok, I was worried that the boost::graph library or a special API from the parent instance would be required.
There should be a number of libraries beyond boost that can do this. Probably good idea to document those. I will do that.
I don't have a good idea. But a simple baseline format could be a YAML representation similar to that used in jobspec. This would cut the library requirements down from XML+YAML to just YAML. This is just one idea.
YAML probably works as well. But we would need to extend the constructs to support edges, etc. I'm not saying this cannot be done, but like I said, if the graph constructs are already there in GraphML -- tested and hardened with existing libraries -- my rationale was: why reinvent those? For jobspec, YAML makes sense because it needs to be more human readable. But R will likely be meant more for "machines" to process than for human users...
Ok, understood. I was possibly confused about how parsing of GraphML works. It would still be nice to keep the number of different encodings down (we'll now have json, yaml, and xml), but for now I'm fine with whatever is most efficient for you.
I wondered about introducing yet another encoding, XML. But then I realized we have already introduced it, since hwloc uses that encoding.
I wondered about introducing yet another encoding, XML. But then I realized we have already introduced it, since hwloc uses that encoding.
Yes, but there are no real users of that in flux as yet. (the xml topology is just passed to other hwloc commands)
YAML probably works as well. But we would need to extend the constructs to support edges, etc. I'm not saying this cannot be done, but like I said, if the graph constructs are already there in GraphML -- tested and hardened with existing libraries -- my rationale was: why reinvent those?
Thanks, good points. I don't want to stall your progress with my minor questions. However it has been useful for my understanding, so thanks!
There is also the benefit that there seem to be many graph viewers that take in GraphML (e.g., Gephi).
It might help me if we could work through how some simple use cases would work using the GraphML above... e.g., finding the correct layout of tasks across R for simple task slot shapes (e.g., 1 core, 1 socket, 1 socket, 1 core, etc.).
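As a toy version of this use case, the sketch below lays out task slots of shape "1 core" or "1 socket" across the cores of the example graph. A reduced shape sequence is used, since the full "1 core, 1 socket, 1 socket, 1 core" sequence would need more cores than the example R contains; the greedy matching policy is an assumption for illustration only:

```python
# Socket-to-core layout mirroring the example graph above.
sockets = {
    "socket0": ["core0", "core1"],
    "socket1": ["core2", "core3"],
}

def layout(slot_shapes):
    """Greedily assign whole sockets or single cores to each requested slot."""
    free = {s: list(cores) for s, cores in sockets.items()}
    slots = []
    for shape in slot_shapes:
        if shape == "socket":
            # Take the first socket with all of its cores still free.
            s = next(s for s, c in free.items() if len(c) == len(sockets[s]))
            slots.append(("socket", s, free.pop(s)))
        elif shape == "core":
            # Take one core from the first socket that still has one.
            s = next(s for s, c in free.items() if c)
            slots.append(("core", s, [free[s].pop(0)]))
    return slots

print(layout(["core", "socket", "core"]))
```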
This issue is being opened to start a discussion on the use cases, API, and/or specification for R as in RFC 15. R is the serialized version of any resource set; it is presumably produced by the serializer described in RFC 4, consumed by the resource service in an instance as configuration, and used by the IMP and job shell to determine the shape of containment and local resource slots.
In essence, the R format will be the way composite resource and resource configuration information will be transmitted to and from instances of Flux.
Ideally, the purpose of this issue is to determine the format of R such that a new RFC could be drafted.
To get the discussion started, here are some high level requirements and use cases for R:
R should act as resource configuration input to an instance, therefore it may be that configuration of even the system instance is written in R spec, or the configuration language (RDL?) generates R. (in fact, one use case might be to directly generate R from hwloc data)
Execution service in an instance needs to be able to generate Rlocal from R for each rank. So given a rank or even generic "resource vertex", there should be a function to generate an Rn from R, where Rn is a hierarchical subset of R.
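A sketch of one possible path-based rule for this, following the earlier suggestion that Rn include the target vertex's ancestors up to the root (the rule and helper function are illustrative proposals, not a specified algorithm; the paths come from the example graph):

```python
# Containment paths drawn from the example graph above.
paths = [
    "/small0",
    "/small0/rack0",
    "/small0/rack0/node0",
    "/small0/rack0/node0/socket0",
    "/small0/rack0/node0/socket0/core0",
    "/small0/rack0/node0/socket1/core2",
]

def rn_for(target_path, all_paths):
    """Keep the target vertex, everything below it, and its ancestors."""
    keep = set()
    for p in all_paths:
        if p == target_path or p.startswith(target_path + "/"):
            keep.add(p)  # the target itself and its hierarchical subset
        elif target_path.startswith(p + "/"):
            keep.add(p)  # an ancestor of the target, up to the root
    return sorted(keep)

print(rn_for("/small0/rack0/node0/socket0", paths))
```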
The containment plugins in the IMP will need to query Rlocal for the list of local resources of given type or types on which the containment plugins operate. For instance, a memory plugin will need to determine the amount and location of RAM contained in Rlocal in order to set up memcg limits. Similarly a Socket/CPU plugin would need to iterate over or query the list of local sockets/cores in Rlocal to add these to the cgroup.
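For illustration, the kind of query a containment plugin might run against an already-parsed Rlocal could look like the following (the in-memory representation and the GiB unit are assumptions for the sketch; only the type and size values come from the example graph, which leaves units unspecified):

```python
# Hypothetical parsed form of an Rlocal covering socket0 of the example graph.
rlocal = [
    {"type": "core",   "name": "core0",   "size": 1},
    {"type": "core",   "name": "core1",   "size": 1},
    {"type": "memory", "name": "memory0", "size": 4},  # per-socket memory pool
]

def total_by_type(vertices, rtype):
    """Total 'size' over all vertices of the requested resource type."""
    return sum(v["size"] for v in vertices if v["type"] == rtype)

# A memory plugin would derive its memcg limit from the memory vertices,
# while a CPU plugin would iterate the cores to populate a cpuset.
mem_limit_gib = total_by_type(rlocal, "memory")
cores = [v["name"] for v in rlocal if v["type"] == "core"]
print(mem_limit_gib, cores)  # 4 ['core0', 'core1']
```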
The job shell will use jobspec+R to determine the local 'task slots' that map to commands in the 'tasks' section.
Dependency management here might get challenging. The IMP is a user of Rn, but we want to ideally eliminate dependencies in the flux-security project on other flux-framework projects. Possible approaches here might include: