codeplaysoftware / standards-proposals

Repository for publicly sharing proposals in various standards groups
Apache License 2.0
27 stars 17 forks source link

CP013: Topology resources awareness of contention (P1795) #109

Open AerialMantis opened 4 years ago

AerialMantis commented 4 years ago

Some feedback was received in the Belfast meeting that it would be useful to identify whether resources within a system topology are contested and be able to discover only the parts of the system topology which are non-contested, perhaps via some kind of flag.

The current design doesn't make any guarantee as to whether the resources reflected in the system topology are uniquely available and uncontested by another part of the application or another process. I would be beneficial to define when users can expect to have uncontested access to resources when it's possible for the implementation to do so and provide a way to only discover resources that are available. Though this might have to be done at a fine-grained level as some resources may not be able to reflect this information and some resources may only be partially contested, for example, a bounded thread pool may take a specific number of threads.

I can think of three different situations where this information could be available when discovering the system topology:

Perhaps this is something which needs to be queryable on a per-resource basis.

Another point to consider here is that whether resources are contested could change dynamically, so it would have to factor in consideration about how the topology is updated and how users are notified of changes.

mhoemmen commented 4 years ago

I can think of two different definitions of "not contested":

  1. "I have unique ownership of the resource; no other processes can use it while I hold it."
  2. "At this moment, and only at this moment, no other processes are using the resource."
mhoemmen commented 4 years ago

I'm thinking, for example, of a GPU with "streams." The first definition above means that no other streams are allowed to exist other than my stream. The second definition means that even if other streams exist, none of them have a kernel currently running on the GPU.

AerialMantis commented 4 years ago

@mhoemmen this is an interesting point, so I was thinking along the lines of 2, though at the granularity of contexts, so "At this moment, and only at this moment, no other processes maintain a context which could use this resource.". Though I think both of these could be useful.

Your other definition (1) is also interesting, that one may be harder for some implementations to support, as it requires a guarantee that no other process with use the resource, so would likely need to be done at the OS level. But that would be very useful.

mhoemmen commented 4 years ago

@AerialMantis wrote:

this is an interesting point, so I was thinking along the lines of 2, though at the granularity of contexts, so "At this moment, and only at this moment, no other processes maintain a context which could use this resource.". Though I think both of these could be useful.

I'm thinking of a specific example that we care about. Suppose I have an application that uses OpenMP. I would like to call an opaque library either inside or outside my OpenMP parallel regions, and I want the library to use any available parallel resources. The library might want to spin up its own threads, using some programming model other than OpenMP. However, there's no interface for me to give the library a nested executor. (I'm imagining a BLAS library -- this relates to our linear algebra proposal a bit, since the typical way vendors would initially implement it is by wrapping their existing BLAS library.)

The library might want to know the following:

  1. Are the hardware resources available to me in contention, right now?
  2. Are the hardware resources available to me always in contention?
  3. Are there other hardware resources I could spin up that aren't in contention?

Answers to (1) and (2) depend on the OpenMP implementation. If it spin-locks, then the answer is usually always yes, even outside of a parallel region. If the OpenMP implementation just dumps work into a common task queue managed by the operating system, then the answers might be yes or no, depending on whether the library is invoked inside of a parallel region, and whether the task queue has enough resources.

For (3), the library would need to walk the system topology and ask if components are "not being used by somebody else," but what that means relates to (1) and (2).

mhoemmen commented 4 years ago

This "contention" thing feels too much like an operating system thing, and not a thing about which C++ can say something meaningful. Many operating systems run multiple processes written in different programming languages (or different versions of the same language) that share hardware. How can C++ reach outside of its process into the operating system and ask it whether other processes are running on those hardware? That almost feels like a security risk. Is a core "contended" if the operating system occasionally runs an I/O process on it?

This also feels too tied to specific hardware. Some CPUs (e.g., those with a "barrel" processor) only run efficiently if many independent threads or processes share them. Does that make them "contended"? What if the number of threads that a barrel processor can serve at once is a hardware implementation detail that users can't see? What if the OS can spin up a coprocessor if the system load is high, but may choose not to do so if the computer is too hot?

In addition, the definition of "contention" is tied to the workload. What if most of the threads are blocked on I/O, leaving the core mostly free to do work? If the tasks are memory bandwidth - bound, then it may be necessary to undersubscribe the cores in order to get good performance.

I still think this "find uncontended resources" feature could be useful inside a process. The BLAS library example above would be immediately useful to us. I have a PhD dissertation by Heidi Pan in mind that nests a sparse QR factorization with BLAS library calls. She solved the problem with a common low-level run-time environment that the different programming models (Pthreads and OpenMP) would need to share. I think the C++ way of solving this problem would be to pass in a nested executor, but C++ still needs to interoperate with opaque libraries that aren't executor-aware.

AerialMantis commented 4 years ago

Sorry, I'm so late in replying to this.

This "contention" thing feels too much like an operating system thing, and not a thing about which C++ can say something meaningful. Many operating systems run multiple processes written in different programming languages (or different versions of the same language) that share hardware. How can C++ reach outside of its process into the operating system and ask it whether other processes are running on those hardware? That almost feels like a security risk. Is a core "contended" if the operating system occasionally runs an I/O process on it?

I agree, I think it makes sense for us to limit the definition for contention to the current process. Going beyond the this would require support at the OS level and as you mentioned it may not always be possible to give an accurate answer and it could introduce potential security risks.

So this just leaves representing contention between different C++ executors and between a C++ executor and a third-party programming model.

I'm thinking of a specific example that we care about. Suppose I have an application that uses OpenMP. I would like to call an opaque library either inside or outside my OpenMP parallel regions, and I want the library to use any available parallel resources. The library might want to spin up its own threads, using some programming model other than OpenMP. However, there's no interface for me to give the library a nested executor. (I'm imagining a BLAS library -- this relates to our linear algebra proposal a bit, since the typical way vendors would initially implement it is by wrapping their existing BLAS library.)

...

I still think this "find uncontended resources" feature could be useful inside a process. The BLAS library example above would be immediately useful to us. I have a PhD dissertation by Heidi Pan in mind that nests a sparse QR factorization with BLAS library calls. She solved the problem with a common low-level run-time environment that the different programming models (Pthreads and OpenMP) would need to share. I think the C++ way of solving this problem would be to pass in a nested executor, but C++ still needs to interoperate with opaque libraries that aren't executor-aware.

I think this is a really good example of where this kind of property could be useful, and I think the distinction you made of whether resources are in contention temporarily or permanently in a given scope and the ability to request resources that are not in contention are really important.

Though I expect how much of this information is available will depend on whether all tasks are performed by C++ executors or whether there are tasks that are being performed in a third-party programming model.

The former should be relatively straight forward for an implementation to provide, even for different underlying programming models, via executors who are aware of each other's execution contexts or executors which can provide sub-executors. Ideally, the system topology representation that we propose could be a common layer to these various different kinds of executors (like the one in the PhD dissertation you described) in order to facilitate this across C++ in a generic way.

The latter is more complicated, though also very important. I think in this case we have to assume that the third-party programming model is completely unaware of the executors, in which case any information about contention must come from querying third-party programming models. I expect not all programming models will be able to provide this information and if it can there's no way an executor can guarantee that another programming model it's unaware of isn't contesting its resources. So perhaps this level of query would have to be an optional hint only?

This also feels too tied to specific hardware. Some CPUs (e.g., those with a "barrel" processor) only run efficiently if many independent threads or processes share them. Does that make them "contended"? What if the number of threads that a barrel processor can serve at once is a hardware implementation detail that users can't see? What if the OS can spin up a coprocessor if the system load is high, but may choose not to do so if the computer is too hot?

This case is very interesting, I think as long as our definition of contention does not exceed the current process, we could define it terms of the executor's capacity to invoke work on the resource, so I think the specific state of the hardware resources could be abstracted.

So in the case of a barrel processor, perhaps contention could be inferred by the recommended number of threads that should be used at once? In the case that the OS dynamically spins up a thread to manage a high work-load, I think this could be managed by the resources transparently, though I imagine this may have to take concurrency from another resource.

In addition, the definition of "contention" is tied to the workload. What if most of the threads are blocked on I/O, leaving the core mostly free to do work? If the tasks are memory bandwidth - bound, then it may be necessary to undersubscribe the cores in order to get good performance.

This is a good point, the definition of contention should incorporate I/O as well as dedicated compute tasks, though when querying the contention of resources it may not always be possible to know which threads are being used for what.

mhoemmen commented 4 years ago

@AerialMantis wrote:

Though I expect how much of this information is available will depend on whether all tasks are performed by C++ executors or whether there are tasks that are being performed in a third-party programming model.

I see integration with third-party programming models as QoI. Vendors can and do use a common run-time environment for different languages and parallel programming models.

Ideally, the system topology representation that we propose could be a common layer to these various different kinds of executors (like the one in the PhD dissertation you described) in order to facilitate this across C++ in a generic way.

That's a good start, but I think cooperative sharing of resources (as in that dissertation) really calls for a common pool of resources. High-quality implementations will want to do that, though, and they will want insight into system topology regardless, so perhaps we don't have to worry about specifying a common resource pool.

AerialMantis commented 4 years ago

I see integration with third-party programming models as QoI. Vendors can and do use a common run-time environment for different languages and parallel programming models.

That's true, perhaps I am being too conservative in my definition here. My main concern was avoiding false positives, i.e. resources claiming non-contention when they in fact are, but I think this is impossible to fully guarantee. Maybe this should be QoI, have it be a hint which reflects a best estimate, so doesn't guarantee non-contention but tells you to the best of its knowledge?

That's a good start, but I think cooperative sharing of resources (as in that dissertation) really calls for a common pool of resources. High-quality implementations will want to do that, though, and they will want insight into system topology regardless, so perhaps we don't have to worry about specifying a common resource pool.

I agree, I think this is perhaps a QoI issue as well, as long as we provide the layer which allows vendors to reflect the resources within the system, implementations can choose to introduce common resource pools on top to provide better cooperation between different execution contexts.