KhronosGroup / Vulkan-Docs

The Vulkan API Specification and related tools

Mechanism for querying physical queue topology? #569

Open haasn opened 7 years ago

haasn commented 7 years ago

Many drivers seem to have wildly different interpretations of what a VkQueue represents. In some drivers (e.g. nvidia on maxwell), all VkQueues are just time sliced on the same execution unit, and there is zero performance benefit from using more than one (in my testing). Other drivers (e.g. amdgpu) map their queues to underlying compute pipes in various ways. The latter is especially obvious when observing the state of the compute queues in gpuvis, where they will be labelled something like comp_1.0.0, comp_1.0.1, comp_1.1.0 etc. indicating the underlying topology (i.e. which queues map to which compute pipe).

Due to this confusion, and the wildly varying performance characteristics that can be obtained as a result of using these queues in different ways, it might be a good idea to add some sort of mechanism for querying the underlying physical layout of queues - so applications can make sure to not over-submit work that won't run in parallel either way. (To avoid losses due to what I assume are expensive context switches, which appear to make quite a difference on e.g. nvidia)

I'm not sure exactly what the API would look like, but the basic question I'd be hoping to answer in my applications is “which queues can run concurrently with which other queues?”, perhaps broken down by command type (e.g. compute, graphics, transfer).

haasn commented 7 years ago

I also noticed that the mapping of VkQueue index to compute pipe ID is not static either; it seems to vary, probably based on how many queues the application requests. So an appropriate API for this might simply be adding a second “queueCount”-like attribute that tells me how many VkQueues I can request from the queue family and still have them map to separate parallel execution units.
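A sketch of what such a second attribute and the app-side decision could look like. Every name here is hypothetical, not existing Vulkan API:

```c
#include <stdint.h>

/* Hypothetical extension struct -- a sketch of the proposed second
 * "queueCount"-like attribute, NOT part of Vulkan. */
typedef struct HypotheticalQueueFamilyParallelismProperties {
    uint32_t queueCount;        /* as in VkQueueFamilyProperties today */
    uint32_t optimalQueueCount; /* queues mapping to distinct execution units */
} HypotheticalQueueFamilyParallelismProperties;

/* A throughput-oriented app would request no more queues than can
 * actually run in parallel. */
static uint32_t queues_to_request(
    const HypotheticalQueueFamilyParallelismProperties *p)
{
    return p->optimalQueueCount < p->queueCount ? p->optimalQueueCount
                                                : p->queueCount;
}
```

For the amdgpu case described below, such a query would report `queueCount = 8, optimalQueueCount = 4`, and the helper would return 4.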

haasn commented 7 years ago

Perhaps as an alternative, what could also make sense is specifying that such “multiple parallel pipes” should show up as separate queue families, rather than as separate VkQueues of the same family.

In that universe, AMD devices (which have 4 underlying compute pipes, each with 8 different queues) could expose 4 separate queue families each with 8 queues; mapping to comp_1.0.0 - comp_1.0.7 respectively for the first QF, comp_1.1.0 - comp_1.1.7 etc. for the second QF, and so on.

In this universe, the semantics given to the programmer would be that “using multiple queue families can improve performance via true parallel execution, using multiple queues of the same queue family will just time-slice”. But I'm not sure if that's a realistic assumption to make for all current and future GPUs.

NicolBolas commented 7 years ago

When you say that NVIDIA only does time slicing for its queues, are you saying that this is for all of its queues, or just queues within a queue family?

and there is zero performance benefit from using more than one (in my testing).

That rather depends on what you're doing. And more importantly, for how responsive you want your program and system to be. Responsiveness is, after all, why threading and multitasking existed on PCs long before multi-core CPUs were common.

Your point is valid: if you're doing straight, in-order linear processing, with all processes ultimately terminating at a single image to be displayed, multiple queues in such a system will not help you. But that's not everyone's needs.

the basic question I'd be hoping to answer in my applications are “which queues can run concurrently with which other queues?”

I don't think that's the best question to be asking. By the time you have an actual VkQueue object, it's too late to start deciding how to apportion queues, since that has to be done at device creation time.

The questions I think we need to ask are:

  1. Do queues from queue family X run entirely independently from queues in queue family Y? Or perhaps from all other queue families?

  2. Do queues within queue family X run entirely independently from each other?

I think these questions should be focused specifically on execution, not on resource contention. To "execute entirely independently" means that the two queues/families will not task switch between each other; they will execute concurrently.

Resource contention is too muddy of a topic for this sort of thing; trying to resolve that can easily lead to worse performance. There are too many contention patterns for a generic query to capture; the only effective solution is to detect specific hardware and prepare a scheme tailored to it.

So an appropriate API for this might simply be adding a second “queueCount”-like attribute that tells me how many VkQueues I can request from the queue family and still have them map to separate parallel execution units.

You assume that this number will ever be anything other than 1 or queueCount itself. It seems to me that if multiple queues within a family can execute entirely independently, IHVs wouldn't bother providing a larger queueCount than the independent execution limit. Indeed, I would rather force IHVs into doing it this way. Either a queue family provides independent execution of all of its queues, or it provides independent execution for none of them.

So I think a boolean would be sufficient.
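Under that all-or-nothing rule, the boolean property and the resulting app-side decision could be sketched like this (all names hypothetical; a local typedef stands in for VkBool32):

```c
#include <stdint.h>

typedef uint32_t Bool32; /* local stand-in for VkBool32 */

/* Hypothetical per-family property -- a sketch of the boolean proposed
 * above, not actual Vulkan API. */
typedef struct HypotheticalQueueFamilyExecutionProperties {
    uint32_t queueCount;
    Bool32   independentExecution; /* all queues concurrent, or none are */
} HypotheticalQueueFamilyExecutionProperties;

/* Under the all-or-nothing rule, a throughput-oriented app would use
 * every queue if they execute independently, and just one if they
 * merely time-slice. */
static uint32_t useful_queue_count(
    const HypotheticalQueueFamilyExecutionProperties *p)
{
    return p->independentExecution ? p->queueCount : 1;
}
```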

haasn commented 7 years ago

When you say that NVIDIA only does time slicing for its queues, are you saying that this is for all of its queues, or just queues within a queue family?

Within a queue family. A typical nvidia driver exposes a single GRAPHICS | COMPUTE | TRANSFER | SPARSE queue family with 16 VkQueues[1]. The explanation I've gotten via IRC from various people is that these VkQueues are time-sliced, rather than truly concurrent. (i.e. there's (apparently?) no way to make one VkQueue do useful computations while the other one is waiting for memory operations)

[1] And an extra TRANSFER queue family with 1 VkQueue.
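For reference, all the API exposes today about this layout is per-family flags and a queue count, nothing about whether the 16 queues time-slice or run concurrently. A minimal sketch, using simplified local stand-ins for the vulkan.h types (the bit values match the real VK_QUEUE_* flags) and mock data mirroring the NVIDIA-like layout just described:

```c
#include <stdint.h>

/* Simplified local stand-ins mirroring <vulkan/vulkan.h>; the bit
 * values match the real VK_QUEUE_* flags. */
typedef uint32_t QueueFlags;
enum {
    QUEUE_GRAPHICS_BIT       = 0x1,
    QUEUE_COMPUTE_BIT        = 0x2,
    QUEUE_TRANSFER_BIT       = 0x4,
    QUEUE_SPARSE_BINDING_BIT = 0x8,
};
typedef struct { QueueFlags queueFlags; uint32_t queueCount; } FamilyProps;

/* Mock of the layout described above: one do-everything family with 16
 * queues, plus a 1-queue transfer-only family. */
static const FamilyProps nvidia_like[] = {
    { QUEUE_GRAPHICS_BIT | QUEUE_COMPUTE_BIT |
      QUEUE_TRANSFER_BIT | QUEUE_SPARSE_BINDING_BIT, 16 },
    { QUEUE_TRANSFER_BIT, 1 },
};

/* Flags and counts are the whole story today -- no topology. */
static uint32_t total_queues(const FamilyProps *f, uint32_t n)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < n; i++)
        sum += f[i].queueCount;
    return sum;
}
```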

That rather depends on what you're doing. And more importantly, for how responsive you want your program and system to be. Responsiveness is, after all, why threading and multitasking existed on PCs long before multi-core CPUs were common.

That is true; time-slicing + queue priorities in theory allow you to do soft-preemption to make sure your “get results onto the screen” commands run with a higher priority than your “pre-compute some work for the future” commands. I should perhaps clarify that this is irrelevant in my application: I have no latency guarantees whatsoever, and raw throughput is the only thing that matters.

I think these questions should be focused specifically on execution, not on resource contention. To "execute entirely independently" means that the two queue/families will not task switch between each other. That they will concurrently execute.

I think it's important to point out that “task switch” and “concurrently execute” are fuzzy descriptions, also. For example, what AMD devices seem to do internally is schedule their 4 compute pipes onto the CUs in a hyper-threading like manner: Submitting the same frame twice to different pipes won't give you 2x throughput, but it would allow the underlying hardware to perform useful compute work from one pipe while waiting for memory operations in another. So even this wording can be ambiguous.

You assume that this number will ever be either 1 or queueCount itself.

No, I have a more concrete example in mind: amdgpu hardware supports 4 compute pipes, but it exposes 8 VkQueues. So it would signal this as ‘4’ to indicate that requesting 4 VkQueues gives you the optimal mapping. To clarify, this is what the situation looks like on recent AMD hardware when you request 8 VkQueues: they end up mapping to the 4 underlying compute pipes in an (N / 2), (N % 2) sort of fashion (N being the index of the VkQueue). Optimal throughput is achieved by requesting 8 queues and then only using the even ones (which map to separate compute pipes).

(gpuvis screenshot showing the VkQueue-to-compute-pipe mapping)
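The empirical mapping described above can be written down directly. This is a sketch of the observed amdgpu behaviour, not something any API guarantees:

```c
#include <stdint.h>

/* Empirical amdgpu mapping described above when 8 VkQueues are
 * requested: VkQueue N lands on compute pipe N / 2, hardware queue
 * slot N % 2 within that pipe. */
static uint32_t pipe_of(uint32_t n) { return n / 2; }
static uint32_t slot_of(uint32_t n) { return n % 2; }

/* Using only the even-indexed queues therefore touches each of the
 * four pipes exactly once -- hence the throughput trick above. */
static int even_queues_hit_distinct_pipes(void)
{
    int seen[4] = {0};
    for (uint32_t n = 0; n < 8; n += 2) {
        if (seen[pipe_of(n)])
            return 0; /* collision: two queues on one pipe */
        seen[pipe_of(n)] = 1;
    }
    return 1;
}
```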

Indeed, I would rather force IHVs into doing it this way. Either a queue family provides independent execution of all of its queues, or it provides independent execution for none of them.

That would be a valid approach, but I think this may lose the useful property of VkQueues with explicit priorities being able to pre-empt each other, which, as you say, is important for latency-critical applications. Either way, I would be happy with this approach, I think.

haasn commented 7 years ago

Optimal throughput is achieved by requesting 8 queues and then only using the even ones (which map to separate compute pipes).

It's worth pointing out that apparently the kernel devs are willing to change this so that requesting 4 VkQueues will always give you 4 separate compute pipes. Which means an “optimalQueueCount” would definitely be a sufficient (if perhaps not necessary) solution to this issue.

krOoze commented 5 years ago

@haasn I am starting to see it your way. The queue families are too abstract and give zero information about how they would behave. Even the drivers seem confused about what they should report.

Even imperfect information is probably better than nothing here.

There also seems to be a collision of several differing concerns with the queues (and families):

  1. From how many threads the driver can receive work with no sync (or advantageous sync the user cannot achieve himself)

  2. How many GPU HW Compute Units can independently receive and execute work

  3. Whether the goal is to time-slice (in order to not starve any queue), or to maximize throughput (by occasionally filling an empty Compute Unit with work from a different queue)

Also, neither the API nor the spec tells which things should be avoided:

  1. Using the async compute and the compute on the graphics+compute family at the same time is probably bad

  2. Using the Transfer queue of the graphics family is probably very bad for host transfers

  3. Using the dedicated transfer queue is probably very bad for intra-device transfer ops

  4. Using multiple graphics queues on current HW is probably nonsense
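These rules of thumb (empirical and vendor-dependent) could be encoded as a simple family-selection heuristic. The flag values match the real VK_QUEUE_* bits; everything else here is illustrative:

```c
#include <stdint.h>

typedef uint32_t QueueFlags; /* stand-in for VkQueueFlags */
enum { QF_GRAPHICS = 0x1, QF_COMPUTE = 0x2, QF_TRANSFER = 0x4 };

typedef enum { COPY_HOST_DEVICE, COPY_DEVICE_LOCAL } CopyKind;

/* Heuristic per the rules above: host<->device copies go to a dedicated
 * transfer (DMA) family; device-local copies go to a transfer-capable
 * graphics/compute family. Returns a family index, or -1 if no family
 * matches the preference (caller falls back to any transfer family). */
static int pick_family(const QueueFlags *families, int count, CopyKind kind)
{
    for (int i = 0; i < count; i++) {
        QueueFlags f = families[i];
        if (!(f & QF_TRANSFER))
            continue;
        int dedicated = !(f & (QF_GRAPHICS | QF_COMPUTE));
        if (kind == COPY_HOST_DEVICE && dedicated)
            return i;
        if (kind == COPY_DEVICE_LOCAL && !dedicated)
            return i;
    }
    return -1;
}
```

With the common layout of one graphics+compute+transfer family plus one transfer-only family, this routes staging uploads to the DMA family and on-device copies to the big family.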

Definitely feels like something that should be figured out in next versions of Vulkan. This currently feels neither low-level, nor explicit. Has this been given some thought within Khronos?

NicolBolas commented 5 years ago

@krOoze I don't think you finished the sentence "From how many threads the driver can receive work with no"

I'm also curious as to where you obtained your list of "which things should be avoided". Do you have experimental evidence of these rules? I'm especially curious about the transfer rules. Intra-device transfers are particularly useful for staging textures, and I'm not sure why it would be best to use the transfer feature of a graphics queue instead of a dedicated transfer queue to do staging (since you're copying from device memory to device memory).

I can guess at least where you got the fourth rule from.

krOoze commented 5 years ago

@NicolBolas Thank you. I meant "no sync (or advantageous sync)".

I got it from what I remember of (often vendor-specific) materials. I will look it up for you in a moment.

Point is, even if I was wrong, the spec allows a GPU to act in any weird way it chooses. The API gives exactly zero information about the queues. I can make assumptions and assertions and experiment. But it is one thing to squeeze out the last 10–20 % of performance by knowing your HW; it is another to have to know your HW for even the basic operation of the GPU. Consider that the spec does not really even guarantee that each queue can feed the whole device (though that is one extra-specification assumption we commonly make).

krOoze commented 5 years ago

The citations:

ad 1: https://developer.nvidia.com/dx12-dos-and-donts

Don’t overlap compute work on the 3D queue with compute work on a dedicated asynchronous compute queue

Plus, if the dedicated compute family reports C queues and the graphics family reports G queues, I find it dubious the driver really wants us to use all C+G queues. I would assume they get serialized to only C queues anyway, or some other implicit nonsense happens.

ad 2: https://youtu.be/ERCxOaKr8Cw?t=2473

The Copy Engine is one magnitude slower than if you run it [transfer within GPU] over the Graphics or Compute engines.

ad 3: You will have to take my word for it that async transfer is better than using a Graphics engine that could be doing other work. Actually it does not even make sense otherwise, and my guess is the driver would use the Copy/DMA engine anyway.

ad 4: https://gpuopen.com/concurrent-execution-asynchronous-queues/

GCN hardware contains a single geometry frontend, so no additional performance will be gained by creating multiple direct queues in DirectX 12. Any command lists scheduled to a direct queue will get serialized onto the same hardware queue.

Which is not a problem in Vulkan, where AMD reports only one queue. Though NV I think reports 16, but I think it also only has one "geometry frontend" (?), and even a single queue would saturate it and make it a bottleneck.

janekb04 commented 4 years ago

I would suggest simply abandoning queue families altogether (they would still be available for backwards compatibility), for reasons stated previously:

  1. They do not map to hardware at all (and even if they do, it's just a choice of the implementation that is not exposed at all)
  2. They are not explicit and low level, but a driver managed scheduling system instead
  3. They provide no information whatsoever about: a) how many queues should actually be created, b) whether queues within a family compete for resources, c) whether families are independent from each other
  4. Different drivers interpret what families and queues are differently and implement them differently
janekb04 commented 4 years ago

I think that instead of exposing abstract "queue families", real hardware should be exposed. I would imagine it like this:

A new type of object should be added: a VkPhysicalDeviceEngine that represents a physical part of the hardware. (Alternatively it could be a VkDeviceEngine if we want to keep multiple logical devices per physical device, though the changes proposed here make it no longer necessary to have more than one logical device.) Examples include: the main graphics and compute engine, (multiple) Copy/DMA engines, maybe Video Encoding/Decoding engines, and anything else that the hardware actually, physically has.

vkEnumeratePhysicalDeviceEngines(physicalDevice, count, pEngines) would be used to query the number of engines and information about each of them. Each physical component should be a separate engine. As such, if a device has two Copy/DMA engines, both should be represented by a separate VkPhysicalDeviceEngine object instead of being a single object with a count of two. I imagine that the VkPhysicalDeviceEngine would either contain or allow for querying the information about:

  1. the type of the engine (what type of physical hardware it represents - compute cores, copy engines etc.)
  2. the capabilities of the engine (what kinds of operations could be performed by it — e.g. can a copy engine be used to copy data intra-device, or between host and device)
  3. any other information relevant to hardware like number of cores, bandwidth, etc.
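The proposal could be sketched as follows. Every type and function here is hypothetical; hard-coded mock data stands in for the proposed vkEnumeratePhysicalDeviceEngines:

```c
#include <stdint.h>

/* Hypothetical sketch of the proposed engine objects -- none of this
 * exists in Vulkan today. */
typedef enum {
    ENGINE_GRAPHICS_COMPUTE,
    ENGINE_COPY_DMA,
    ENGINE_VIDEO_DECODE,
} EngineType;

typedef struct {
    EngineType type;
    uint32_t   capabilities; /* bitmask of supported operations */
} PhysicalDeviceEngine;

/* Mock enumeration result: each physical unit is its own engine, so a
 * device with two Copy/DMA engines reports two separate objects. */
static const PhysicalDeviceEngine mock_engines[] = {
    { ENGINE_GRAPHICS_COMPUTE, 0 },
    { ENGINE_COPY_DMA, 0 },
    { ENGINE_COPY_DMA, 0 },
};
static const uint32_t mock_engine_count =
    sizeof(mock_engines) / sizeof(mock_engines[0]);

/* An application picks the engines it actually needs by type. */
static uint32_t count_engines(EngineType t)
{
    uint32_t n = 0;
    for (uint32_t i = 0; i < mock_engine_count; i++)
        if (mock_engines[i].type == t)
            n++;
    return n;
}
```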
janekb04 commented 4 years ago

VkQueues would be created from a VkPhysicalDeviceEngine instead of a queue family. In detail:

  1. It would be guaranteed that creating a single queue from an engine can fully saturate it (use it to its full potential), just like a single thread can fully use a single CPU core if it is bound to it by setting its affinity.
  2. The application could choose which parts of the hardware it actually needs and use only those. For example, this could decrease power consumption, as unneeded hardware would remain "dormant".
  3. It would be possible to create multiple queues per engine. There would be no limitation on how many of them could be created (or the limit would be very high). This would allow using a given engine in "multitasking" mode, where the driver would schedule the work on the engine similarly to how a single-core CPU can run multiple processes that seem to execute concurrently.
  4. Queues would be created and destroyed dynamically (after physical and logical device creation), just like normal CPU threads are. After all, they would just be an abstract concept used to schedule work.
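Point 4 could look roughly like this. The types and functions are entirely hypothetical, with a toy counter standing in for whatever scheduling state the driver would keep:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical: queues created from an engine at any time, like
 * threads. None of these types or functions exist in Vulkan today. */
typedef struct Engine { uint32_t liveQueues; } Engine;
typedef struct Queue  { Engine *engine; } Queue;

/* Sketch of a vkCreateQueue analogue: no device-creation-time
 * reservation, no fixed queue count. */
static Queue *create_queue(Engine *e)
{
    Queue *q = malloc(sizeof *q);
    if (!q)
        return NULL;
    q->engine = e;
    e->liveQueues++; /* driver would "multitask" among these */
    return q;
}

/* Sketch of a vkDestroyQueue analogue. */
static void destroy_queue(Queue *q)
{
    q->engine->liveQueues--;
    free(q);
}
```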
haasn commented 4 years ago

What I like about ^ is that it avoids altogether the need to fix VkQueues at device creation time, which can create hurdles and unnecessary API overhead when attempting to share a VkDevice across multiple components/libraries.

haasn commented 4 years ago

That being said, it does raise the obvious question of what will happen to the 'unified graphics+compute' queue family. Shouldn't e.g. a modern AMD GPU's topology look more like:

Because even the 'unified' graphics+compute family currently schedules compute operations onto the same four compute pipes, right?

Therefore your reasoning would dictate they be considered separate engines, with no 'unified' engine for both. This mandates the use of cross-queue synchronization.

janekb04 commented 4 years ago

I opened a new issue about this, as I think it is big enough of a change.