KhronosGroup / Vulkan-Docs

The Vulkan API Specification and related tools

IOCP-style queue interface is more appropriate #508

Open warp-10 opened 7 years ago

warp-10 commented 7 years ago

REVISION (2017-05-27)

After some consideration, and given that I can't seem to find the other post right now, I believe an IOCP-style interface would be more efficient, not to mention more friendly to threaded applications. This will require some fundamental changes, and two new entry points.

VkResult vkQueueCompletionSubmit(
    VkQueue queue,
    uint32_t submitPtrCount,
    VkSubmitInfo **ppSubmits // Array of pointers - each pops out one-at-a-time, or in batches
    );

VkResult vkQueueCompletionWait(
    VkQueue queue,
    uint32_t *pSubmitPtrCount,    // input is array size, output is popped size
    VkSubmitInfo **ppSubmits,    // no guarantee on order, and one not needed!
    uint64_t timeout
    );

This allows the application to provide pointers to VkSubmitInfo structures, possibly (most likely) wrapped inside other structures, to the driver without having to sort and sift many different arrays of handles.

Implications here include that all the VkSubmitInfo structures, and everything they point to, must remain valid during their trip through the driver.

The idea doesn't specifically require VkSubmitInfo, and perhaps an API-specific structure would be more appropriate. This was just a quick sketch to illustrate the idea.
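
To make the intended flow concrete, here is a minimal usage sketch. The two entry points are the hypothetical ones proposed above, and the CbSubmission wrapper, the helper functions, and the return-code convention of vkQueueCompletionWait are illustrative assumptions rather than anything that exists today:

// Illustrative wrapper: the driver only ever sees the embedded VkSubmitInfo,
// but the application gets its whole struct back from the pointer it receives.
struct CbSubmission
{
    VkSubmitInfo submit;    // must remain valid until it pops out of the completion end
    void        *userData;  // per-submit application state (pools, sync info, ...)
};

// Push a batch of submissions (hypothetical entry point).
void kickoff(VkQueue queue, CbSubmission *subs, uint32_t count)
{
    std::vector<VkSubmitInfo*> ptrs(count);
    for (uint32_t i = 0; i < count; ++i)
        ptrs[i] = &subs[i].submit;
    vkQueueCompletionSubmit(queue, count, ptrs.data());
}

// Reap completed submissions in whatever order they finish (hypothetical entry point).
void reap(VkQueue queue)
{
    VkSubmitInfo *done[16];
    uint32_t doneCount = 16;   // in: array capacity, out: number popped
    if (vkQueueCompletionWait(queue, &doneCount, done, UINT64_MAX) == VK_SUCCESS)
    {
        for (uint32_t i = 0; i < doneCount; ++i)
        {
            // submit is the first member, so the cast recovers the wrapper exactly.
            auto *sub = reinterpret_cast<CbSubmission*>(done[i]);
            (void)sub->userData;  // recycle command buffers, return pools, etc.
        }
    }
}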

ORIGINAL (Titled "Strided APIs" - a.k.a wonky select-style use of fences)

I find myself doing a lot of this:

std::vector<VkCommandPool> returnPools;       // allocation hazard and associated overhead
std::vector<VkCommandBuffer> returnBuffers;   // allocation hazard and associated overhead
std::vector<int> mapping;                     // allocation hazard and associated overhead

// ... big fugly select loop over fence handles ...

// sort buffers by their destination pool
assert(returnPools.size() == returnBuffers.size());
mapping.resize(returnPools.size());
auto itr = mapping.begin();
const auto end = mapping.end();
uint32_t value = 0;
while(itr != end){ *itr++ = value++; }
itr = mapping.begin();
std::sort(itr, end, [&](auto a, auto b)
{ return returnPools[a] < returnPools[b]; });

std::vector<VkCommandPool> sortedPools;
std::vector<VkCommandBuffer> sortedBuffers;
for(auto index : mapping){
    sortedPools.push_back(returnPools[index]);
    sortedBuffers.push_back(returnBuffers[index]);
}

auto poolItr = sortedPools.begin();
auto poolEnd = sortedPools.end();
auto bfrItr = sortedBuffers.begin();
while(poolItr != poolEnd){
    auto pool = *poolItr;
    uint32_t count = 1;
    while(++poolItr != poolEnd && pool == *poolItr){ ++count; }

    // ...omitting pool sync. code

    vkFreeCommandBuffers(device, pool, count, &*bfrItr);
    bfrItr += count;
}

When it would be nice to do this:

struct BufferReturn
{
    VkCommandPool pool;
    VkCommandBuffer buffer;
};

std::vector<BufferReturn> returns;

// ... very long code here ...

auto itr = returns.begin();
const auto end = returns.end();
std::sort(itr, end, [&](auto &a, auto &b)
{ return a.pool < b.pool; });

while(itr != end){
    auto pool = itr->pool;
    auto itr0 = itr;
    while(++itr != end && pool == itr->pool){}
    // ...also omitting pool sync. code
    vkFreeCommandBuffersStrided( device, pool,
        (uint32_t)(itr - itr0),
        &itr0->buffer,
        sizeof(BufferReturn) );
}

Where:

vkFreeCommandBuffersStrided(
    VkDevice device,
    VkCommandPool pool,
    uint32_t bufferCount,
    VkCommandBuffer *data,
    size_t stride ); // <- offset, in bytes, between successive buffer handles

TL;DR

In order to minimize API calls, it is best to use them in vectored form. That way, even if an API is slow, the number of invocations is minimized.

Unfortunately, they all assume we can just allocate big homogeneous arrays of handles on the spot.

We will most likely end up putting handles inside structures that get passed around instead of the handles themselves, which could also include the data needed for proper synchronization.

Strided APIs would help immensely by reducing the number of distinct collections that need to be managed solely for the purpose of interfacing with the driver.

krOoze commented 7 years ago

You can achieve the same thing outside Vulkan. It is also a bit evil to be unnecessarily messing with raw bytes in C++.

What about wrapping it in something like:

void myFreeCB( VkDevice device, VkCommandPool commandPool, const std::vector<BufferReturn>& myCbs ){
    std::vector<VkCommandBuffer> cbs( myCbs.size() );
    const auto unwrapOp = [](const BufferReturn& br) -> VkCommandBuffer{ return br.buffer; };
    std::transform( myCbs.begin(), myCbs.end(), cbs.begin(), unwrapOp );
    vkFreeCommandBuffers( device, commandPool, (uint32_t)cbs.size(), cbs.data() );
}
ShabbyX commented 7 years ago

Having a packed array of stuff has the benefit of better cache usage. Why not something like this?

struct BufferReturn
{
    size_t pool;
    size_t buffer;
};
std::vector<VkCommandPool> pools;
std::vector<VkCommandBuffer> buffers;
std::vector<BufferReturn> returns;
krOoze commented 7 years ago

@ShabbyX I don't see the cache benefit. You introduced two new ints which are exactly the same layout as the original. And now you need to dereference with each use of BufferReturn contents.

Not that performance optimization is really needed here, but IMO your code does not improve other qualities, like readability, either.

ShabbyX commented 7 years ago

@krOoze Looking deeper into the OP's code, you are right; this doesn't improve his life much. Perhaps something like this might, though (depending on whether it fits the rest of his use cases):

struct CommandPool
{
    size_t pool;                        // index into vk_pools
    size_t buffers_start, buffers_end;  // range of indices into vk_buffers
};
std::vector<VkCommandPool> vk_pools;
std::vector<VkCommandBuffer> vk_buffers;
std::vector<CommandPool> pools;

The point being that you don't necessarily need strided versions of the API to make the code simpler.
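
As a hedged sketch (assuming device is in scope and that vk_buffers is kept grouped by pool, as the index ranges imply), the per-pool free then reduces to a loop over the existing entry point:

for (const CommandPool &p : pools)
{
    // ...pool synchronization omitted, as in the snippets above
    vkFreeCommandBuffers(device,
                         vk_pools[p.pool],
                         (uint32_t)(p.buffers_end - p.buffers_start),
                         &vk_buffers[p.buffers_start]);
}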

Regarding cache usage and why I made the original suggestion, there's a growing tendency to do this (I don't know the name of it) at least in the game industry:

Instead of:

struct Something
{
    Property1 one;
    Property2 two;
};
std::vector<Something> stuff;

for (auto &s: stuff)
    process_property_one(s.one);
for (auto &s: stuff)
    process_property_two(s.two);

Do this:

struct Something
{
    size_t one;
    size_t two;
};
std::vector<Property1> ones;
std::vector<Property2> twos;
std::vector<Something> stuff;

for (auto &one: ones)
    process_property_one(one);
for (auto &two: twos)
    process_property_two(two);

The first method comes from object-oriented design, which in a craze used to be forced onto everything. The second one very much violates object orientation, but it is considerably more efficient. This is where the cache benefit comes into play: as you process each property one by one, the properties themselves are spatially close.

There might not be a noticeable difference in the OP's particular case (since there aren't that many command buffers anyway, compared to e.g. game objects, where the above optimization does make a difference). However, method 2 has the benefit of not needing "a strided API" to support a less-than-optimal use case.

warp-10 commented 7 years ago

@krOoze The idea is to avoid additional allocations that need to be kept synchronized. As far as caching goes, the code (a pseudo-snippet from a "pop thread") assumes the cache was nuked due to waking up on a condition variable after rescheduling, and possibly after every implied call to vkWaitForFences.

Having to resize multiple parallel std::vector<>s is just as much a culprit of heap fragmentation as large-then-small-then-realloc patterns (moreover, std::vector<POD> isn't guaranteed to use realloc, and is much more likely to fail to grow for this reason). While the compiler may decide to optimize a std::vector<> to use stack space instead, that would change the risk from OOM to a stack overflow (I would turn such optimizations off anyway - they only work for very specific patterns, and SOs are something from which you generally cannot recover).

I must admit, however, that adding more entry points for the sake of one extra parameter adds redundant bloat to the API, which may just turn into an OpenGL-like "maybe, maybe not" sort of extension.

I might rephrase this as a criticism of the particular design choices illustrated and their impact on an application, rather than a suggestion for an extension. Frankly, I dislike all similar APIs (like WaitForMultipleObjects) for this reason as well, but the issue should have been addressed at design time.

What would actually help much more than having to fuss around with fences would be an IOCP-like interface where the application supplies ONE struct pointer per submit, and it pops out the completion end of the queue when it's done or errors out.

The problem is that we have an inversion of control from the device to the application, and IMO clumsy select-style loops over fences are not appropriate for high-performance graphics.

warp-10 commented 7 years ago

@ShabbyX I generally try to follow a data-oriented philosophy when designing as well. The problem is more about how individual accesses to VkXXX handles eat up entire SIMD registers one at a time. The driver code is not likely written to pull out at most 4 or 8 handles at a time, not that doing so would be of any benefit across an implied system call.

If I can get away with fitting 2 void* (64-bit) or 4 uint32_t inside one register and process one "object" per register, I go that route (e.g. the typical float4). So, if there is any possibility, I try to stuff multiple handles into conveniently register or cacheline-sized structs and split across multiple arrays that way.

Some things, however, notably mutexes (the pointer-sized, statically initializable variety), have to be address-stable - meaning they can't be put into arrays. This makes life very difficult, especially since we have arrays of queues, arrays of buffer pools, and so on that all need host synchronization. Unfortunately, it seems to reduce the userland "driver" on top of Vulkan to a bunch of hash tables that point to standalone allocations anyway (might as well stuff some other things into the same struct with the mutex, etc...).
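
As a purely illustrative sketch of that pattern (every name here is made up for the example, not taken from any real codebase):

struct QueueSlot
{
    std::mutex lock;   // must keep a stable address, so QueueSlot can't live in a reallocating vector
    VkQueue    queue;
    // ...other per-queue bookkeeping that might as well sit next to the mutex
};

// The table can rehash freely because the slots themselves never move.
std::unordered_map<VkQueue, std::unique_ptr<QueueSlot>> queueSlots;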

So what ends up happening in that case is that we have to pull handles out of structs and put them into arrays (cache miss, then miss again, possibly an allocation failure), then submit those to the driver, wait for the driver to respond via fence signalling, recover whatever was signaled, and marshal state changes back into the structures.

An IOCP-style interface would let the application push a pointer to an extension of- or member VkSomething inside a wrapper into a queue without having to shuffle so much extra data around.

krOoze commented 6 years ago

I am gonna pin the forum discussion here: https://forums.khronos.org/showthread.php/13653-Contiuation-of-Github-650/page3 It contains a more detailed proposition in the last post.

ratchetfreak commented 6 years ago

Isn't it possible to do the same thing using a vector of

struct FenceWaits{
    VkFence fence;
    void* data;
};

Then the completion queue wait is basically:

std::vector<FenceWaits> fenceWaits;

auto clr = std::partition(fenceWaits.begin(), fenceWaits.end(), [&](FenceWaits &fenceWait){
    VkResult res = vkGetFenceStatus(device, fenceWait.fence);
    return res == VK_NOT_READY;
});

std::vector<void*> completed;
completed.reserve(fenceWaits.end() - clr);
std::transform(clr, fenceWaits.end(), std::back_inserter(completed), [](FenceWaits &fenceWait){
    return fenceWait.data;
});
fenceWaits.erase(clr, fenceWaits.end());

replace void* with whatever you need.

The only niggle here is managing the fences and getting them recycled, because multiple FenceWaits may be associated with the same fence. However, if you add a few reasonable restrictions to its use, this is still fully safe.

  1. there is only 1 fenceWaits buffer

  2. access to the fenceWaits buffer is only allowed by a single thread at a time.

  3. all FenceWaits related to a single fence must be pushed to the fenceWaits buffer either before the vkQueueSubmit of the corresponding batch, or after the submit in at most a single atomic operation for all the remaining ones (note that this is an exclusive or; that way duplicate fences that may already be signaled are never pushed onto the buffer).

With a lock-free queue as a staging buffer (that you drain into fenceWaits before the std::partition), the submitting threads don't have to take a lock during submits.

These restrictions allow you to extract all unique fences from the completed part then reset them and push them to a freshFences queue for reuse.
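
For illustration only, that recycling step might look roughly like this, run over the completed range before the fenceWaits.erase(...) in the snippet above (device and a freshFences vector are assumed to be in scope):

// Collect the fences from the completed range, de-duplicate them
// (several FenceWaits can share one fence), reset them, and recycle.
std::vector<VkFence> doneFences;
for (auto it = clr; it != fenceWaits.end(); ++it)
    doneFences.push_back(it->fence);

std::sort(doneFences.begin(), doneFences.end());
doneFences.erase(std::unique(doneFences.begin(), doneFences.end()), doneFences.end());

vkResetFences(device, (uint32_t)doneFences.size(), doneFences.data());
freshFences.insert(freshFences.end(), doneFences.begin(), doneFences.end());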

krOoze commented 6 years ago

I am not quite ready to champion this, but what I got from the discussion there:

warp-10 commented 6 years ago

@ratchetfreak: The problem with the current situation is signaling the waiting thread when a fence is added to the wait list. Fences cannot be signaled directly by the host, so a separate command buffer and queue are needed just for an empty submit (this is a big problem if the device exposes exactly one queue). This is also why PostQueuedCompletionStatus exists for the Win32 implementation.
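
For reference, a hedged sketch of the kind of empty-submit workaround being described (spareQueue and wakeFence are assumptions for the example):

// "Separate queue" hack: an empty submission whose only purpose is to get
// wakeFence signaled so a thread blocked in vkWaitForFences wakes up.
void kickWaiter(VkQueue spareQueue, VkFence wakeFence)
{
    VkSubmitInfo empty = {};
    empty.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;      // no command buffers, no semaphores
    vkQueueSubmit(spareQueue, 1, &empty, wakeFence);  // fence signals once the empty batch executes
}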

@krOoze:

ratchetfreak commented 6 years ago

Note: that algorithm doesn't actually wait on any fences; instead it checks whether any fence is signaled, handles those, and returns.

In most cases you can make that thread do other useful stuff instead.

In the worst case you can do something like

while(!atomic_load(&needWake) &&
      VK_TIMEOUT == vkWaitForFences(dev, (uint32_t)fenceVec.size(), fenceVec.data(), VK_FALSE, 1000000))
{
}

To keep the thread checking for changes that need to be made to fenceVec. It does introduce some latency to when the fence can be noticed.

warp-10 commented 6 years ago

@ratchetfreak

I appreciate everyone's engagement on this topic so far.

I do not consider polling a solution unless there is no other option, and while this API is still "evolving" there is the possibility of adding a symmetric vkSignalFence or similar. If a thread can wait on a primitive, that same primitive can be used to wake the thread. This is symmetry, and it is trivial to implement in the case of an OS primitive - whatever the driver is using to put the thread to sleep can be used to wake it up.

Polling is acceptable when working with a protocol that does not support any notion of "handshake". The purpose of a fence, in this case, is to form a "handshake" with the device. The missing half is not being able to signal a fence. An application signaling a fence "knows" what to expect from doing this, so the definition of "well-defined" behavior exists outside of the Vulkan spec in this case. That said, fences would be completely unnecessary (and easy for an application to emulate) in an IOCP-based design, hence this issue.

There are extensions for external semaphores, yet none for fences. What I'm doing is essentially the same as compiling SQL statements and submitting them to a remote database. Like display lists, except clients can change anything externally at the cost of a local re-compile. As such, there is little utility in being able to signal a semaphore on which the device is waiting. Prepare all data before submission, don't force the device, driver, or server to wait while some buffering is going on. Buffering etc... are things that should be happening during frame preparation, not after submission. Submission is the client saying "Here, start this transaction. I'm done preparing it". But, this is just my opinion.

Now, instead of just having the device attached to a PCI port, or resident on the processor silicon, it's also being controlled through an RPC layer. That RPC layer now needs to implement the Vulkan API (most recently in my case, something isomorphic to it), as well as emulate any local behavior. This emulation is able to signal a "fence" without needing to round-trip over a possibly slow network connection; being able to signal the fence on the remote machine would avoid the need for the "separate queue" hack to keep the server responsive.