Make the vku::framework and vku::Window more flexible

FunMiles commented 4 years ago

I realize that the vku_framework.hpp was intended as a way to easily get started with the examples. However, I've been using it in a new project in which I want to be able to run more than one Vulkan driven window. Doing so, I discovered some weaknesses that for the most part can be remedied with a fairly limited amount of work. I am sure I will discover more issues to be worked on but here are the first things that come to mind:

The vku::Framework provides access to several queues, but no interlocking mechanism is present to make sure only one thread submits commands to a given queue at the same time. Addressing this is fairly easy via a wrapper that carries the queue at the same time as a mutex. For example, the draw routine uses graphicsQueue.submit(...). If the wrapper is used, then it is easy to write: graphicsQueueWrapper->submit(...). The operator-> of the wrapper returns an object that locks a mutex and has itself an operator-> that returns a pointer to the vk::Queue. The object lifetime is guaranteed to encompass the time of the submission and then it is unlock right after. From a user's point of view, it comes at no added complexity.
The vku::Framework should really behave like a singleton. It is easy to create two vku::Framework by mistake and get a crash when one's destructor is called. This also helps in having a single mutex per actual queue (the graphics, compute and present queues can all be the same, yet vku::Window obtains the presentation queue without reference to vku::Framework while the user must obtain the graphics and compute queues from the vku::Framework) .
The vku::Window has a nice separation of static and dynamic drawing parts. However the order of both drawings cannot be changed and the command buffer creation of the dynamic part is called right within the same function as the drawing. If there were a setDynamicCommands, then one could see the window being allowed to build the command buffers in separate threads.

As of now, it is the first point that is important to my use case and I have already tested the wrapper solution. Is that something of interest to other users?

andy-thomason commented 4 years ago

I'm happy for you to customise it as you wish. Real use cases will always drive software development. Best of all, develop a game engine or 3D application.

FunMiles commented 4 years ago

@andy-thomason I am definitely using the framework for my new code. I am discovering quite a few things to adapt and generalize. By the way, I had stated that I had a working threading but eventually discovered it was buggy. This led me to stackoverflow where a patient soul helped figure out the issue. I will be submitting a pull request with a threaded demo and will update all the demos to cache the queue, rather than asking for it on each draw call. It will take me some time though, as I need to address the framework singleton aspect. I am also working on generalizing the buffer system to easily adapt/use the popular Vulkan Memory Allocator without forcing to use it as I'm sure @lhog wouldn't like that! ;P .

FunMiles commented 4 years ago

After discovering the issue for the queue and threading, I am left wondering what approach to take. Here are my notes:

Any queue submission must be under a lock if you have threading.
If you are not going to have more than one thread doing threading, locking is not necessary
Locking when no contention exists is actually cheap

So the question becomes: (:heavy_check_mark: indicates current choices)

[X] Should only a locking version be provided?
[ ] Should a non-locking version be provided?
- [ ] via templating the queue argument?
- [ ] via an abstract interface class wrapping the queue and providing two versions, one locking, one not-locking?

Comments/opinions are welcome.

Something totally off-topic but that probably will benefit from multi-threaded rendering: I discovered ImGui and have started exploring it with Vulkan for the rendering. I think I may do an open-source demo app for image processing using Vookoo for rendering and ImGui on top for the GUI.

FunMiles commented 4 years ago

I have a fairly complete system for synchronized queue access at https://github.com/FunMiles/Vookoo/tree/lock_guard_queues There is one example opening two windows: 06-parallelTriangles. I am ready to do a PR. I have two pending PRs that have not been merged into the mainline though. Should merge them into my code and have a single full PR?

andy-thomason commented 4 years ago

That looks very good. The example is relatively short, which is very good.

It is worth timing the calls with rdtsc as mutexes can be expensive if implemented badly. At best they are a spin lock on an atomic boolean - at worst an O/S call. One of the downfalls of OpenGL was the need for a mutex on every call.

On Thu, Apr 23, 2020 at 4:04 PM FunMiles notifications@github.com wrote:

I have a fairly complete system for synchronized queue access at https://github.com/FunMiles/Vookoo/tree/lock_guard_queues There is one example opening two windows: 00-parallelTriangles. There is no cleanup for closing the windows. I need to do that before submitting a PR.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/andy-thomason/Vookoo/issues/30#issuecomment-618448863, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAL36XEWG44BF227WG7R7NLROBKHTANCNFSM4MKEOL5A .

FunMiles commented 4 years ago

I do not have experience measuring mutex costs on Windows yet, but I've dealt with them a lot on Unix based system and they are cheap. The important thing to remember is that they should cost only if there is contention. And a good implementation should spin-lock for a bit before going to the OS. The Vulkan layers for example actually use mutexes to synchronize access to maps that record access to the Vulkan objects from various threads. They did two things to optimize:

They have more than one map to try to minimize the likelihood of contention
They made sure mutexes end up in individual cache lines to avoid false sharing I am using one mutex for each queue but can also make sure they end up in individual cache lines. I will add the timing and try it on Windows/Linux/Mac over the weekend.

One thing you may notice in the last update I made is that rather than submit the static and dynamic command buffers to the queue separately, I've merged them into a single submission to reduce the number of lock/unlock of the queue mutex.

On my Mac, there are three queue families, all capable of doing graphics/compute/transfer. Each only has one queue at most. One could try to better pick queue families/queues and try to create multiple queue to assign to various cases to minimize contention likelihood. A round-robin distribution maybe?

FunMiles commented 4 years ago

I have started doing rdtsc timings and on the Mac. Preliminary timings. They are a measurement of the whole timing for a call to the queue submit routine. Not a timing of the cost of just the mutex itself as this is something fairly ill-defined since a mutex can only be acquired when it has been released by any other user.

In windows:
- In debug mode, the average times for 2 windows being opened is 0.5 ms drops to 0.3ms with only one window.
- In release mode, the average time is independent of the number of windows opened and is about 0.26ms
In linux:
- In debug, independently of the number of windows, 0.064ms
- In release 0.06ms
On Mac:
- Two windows open: 16ms!!!! one window: 0.28 ms

The Mac timings are very strange. I may have to raise that with the MoltenVK development team. Some other messages about swapchain image request misuse is puzzling me when i replace the argument to submitKHR with a pointer rather than the reference, I get:

00000008 debugCallback: [ VUID-vkAcquireNextImageKHR-swapchain-01802 ] Object: 0x40000000004 (Type = 27) | vkAcquireNextImageKHR: Application has already previously acquired 3 images from swapchain. Only 2 are available to be acquired using a timeout of UINT64_MAX (given the swapchain has 3, and VkSurfaceCapabilitiesKHR::minImageCount is 2). The Vulkan spec states: If the number of currently acquired images is greater than the difference between the number of images in swapchain and the value of VkSurfaceCapabilitiesKHR::minImageCount as returned by a call to vkGetPhysicalDeviceSurfaceCapabilities2KHR with the surface used to create swapchain, timeout must not be UINT64_MAX (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkAcquireNextImageKHR-swapchain-01802)

andy-thomason / Vookoo

Make the vku::framework and vku::Window more flexible #30