Try / Tempest

3d graphics engine
MIT License

Improper handling of errors during swapchain creation / swapchain creation failures #64

Closed lmichaelis closed 1 month ago

lmichaelis commented 2 months ago

Heyo, I've been having problems running OpenGothic on Fedora 40 with an NVIDIA GeForce RTX 3060 Laptop GPU (proprietary drivers, version 550.78). On startup, swapchain creation

https://github.com/Try/Tempest/blob/e85b2926e170e3290d9a56dd9f10dd32ac5b8cf0/Engine/gapi/vulkan/vswapchain.cpp#L62

fails here,

https://github.com/Try/Tempest/blob/e85b2926e170e3290d9a56dd9f10dd32ac5b8cf0/Engine/gapi/vulkan/vswapchain.cpp#L309-L310

throwing a DeviceLostException which is caught here,

https://github.com/Try/Tempest/blob/e85b2926e170e3290d9a56dd9f10dd32ac5b8cf0/Engine/gapi/vulkan/vswapchain.cpp#L64-L66

leading to swapchain cleanup. It then tries to wait for a fence here,

https://github.com/Try/Tempest/blob/e85b2926e170e3290d9a56dd9f10dd32ac5b8cf0/Engine/gapi/vulkan/vswapchain.cpp#L76

which hangs indefinitely. Now, I don't know why it would fail during initialization in the first place, but here are some ideas:

Whatever the case, it shouldn't hang indefinitely when it fails swapchain creation.

Try commented 1 month ago

Hi, @lmichaelis !

This is actually quite a horrible bug to deal with...

So, basically VK_ERROR_OUT_OF_DATE_KHR means that:

A surface has changed in such a way that it is no longer compatible with the swapchain, and further presentation requests using the swapchain will fail. Applications must query the new surface properties and recreate their swapchain if they wish to continue presenting to the surface.

This case is not possible on Windows (due to how WSI works there), but on X11, where the window can change state asynchronously from what the UI thread observes, this is apparently what happens. Basically, the engine has to retry swapchain creation until some different error code is received.
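A minimal sketch of that retry idea, written against stand-in types so it compiles without the Vulkan headers. The `Result` enum, `CreateOutcome`, `createWithRetry`, and the `maxAttempts` cap are all hypothetical names for illustration and are not taken from Tempest; real code would check `VkResult` against `VK_ERROR_OUT_OF_DATE_KHR` and re-query the surface properties before each new attempt.

```cpp
#include <cstdint>
#include <functional>

// Stand-in for the VkResult values that matter here (hypothetical, not the real enum).
enum class Result { Success, OutOfDate, DeviceLost };

// What the loop reports back: the final code and how many attempts were made.
struct CreateOutcome {
  Result        code;
  std::uint32_t attempts;
};

// Calls `tryCreate` until it returns anything other than OutOfDate,
// or until `maxAttempts` calls have been made (the cap is an assumption,
// added so the sketch cannot spin forever).
inline CreateOutcome createWithRetry(const std::function<Result()>& tryCreate,
                                     std::uint32_t maxAttempts) {
  Result        code    = Result::OutOfDate;
  std::uint32_t attempt = 0;
  while(attempt < maxAttempts) {
    ++attempt;
    code = tryCreate();
    if(code != Result::OutOfDate)
      break; // success, or a different error that retrying will not fix
    // Real code would query the new surface properties here before retrying.
  }
  return {code, attempt};
}
```

Any result other than "out of date" ends the loop, so a genuine failure such as a lost device is still reported to the caller instead of being retried.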

Whatever the case, it shouldn't hang indefinitely when it fails swapchain creation.

Apparently vkAcquireNextImageKHR failed to finish but still put the fence into a waiting state, even though the associated work was never issued.

The Vulkan driver should not be the problem, since vkcube runs without issue

That should not be representative. AFAIR vkcube, like most Vulkan educational apps, does ad-hoc initialization of the window/swapchain. Application code is way more complex (also, thx AMD here for unordered acquire: this is why the semaphore/fence spaghetti is around :) )

I do have an Optimus capable laptop with an integrated and a dedicated GPU. Maybe Tempest is selecting the wrong one?

Should not matter: Tempest doesn't auto-select, the game does. OpenGothic gives priority to the dedicated GPU (see main.cpp).

"DRM kernel driver 'nvidia-drm' in use. NVK requires nouveau."

I'm not familiar with that message; Google says it is related to the open-source driver, not the proprietary one.

Try commented 1 month ago

Did vkAcquireNextImageKHR return something to &id? If so, maybe the error code can be ignored in the constructor case.

lmichaelis commented 1 month ago

Did vkAcquireNextImageKHR return something to &id? If so, maybe the error code can be ignored in the constructor case.

Yes indeed it returns a value like this: 4294967295

Try commented 1 month ago

4294967295, aka uint32_t(-1), is not a valid value; it is just what id is initialized to before the call. The correct result should be in the range 0..2: this is the ID of the back-buffer image.
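To make the arithmetic concrete, here is a small sketch showing that `uint32_t(-1)` is exactly 4294967295 and how a returned index could be validated. `kNoImage`, `isValidImageIndex`, and the `imageCount` parameter are hypothetical names for illustration; an image count of 3 is assumed only because it matches the 0..2 range mentioned above.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical name for the "nothing was written" sentinel.
constexpr std::uint32_t kNoImage = std::uint32_t(-1);

// Converting -1 to an unsigned 32-bit integer wraps around to 2^32 - 1.
static_assert(kNoImage == 4294967295u, "uint32_t(-1) is 4294967295");
static_assert(kNoImage == std::numeric_limits<std::uint32_t>::max(),
              "which is also the largest uint32_t");

// Returns true only if `id` names one of the `imageCount` back-buffer images.
// The sentinel always fails this check because it is larger than any real count.
constexpr bool isValidImageIndex(std::uint32_t id, std::uint32_t imageCount) {
  return id < imageCount;
}
```

A check like this tells "the call wrote a real index" apart from "the variable still holds its initial value", which is the question being asked in this exchange.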

Today I actually realized that I was slightly wrong about the reason for the hang:

  vkWaitForFences(device.device.impl,1,&f,VK_TRUE,std::numeric_limits<uint64_t>::max()); // wait for any already issued workload
  vkResetFences(device.device.impl,1,&f); // clear VkFence to non-signaled state

  uint32_t id   = uint32_t(-1);
  VkResult code = vkAcquireNextImageKHR(device.device.impl,
                                        swapChain,
                                        std::numeric_limits<uint64_t>::max(),
                                        slot.acquire,
                                        f, // fence should be in pending state after a successful acquire
                                        &id);
  if(ignoreSuboptimal && code==VK_SUBOPTIMAL_KHR)
    code = VK_SUCCESS;

Basically, when an error code is returned, the fence ends up in an inconsistent state: it waits for something that was never issued.

Try commented 1 month ago

I've pushed my solution to the hang. Unfortunately there are no good options in vanilla Vulkan other than recreating the fence entirely.
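The reasoning can be illustrated with a toy model of the fence, kept free of Vulkan calls so it stays self-contained. Everything here (`FenceModel`, `waitWouldReturn`, `reset`, `acquire`, `recreate`) is a hypothetical stand-in, not Tempest code, and it assumes the replacement fence is created already signaled (as `VK_FENCE_CREATE_SIGNALED_BIT` would do), which the thread does not confirm.

```cpp
// Toy model: a fence is either signaled, or it may have a signal operation pending.
struct FenceModel {
  bool signaled      = true;  // assumed: fences start out signaled
  bool signalPending = false; // true when submitted work will signal the fence later
};

// A wait with an infinite timeout returns only if the fence is signaled
// or something is going to signal it.
inline bool waitWouldReturn(const FenceModel& f) {
  return f.signaled || f.signalPending;
}

// Mirrors vkResetFences: the fence goes back to the non-signaled state.
inline void reset(FenceModel& f) { f.signaled = false; }

// Models the acquire call: a signal is scheduled only when the call succeeds.
inline void acquire(FenceModel& f, bool succeeds) { f.signalPending = succeeds; }

// Recovery after a failed acquire: replace the fence with a fresh one.
inline void recreate(FenceModel& f) { f = FenceModel{}; }
```

In this model, `reset` followed by a failed `acquire` leaves a fence that nothing will ever signal, so the next infinite wait cannot return; `recreate` is the only operation that gets it back to a waitable state.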

Can you please check whether it helps with the hang? Thanks!

lmichaelis commented 1 month ago

Yes it does, thanks!

Try commented 1 month ago

In 2733ed2 I've added a swapchain creation loop that should be able to handle X11 shenanigans.

I had to wrap vkAcquireNextImage purely for the sake of debugging; unfortunately, this is the only way for me to reproduce such a failure on Windows.

lmichaelis commented 1 month ago

Amazing! The crash is gone and OpenGothic now works correctly. Thank you!

Try commented 1 month ago

Out of curiosity: how many attempts does it take for the engine to allocate the swapchain? Relevant code:

void VSwapchain::createSwapchain(VDevice& device) {
  for(uint32_t attempt=0; ; ++attempt) {

lmichaelis commented 1 month ago

Out of curiosity: how many attempts does it take for the engine to allocate the swapchain?

The function gets called multiple times and it takes between 1 and 4 attempts to perform the operation successfully (i.e. reaching the break) during the first call. Subsequent calls always acquire the swapchain immediately. How many attempts it takes depends on how fast the code runs: When I debug it, it takes 1 to 2 attempts but when I printf the result like so it takes up to 4.

image

Interesting too is this: If I change the GPU selection to "integrated only", the swapchain can always be created immediately.