Magic driver limits not exposed by the Vulkan API causing app crashes.

Unarmed1000 commented 1 year ago

I recently became aware that some Vulkan drivers seem to have arbitrary 'limits' that the app cannot discover at runtime and that can cause the app to crash.

For example, calling vkCmdDrawIndexed with too high an instance count can cause the Arm Vulkan driver to return VK_ERROR_DEVICE_LOST. The real problem is that neither the maximum allowed instance count nor the information needed to calculate it for the actual device can be acquired from the Vulkan API, making it impossible to write an app that adapts to the hardware (or maybe even runs on all hardware).

So it seems weird to me that a conforming Vulkan driver is allowed to impose limits that do not seem to be allowed by the spec.

This article from Arm discusses the magic limits.

Important quotes:

"a Vulkan application might trigger an out of memory (OOM) condition on Mali GPUs. It results in a DEVICE_LOST error, even if the API usage is correct. "
"The limit is fixed to 180MB on current Mali GPUs, but it may be increased or lifted altogether in future GPUs."
"The only real solution to the issue is to keep the application’s vertex count below approximately 2 million."

So even ARM does not give any reliable way to calculate it; it's just 'try it out on our hardware and current driver version' (which can be compiled with different options to limit or increase the memory size), and the limit might change at any point in time at their will.
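As a rough illustration of what an app is left to do today, the sketch below sums an upper bound on the vertices submitted and compares it against the "approximately 2 million" figure from the quotes above. `DrawInfo`, `estimateVertexCount`, `exceedsAssumedBudget` and `kAssumedVertexBudget` are hypothetical names, not part of Vulkan. Treating index count times instance count as the relevant "vertex count" is an assumption, and so is the span of work the figure applies to, since no exact formula is given.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-draw description; this is not a Vulkan type.
struct DrawInfo {
    std::uint32_t indexCount;
    std::uint32_t instanceCount;
};

// Assumption: the "approximately 2 million" figure quoted from the article,
// hard-coded because the Vulkan API offers no way to query it.
constexpr std::uint64_t kAssumedVertexBudget = 2000000;

// Crude upper bound: every index of every instance counts as one vertex.
// 64-bit arithmetic avoids overflow when two 32-bit counts are multiplied.
std::uint64_t estimateVertexCount(const std::vector<DrawInfo>& draws) {
    std::uint64_t total = 0;
    for (const DrawInfo& d : draws) {
        total += static_cast<std::uint64_t>(d.indexCount) * d.instanceCount;
    }
    return total;
}

// True when the estimate is above the assumed (non-queryable) budget.
bool exceedsAssumedBudget(const std::vector<DrawInfo>& draws) {
    return estimateVertexCount(draws) > kAssumedVertexBudget;
}
```

Even this check only works for the one number published for one GPU family, which is the core of the complaint.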

For apps intended to just work on 'vulkan' no matter the gpu / driver vendor and platform, having magic hard limits like this that cannot be discovered from the Vulkan API seems buggy.

So any such driver-specific limit seems like something that the Vulkan API ought to provide information about so that apps have a chance of handling it, instead of having to discover crashes in the wild and then track down a random internet page to get a magic number that might change at any point in the future. This is not something you can write apps for.

If we just look at Android, apps have 20000+ hardware devices to support, and all of them come with multiple driver versions on top of that. Even if the GPU is the same, each physical device's driver might have been compiled with different limits that the app cannot discover. So just for the Android environment there is no way an app can be expected to be tested against, and know about, all possible limits that are not exposed by the API it uses. Furthermore, an Android app is written once and might run for 3-20+ years without updates (as it should be able to if it follows a spec). Basically, as long as Android allows the app to run and the Vulkan API version is available on a device, we should expect a conforming app to run without recompilation or special code paths.

Further details can be found here and here.

krOoze commented 1 year ago

One way to fight this would be shamelessly adding code doing that to the CTS.

akeley98NV commented 1 year ago

These sorts of low-level resource exhaustion issues seem a bit of a gray area in the spec. In a sense one can argue that things like TDR and the Linux oom-killer are out-of-spec as well, since there's no way to query the magic 2 second Windows timeout value or to query the magic Linux "pick process to kill" heuristic. It is an unfortunate spec backdoor, since we're essentially informally relying on vendors not to be shameless and to use device-lost only to report "real" failures, not as a get-out-of-jail-free card to avoid having to implement anything in Vulkan they don't like (e.g. not exporting to the user the responsibility to handle those pesky internal limits caused by weaknesses in your hardware architecture).

Unarmed1000 commented 1 year ago

I think history has shown that you can not rely on "vendors not to be shameless" :)

But it seems like the intent of the spec is that once you have successfully acquired your resources and start executing valid draw commands, the app should be allowed to expect those draw commands not to fail due to internal driver limits.

If this was a GLES2/3 app I would expect rendering just to be slower, never to crash the app.

For real-world app development we simply cannot expect apps to be tested on all possible hardware / software configurations in the world, because there are so many of them.

The worst part of this is that it is the app that will be blamed for the crash and not the driver vendor. If an Android app starts to crash on various devices because of an issue it has no chance of knowing about, it will be flagged by Google's crash statistics and be negatively affected by these crashes.

The limits are pretty bad for apps that target both really high-end and low-end devices. If such an app provides the user with configuration options to adjust to their device, some users will now get a crash instead of just slow performance.

Things like this really need to be exposed to the app somehow or simply not be allowed at all by the spec.

oddhack commented 1 year ago

I think history has shown that you can not rely on "vendors not to be shameless" :)

OT but the funniest example of this I can recall is the ISV who created an extension intended to benefit their CAD app, then convinced several OpenGL IHVs to support it as a "hidden" feature not exposed in GL_EXTENSIONS. I think they wanted to prevent other CAD apps from using it. The funniest part was that not even the IHVs knew their competitors were supporting it, and it came out kind of accidentally during ARB discussions IIRC.

akeley98NV commented 1 year ago

I do wonder what would happen if an effort were made to make CTS more accessible and crowdsource test cases from the wider Vulkan developer community. (At least for me the CTS code felt truly Byzantine without the personal assistance of someone already familiar with the codebase, compared to the validation layers which were very easy to contribute to).

Unarmed1000 commented 1 year ago

Just for the record, I tested an instancing app on both GLES3 and Vulkan on an Arm GPU. The GLES3 version renders slowly, as expected, but the Vulkan app crashes with the device lost error. As Vulkan does not allow the app to handle the issue in any currently known way, it basically means that Vulkan is a much worse proposition than GLES3, which should not be the case for the 'next gen' API.

SaschaWillems commented 11 months ago

Is this on a current ARM GPU? And if so did you contact the device vendor on this issue? I have seen similar issues, but usually with earlier (Vulkan supporting) devices.

Unarmed1000 commented 11 months ago

It's on a GPU that is still being used in new low-end devices produced this year, and will be for multiple years going forward, but it is classified as an older GPU.

Yes, I contacted the GPU vendor, and they consider it an 'out of memory' issue instead of a driver issue. So they have so far refused to solve it.

Hence the need to get it handled by the standard moving forward.

SaschaWillems commented 11 months ago

Yeah, I can see the need for this. But "out of memory" is probably the easiest case of them all. Can't say much, but I have seen things far worse than that on mobile that I don't think can ever be caught by CTS.

Unarmed1000 commented 11 months ago

Since this renders perfectly fine with the available memory on GLES 3, the "out of memory" explanation basically just seems to be an excuse to get away with having a 'simpler to implement' driver.

Why would their Vulkan driver not be able to 'segment' the instance rendering into what fits in available memory, like GLES3 does?

If a driver is not required to segment the instances as needed and instead relies on the app handling it, then Vulkan also needs functions that allow the app to calculate the limit.

I would 100% prefer slow rendering over crashes. At least in the case of slow rendering the user can then go to the app settings and lower the details.
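A minimal sketch of what app-side segmentation could look like, assuming the app somehow knew a safe per-draw instance count (which is exactly the value Vulkan does not expose). `drawInstancesChunked` and `maxInstancesPerDraw` are hypothetical names; the callback stands in for a real `vkCmdDrawIndexed` call, which takes `instanceCount` and `firstInstance` parameters.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Splits one logical instanced draw into several smaller ones.
// Calls draw(firstInstance, instanceCount) once per chunk, so that no single
// call covers more than maxInstancesPerDraw instances.
// Returns the number of draw calls issued.
// Assumption: maxInstancesPerDraw is a value the app had to guess or
// hard-code, because no Vulkan query provides it.
template <typename DrawFn>
std::uint32_t drawInstancesChunked(std::uint32_t firstInstance,
                                   std::uint32_t instanceCount,
                                   std::uint32_t maxInstancesPerDraw,
                                   DrawFn&& draw) {
    assert(maxInstancesPerDraw > 0);
    std::uint32_t drawCalls = 0;
    while (instanceCount > 0) {
        const std::uint32_t chunk = std::min(instanceCount, maxInstancesPerDraw);
        draw(firstInstance, chunk);  // e.g. record one vkCmdDrawIndexed here
        firstInstance += chunk;      // next chunk starts where this one ended
        instanceCount -= chunk;
        ++drawCalls;
    }
    return drawCalls;
}
```

With Vulkan, the callback would record something like `vkCmdDrawIndexed(cmd, indexCount, count, firstIndex, vertexOffset, first)`. Whether splitting draws inside one render pass is enough to stay under a driver's internal limit is not established in this thread; the chunks might have to go into separate render passes or submissions.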

KhronosGroup / Vulkan-Docs issue #2266