Try / OpenGothic

Reimplementation of Gothic 2 Notr
MIT License
1.07k stars 78 forks source link

Crash unless meshlets disabled #577

Open hwnde opened 3 months ago

hwnde commented 3 months ago

Windows 11, fresh install, commit 26b816e36d05832f90851d2924209f14e6d88050 Game crashes when starting a new game or loading an existing save. Disabling meshlets resolves the issue. crash.log

dreimer1986 commented 3 months ago

https://github.com/Try/OpenGothic/issues/451#issuecomment-1937670689

Lemme guess... AMD graphics card? (And I answer it myself: AMD Radeon RX 6800 XT)

Try commented 3 months ago

AMD commercial driver has fully broken mesh-shading unfortunately. If you have time - please send bug-reports to them, as we need a driver fix.

hwnde commented 3 months ago

Could you elaborate on what the issue is? I couldn't find anything in Google. Also what is the consequence of disabling meshlets? Are there any visual differences or it's only a performance optimization?

dreimer1986 commented 3 months ago

AMD commercial driver has fully broken mesh-shading unfortunately. If you have time - please send bug-reports to them, as we need a driver fix.

AMD seems to no care much about these reports though. We sent one already. I think they disagree with that being their fault, even though OpenGothic runs on Intel and NVidia cards just fine.

Try commented 3 months ago

I think they disagree with that being their fault, even though...

Even though OpenGothic runs on same hardware with opensource driver just fine :)

Could you elaborate on what the issue is? Also what is the consequence of disabling meshlets?

Consequence is performance. In Vulkan or DX12 or any other graphics api, there is a base spec with basic functionality and extensions. In OpenGothic engine checks for mesh-shader extension, and uses it when possible to draw mesh-clusters. If no mesh-shader game can still run, with fallback to triangle-lists. And triangle list is part of base spec - so should work. Reason to no use triangle-list is total lack of vertex-reuse, what forces engine to do 3-times more work in geometry. Another problem is occupancy. On NVidia triangle-list, capt to 93% utilization rate in theory... in tests is as bad as 6%

On a bug side - pretty certain that it's AMD bug. When engine calls to vkCreateGraphicsPipelines, it doesn't compile shader, and doesn't return en error code - it hard crashes deep in amdvlk64.dll.

hwnde commented 3 months ago

I can try to debug the issue to figure out what exactly causes it and whether the shader can be modified to compile. But I don't see any instructions for building and debugging the project in Windows. Does anyone have experience with it?

dreimer1986 commented 3 months ago

MSVC 2022 and the project downloaded off a private fork. There were some dependencies to fulfill, but that was a while ago I had to do that. Try these easy steps first and then we will see what's missing.

Likely you will need:

Likely I forgot something, but this is a good start to minimize the problems.

Try commented 3 months ago

debugging the project in Windows. Does anyone have experience with it?

cmake + VulkanSDK + windows-sdk should be enough. No need to install glslang or set environment variables - VulkanSDK installer will take care of it.

CMakeLists.txt can be opened with Visual-Studio (open folder) and should work just fine. If there is any issues with bring up - let me know.

Shader that you interested in is main.vert - it has multiple variants, based on geometry-type and other factors.

hwnde commented 2 months ago

I've got some results. The culprit is this line: shOut[laneID] = var;

I managed to get the shader to work by replacing it with this:

#if defined(MAT_COLOR)
    shOut[laneID].color = var.color;
#endif
#if defined(MAT_UV)
    shOut[laneID].uv = var.uv;
#endif
#if defined(MAT_POSITION)
    shOut[laneID].pos.x = var.pos.x;
    shOut[laneID].pos.y = var.pos.y;
    shOut[laneID].pos.z = var.pos.z;
#endif
#if defined(MAT_NORMAL)
    shOut[laneID].normal.x = var.normal.x;
    shOut[laneID].normal.y = var.normal.y;
    shOut[laneID].normal.z = var.normal.z;
#endif

Assigning pos or normal directly shOut[laneID].pos = var.pos; results in a crash. I don't know why. I guess for some reason it doesn't like vec3 since color and uv work correctly. I can create a PR with this solution unless there are better ideas.

Try commented 2 months ago

Thanks @hwnde , for investigating it!

I can create a PR with this solution unless there are better ideas.

This is an unease split:

My suggestion here is to report this issue again, to AMD: someone who owns their gpu can use https://www.amd.com/en/support/kb/faq/amdbrt# In report it's important to make clear that problem scope is not a game, but rather a small part of the shader. It seems they are unable to unroll varying vec3 assignment properly. float/vec2/vec4 are safe, likely due to power of two size.

I'm not really aware, if they have other way of bug reporting specifically for game-developers, who does not own their gpu.

Secondary option: disable mesh-shader on AMD by default - call it broken. Two major downsides here: I'm not happy to have conditionals with HW-vendor names, and we will ruin performance for MESA users.

hwnde commented 2 months ago

I'm not really aware, if they have other way of bug reporting specifically for game-developers, who does not own their gpu.

I think this is where such issues are reported: https://community.amd.com/t5/opengl-vulkan/bd-p/opengl-and-vulkan-discussions

I can post an issue and see if AMD responds. However it's still an open question what to do with this issue. There is no guarantee that AMD will do a fix any time soon or ever. I'd argue that doing the shader workaround is the approach with least downsides right now. First of all, although it addresses an issue with specific driver, the code itself will work with any driver and any vendor, we don't have to do any vendor detection here. On the other hand disabling mesh shader on AMD will require vendor detection and also will deprive AMD users on Windows (which are the vast majority of all AMD users) from performance benefits that come with mesh shaders.

Try commented 2 months ago

Started a post on AMD for forum: https://community.amd.com/t5/newcomers-start-here/driver-crash-when-using-vk-ext-mesh-shader/m-p/667490#M2963

Feel, free to join discussion :)

I'd argue that doing the shader workaround is the approach with least downsides right now.

I'm still worry, about maintaining it, since shader code is always something to edit. Let, see how fast they will respond to the report; Maybe merging fix + add a comment with URL to community.amd.com + comment when it's should be safe to remove workaround is the way

AnyOldName3 commented 2 months ago

I'm an OpenMW graphics developer who's been sent a link to this, and have a couple of recommendations:

Try commented 2 months ago

Hi, @AnyOldName3 and thanks for recommendation!

Get at least one of your team to join AMD's Vanguard program.

That would be nice. Unfortunately can't take opportunity myself - due to lack of AMD gpu. Maybe @hwnde or @dreimer1986 are interested?

And one more option: sneak a test to Vulkan-CTS, that uses mesh-shader + vec3 :D