Open Venemo opened 5 years ago
So, that's what the ACO developers have been doing for the past month.
Just tested aco-navi branch with The Witcher 3 in Wine-esync+dxvk (Sapphire Pulse RX 5700 XT). It causes a GPU hang with this in dmesg (computer is accessible remotely through ssh):
[ 52.097894] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[ 57.207014] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=2248, emitted seq=2251
[ 57.207080] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process witcher3.exe pid 2612 thread witcher3.exe pid 2687
[ 57.207083] [drm] GPU recovery disabled.
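When comparing hang reports it can help to pull the ring name and the two sequence numbers out of the `ring ... timeout` line. The following is a minimal sketch that assumes the line keeps the shape shown in the log above; the function name `parse_ring_timeout` is made up for illustration. Reading the gap between `emitted` and `signaled` as the number of fences still outstanding is an interpretation, not something stated in this thread.

```python
import re

# Matches the amdgpu ring timeout line as it appears in the dmesg output above.
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return (ring, signaled, emitted), or None if the line is not a ring timeout."""
    match = RING_TIMEOUT.search(line)
    if match is None:
        return None
    return match.group("ring"), int(match.group("signaled")), int(match.group("emitted"))


line = (
    "[ 57.207014] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
    "ring gfx_0.0.0 timeout, signaled seq=2248, emitted seq=2251"
)
ring, signaled, emitted = parse_ring_timeout(line)
# Assumed interpretation: emitted - signaled = fences that never signaled.
print(ring, emitted - signaled)
```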
@shmerl Basically, most GPU hangs look like that in dmesg, so that message doesn't bring us closer to finding the problem. Can you try to identify which kind of shader causes the hang? As a first step, can you try disabling CS in ACO? In radv_pipeline.c you can edit radv_aco_supported_stage; just comment out the CS support there.
Also please compile mesa in debug mode, just to see if it hits any assertions and such.
Some fixes since yesterday:
Compiled with debug and captured some output. Without disabling CS:
info: Game: witcher3.exe
info: DXVK: v1.3.4
warn: OpenVR: Failed to locate module
info: Enabled instance extensions:
info: VK_KHR_get_physical_device_properties2
info: VK_KHR_surface
info: VK_KHR_win32_surface
WARNING: Experimental compiler backend enabled. Here be dragons! Incorrect rendering, GPU hangs and/or resets are likely
WARNING: radv is not a conformant vulkan implementation, testing use only.
info: AMD RADV/ACO NAVI10 (LLVM 10.0.0):
info: Driver: 19.2.99
info: Vulkan: 1.1.107
info: Memory Heap[0]:
info: Size: 7920 MiB
info: Flags: 0x1
info: Memory Type[0]: Property Flags = 0x1
info: Memory Heap[1]:
info: Size: 256 MiB
info: Flags: 0x1
info: Memory Type[2]: Property Flags = 0x7
info: Memory Heap[2]:
info: Size: 8176 MiB
info: Flags: 0x0
info: Memory Type[1]: Property Flags = 0x6
info: Memory Type[3]: Property Flags = 0xe
info: D3D11CoreCreateDevice: Probing D3D_FEATURE_LEVEL_11_0
info: D3D11CoreCreateDevice: Using feature level D3D_FEATURE_LEVEL_11_0
info: Device properties:
info: Device name: : AMD RADV/ACO NAVI10 (LLVM 10.0.0)
info: Driver version : 19.2.99
info: Enabled device extensions:
info: VK_EXT_conditional_rendering
info: VK_EXT_depth_clip_enable
info: VK_EXT_host_query_reset
info: VK_EXT_memory_priority
info: VK_EXT_shader_demote_to_helper_invocation
info: VK_EXT_shader_stencil_export
info: VK_EXT_shader_viewport_index_layer
info: VK_EXT_transform_feedback
info: VK_EXT_vertex_attribute_divisor
info: VK_KHR_create_renderpass2
info: VK_KHR_dedicated_allocation
info: VK_KHR_depth_stencil_resolve
info: VK_KHR_descriptor_update_template
info: VK_KHR_draw_indirect_count
info: VK_KHR_driver_properties
info: VK_KHR_get_memory_requirements2
info: VK_KHR_image_format_list
info: VK_KHR_maintenance1
info: VK_KHR_maintenance2
info: VK_KHR_sampler_mirror_clamp_to_edge
info: VK_KHR_shader_draw_parameters
info: VK_KHR_swapchain
info: Device features:
info: robustBufferAccess : 1
info: fullDrawIndexUint32 : 1
info: imageCubeArray : 1
info: independentBlend : 1
info: geometryShader : 1
info: tessellationShader : 1
info: sampleRateShading : 1
info: dualSrcBlend : 1
info: logicOp : 1
info: multiDrawIndirect : 1
info: drawIndirectFirstInstance : 1
info: depthClamp : 1
info: depthBiasClamp : 1
info: fillModeNonSolid : 1
info: depthBounds : 1
info: multiViewport : 1
info: samplerAnisotropy : 1
info: textureCompressionBC : 1
info: occlusionQueryPrecise : 1
info: pipelineStatisticsQuery : 1
info: vertexPipelineStoresAndAtomics : 0
info: fragmentStoresAndAtomics : 1
info: shaderImageGatherExtended : 1
info: shaderStorageImageExtendedFormats : 1
info: shaderStorageImageReadWithoutFormat : 0
info: shaderStorageImageWriteWithoutFormat : 1
info: shaderClipDistance : 1
info: shaderCullDistance : 1
info: shaderFloat64 : 1
info: shaderInt64 : 1
info: variableMultisampleRate : 1
info: VK_EXT_conditional_rendering
info: conditionalRendering : 1
info: VK_EXT_depth_clip_enable
info: depthClipEnable : 1
info: VK_EXT_host_query_reset
info: hostQueryReset : 1
info: VK_EXT_memory_priority
info: memoryPriority : 1
info: VK_EXT_shader_demote_to_helper_invocation
info: shaderDemoteToHelperInvocation : 1
info: VK_EXT_transform_feedback
info: transformFeedback : 1
info: geometryStreams : 1
info: VK_EXT_vertex_attribute_divisor
info: vertexAttributeInstanceRateDivisor : 1
info: vertexAttributeInstanceRateZeroDivisor : 1
info: Queue families:
info: Graphics : 0
info: Transfer : 0
info: DXVK: Read 471 valid state cache entries
info: DXVK: Using 16 compiler threads
warn: DXGI: VK_FORMAT_D24_UNORM_S8_UINT -> VK_FORMAT_D32_SFLOAT_S8_UINT
Hangs after that.
When disabling CS, it hangs too fast, and capturing the output just produces an empty log for me.
Let me know if you need the game to test it; maybe the developers can provide a key.
All DXVK/D9VK games that I've tested gpu hang on launch for me, so it might not be just a Witcher 3 issue.
Here's the list: GTA4, GTA5, The Witcher 3, Mirror's Edge.
And, assuming I'm commenting out the correct line (stage == MESA_SHADER_COMPUTE), disabling CS has no effect.
I'm also using a custom card (MSI Evoke 5700xt) so maybe that's related.
Thanks guys for your testing. I haven't tested any DXVK games yet so it's entirely possible that their shaders do something that aco-navi isn't prepared for. I'm currently working on implementing subgroup shuffles, but I promise I'll look into what is going on with DXVK.
@Venemo Actually, even vkcube causes a GPU hang for me.
With the following error:
amdgpu: radv_amdgpu_cs_query_fence_status failed.
amdgpu: The CS has been rejected, see dmesg for more information.
vk: error: failed to submit CS 0
Does that mean I failed to disable ACO's CS, or that Mesa's CS is hanging as well? It doesn't hang with normal LLVM Mesa, for the record.
@aqxa1 @shmerl Are you guys sure that you disabled NGG during your testing? I think I mentioned in the original post that NGG is not implemented in ACO yet, but just to clarify I now edited the post and added a few env vars to "How to test".
This works for me without hanging:
RADV_DEBUG=nongg,nocache RADV_PERFTEST=aco vkcube
@Venemo That fixes the issue, thanks. I had assumed it just wasn't used rather than it needing to be explicitly disabled.
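For scripted test runs, the same two variables can be set from Python. This is a small sketch only: the helper name `radv_env` is hypothetical, and the flag values are the ones quoted in the command above.

```python
import os


def radv_env(debug=(), perftest=(), base=None):
    """Return a copy of the environment with RADV_DEBUG / RADV_PERFTEST set.

    base=None inherits the current process environment.
    """
    env = dict(os.environ if base is None else base)
    if debug:
        env["RADV_DEBUG"] = ",".join(debug)
    if perftest:
        env["RADV_PERFTEST"] = ",".join(perftest)
    return env


# base={} is used here only so the printed result is easy to read.
env = radv_env(debug=["nongg", "nocache"], perftest=["aco"], base={})
print(env)
```

To actually launch a program, drop `base={}` so the regular environment is inherited and pass the result to `subprocess.run(["vkcube"], env=...)`.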
Just ran some quick tests. Other than some random GPU hangs, I also get the following error with TW3 and GTA5: ../src/amd/vulkan/radv_descriptor_set.c:496: VK_ERROR_OUT_OF_POOL_MEMORY
It's accompanied by missing/flickering models in GTA5 at least (I didn't test TW3 extensively).
The VK_ERROR_OUT_OF_POOL_MEMORY error happens sometimes for me, I don't think it's an actual issue or related to ACO
I should add it doesn't occur with the same games under regular Mesa LLVM, and I get hundreds of them almost immediately when getting in game.
Were you using a debug build? I think the error still happens with release builds but is only actually printed with debug builds.
If I'm remembering how DXVK handles descriptor pools correctly, this error is expected and DXVK handles it fine.
EDIT: I might not be remembering correctly.
EDIT2: I'm remembering correctly. DXVK allocates a new descriptor pool when the current one is out of memory.
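The pattern described here (treat "out of pool memory" as a signal to create a fresh pool and retry) can be sketched as a small simulation. Everything below is a toy model: `DescriptorPool`, `GrowingAllocator` and `OutOfPoolMemory` are hypothetical stand-ins, not DXVK's code or a Vulkan binding.

```python
class OutOfPoolMemory(Exception):
    """Stands in for VK_ERROR_OUT_OF_POOL_MEMORY in this simulation."""


class DescriptorPool:
    """Toy pool that can hand out a fixed number of descriptor sets."""

    def __init__(self, max_sets):
        self.max_sets = max_sets
        self.allocated = 0

    def allocate(self):
        if self.allocated >= self.max_sets:
            raise OutOfPoolMemory()
        self.allocated += 1
        return self.allocated


class GrowingAllocator:
    """Allocates from the newest pool and adds a pool when that one is full."""

    def __init__(self, sets_per_pool):
        self.sets_per_pool = sets_per_pool
        self.pools = [DescriptorPool(sets_per_pool)]

    def allocate(self):
        try:
            return self.pools[-1].allocate()
        except OutOfPoolMemory:
            # Expected condition: the current pool is exhausted, so create
            # a new pool and retry once.
            self.pools.append(DescriptorPool(self.sets_per_pool))
            return self.pools[-1].allocate()


alloc = GrowingAllocator(sets_per_pool=2)
for _ in range(5):
    alloc.allocate()
print(len(alloc.pools))  # 3 pools are needed for 5 sets at 2 sets per pool
```

In this model the error is raised on every pool rollover, which would be consistent with a debug build printing it many times while the application keeps running normally.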
@aqxa1 @shmerl If you guys still experience hangs or other problems with nongg, then please give us a bit more detail on what the issues are and how to reproduce them.
@pendingchaos You're right, it was just the debug build that caused the error messages.
@Venemo Would you prefer to add issues here, or open separate bugs?
I think separate issues would be best. Just mention in those issues that they are about aco-navi.
Thanks for the hint. Using RADV_DEBUG=nongg,nocache I was able to start TW3, so that hang was related. It still hangs sometime later, but at least it starts. I'll try to capture some output from the other hang later.
Also, now that aco is merged into upstream, do we still need to use the external repo for testing (including for aco-navi)?
For a while, this GitHub repo will include some optimizations that haven't been upstreamed yet. The contents of the aco-navi branch are not upstreamed, so you will need it to test ACO with Navi.
@shmerl I really appreciate your enthusiasm, but please keep in mind that aco-navi is still under heavy development, at this point I'm happy that it can run a bunch of example programs. (When I started, it could not even run the simplest triangle example.) This sort of work is not always as straightforward as it seems. I do plan to install the Witcher 3 on my computer and use that for testing, but only after fixing the issues that I currently know about. So no need to capture it.
Great, thanks for the effort! I'll test it periodically then, to see how the progress goes.
With regards to your question: at some point I'm gonna rebase my branch on upstream (after all the NIR stuff is merged, possibly), but I don't want to send aco-navi upstream until I'm satisfied that at least most of the popular games work well.
Pardon me for being a bit clueless, but I'm currently running a 5700XT and am very interested in helping test aco-navi. I tried the mesa-aco-git package from lcarlier's mesa-git repository, but obviously aco-navi isn't on your master branch yet, and I got a GPU hang on boot. I had to hard power cycle and boot straight into a TTY to fix it, which was very messy. How would I go about installing aco-navi? Is there an Arch/AUR/third-party repo package I can install? I don't really get how meson works, and screwing with my graphics drivers seems a bit terrifying.
@wherron01 The short answer is that it's not there yet. There is no easy way to install it unless you are comfortable compiling mesa on your own. Please give us a couple of weeks more to get it right. :)
The aco-navi branch has gone through a major cleanup and is now rebased on top of latest mesa master (as of yesterday). There are still some issues, but overall it works much better for me. I haven't installed The Witcher 3 yet, but I did some testing with Dota 2.
Thanks! I'll give it a go a bit later and will post results!
Just tested The Witcher 3, it's still hanging after loading a saved game.
Updated tests with revision 51cab8b6990 e641024:
* Mirror's Edge - this was working in older revisions, I had just forgotten to set AMD_DEBUG=nongg
Did you mean RADV_DEBUG=nongg? Because AMD_DEBUG is for radeonsi, which ACO doesn't support
Just a typo in the comment, which I have now fixed, thanks.
Just a quick update on my progress here: I've spent the past few days on getting my patches in good shape so we can upstream them. While there are still issues, it can already run some games like Talos and Dota 2. But because there are issues, the plan is to keep ACO disabled on Navi upstream. I will continue working on it.
Currently I'm installing The Witcher 3.
@Venemo: Did TW3 work for you or is it still hanging?
@shmerl We've fixed a handful of issues, but TW3 still hangs. Trust me, you will hear about it when it works.
Progress update: I'm now at the point where aco-navi can pass the CTS (Vulkan conformance test suite) almost the same as radv with llvm. There are currently two open merge requests: https://gitlab.freedesktop.org/mesa/mesa/merge_requests/2343 which fixes a couple of things that broke a lot of tests, and https://gitlab.freedesktop.org/mesa/mesa/merge_requests/2318 which adds subgroups support. I've also started looking into The Witcher 3 and found a problematic shader. Currently working on fixing that.
@shmerl I believe I've got a fix for The Witcher 3; at least it doesn't hang for me anymore after loading the game. I'd appreciate it if you could give my latest aco-navi branch a try.
Thanks! I'm going to test it shortly.
Due to a bug in GFX10 hardware, s_nop instructions must be added if a branch is at 0x3f. We already do this, but forgot to also update the constant addresses that come after this instruction.
Do these workarounds have any impact on performance?
The 0x3f workaround shouldn't; it's needed infrequently and the s_nop should be very cheap.
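The bookkeeping problem behind that fix (inserting an extra instruction shifts everything after it, so any recorded position of a constant address must be shifted too) can be shown with a toy model. This is a sketch only: instructions are plain strings rather than real encodings, the condition that triggers the workaround is left out, and `insert_code` here is an illustrative helper, not ACO's assembler.

```python
def insert_code(code, recorded_positions, insert_at, new_words):
    """Insert new_words into code at index insert_at.

    recorded_positions holds indices into code that something else refers
    to (for example, where a constant address is stored). Every position at
    or after the insertion point moves by len(new_words); the updated list
    is returned.
    """
    code[insert_at:insert_at] = new_words
    shift = len(new_words)
    return [p + shift if p >= insert_at else p for p in recorded_positions]


# Toy program: index 3 holds a constant address that is patched later.
code = ["v_mov", "s_branch", "s_getpc", "const_addr", "s_endpgm"]
const_positions = [3]
branch_index = 1

# Workaround: place an s_nop right after the branch, then fix up positions.
const_positions = insert_code(code, const_positions, branch_index + 1, ["s_nop"])

print(code)
print(code[const_positions[0]])  # still points at "const_addr"
```

Without the fix-up, `const_positions` would still say 3 and would now point at `s_getpc`, which is the kind of mismatch the comment above describes.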
I built it, but for some reason ACO is not showing up in the version anymore. Do I need to enable it with RADV_PERFTEST explicitly now?
UPDATE: Yep, I had to use RADV_PERFTEST=aco. Retesting now.
Just tested it, and it's still causing a hang. From vulkaninfo for aco-navi build:
driverInfo = Mesa 19.3.0-devel (git-51cab8b699) (LLVM 10.0.0)
The save that's causing the hang: tw3_hang_save.zip
GPU: Sapphire Pulse RX 5700 XT.
I updated my comment with new tests for the current aco-navi revision. Looking good with more games working properly, although I didn't test more than 15-30 minutes for each game.
@aqxa1 can you please make a note in your comment about which of those are native and which are using DXVK or something else?
@shmerl Thanks for trying. I had to upgrade my Witcher 3 but then I could load your savegame without issue. However, then it hanged within a minute as I was looking and walking around in the game. Is this what you've experienced?
@Venemo: It hanged pretty much right away; it doesn't even let me walk around.
Game graphics settings are on max (hairworks off). Ambient occlusion: HBAO+.
@shmerl But does the scene load up before it hangs? Because yesterday what I would see is that it hanged on the loading screen basically. Right now it loads the game and hangs some time after that. In my case, I'm running it on 3840x2160 but with the default settings.
EDIT: If I run it without debugging options, then it indeed hangs much sooner. If I run with RADV_DEBUG=shaders,syncshaders then it lasts a bit longer.
It hanged right away after the loading screen; the scene didn't even show up. I built Mesa without debug though, so maybe that's the reason.
Also, different settings can affect which shaders are used, I suppose, so maybe try max settings (hairworks off) / HBAO+.
@shmerl I added a couple of patches that fix some so-called "hazards" on Navi. I haven't yet fixed every hazard that LLVM knows about, but still a good number of them. I compiled mesa in release mode, ran the Witcher 3 with RADV_PERFTEST=aco, set all graphics settings to ultra (except hairworks like you suggested), and then I set all postprocessing to maximum, and set HBAO+ too. Then I loaded your savegame and was able to walk around in the game for a couple of minutes without a hang. Can you confirm if these latest patches help you at all?
Just built it and I was able to load the save without a hang! I didn't test it for long after that, but so far it looks good! Thanks for working on those.
One thing I noticed, though, is that aco performance looks just a tiny bit lower than llvm in that case:
llvm: (screenshot not preserved)
aco: (screenshot not preserved)
I hope it's not a consequence of those hazard mitigations.
Why are there such "hazards" on Navi in general? Is it due to the need to support backwards compatibility with GCN ISA in RDNA?
This issue is for tracking ACO's progress on Navi.
What works, what doesn't
All shader stages should work. Every Vulkan game should work.
If you find issues, please file a bug in the upstream Mesa bug tracker.
Tested hardware
Not tested with unreleased Navi cards as we don't have those. If you test with hardware that is not on the list yet, please let us know.
How to test
We suggest using the latest stable mesa, where ACO is the default compiler of the RADV Vulkan driver.
ACO has been in mesa since version 19.3, but on old mesa releases the RADV_PERFTEST=aco environment variable was needed.
New hardware features support in Navi 1x
New hardware features support in Navi 2x
Possible optimizations
[ ] use round-robin register allocation to avoid WAR hazards (and help any post-RA scheduling)
[ ] schedule ALU instructions (after RA for easier/faster scheduling?)
[ ] choose registers to avoid bank conflicts (either as a reassignment pass or during RA)
See GCNRegBankReassign.cpp in LLVM
[ ] NGG shader based primitive culling
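To make the round-robin item in the list above concrete, here is a toy comparison of two register-picking policies on a simple dependency chain (each value is read exactly once, by the instruction that writes the next value). It is a sketch under those assumptions, not ACO's register allocator; all names are illustrative.

```python
def first_free(free, last, num_regs):
    """Always pick the lowest-numbered free register."""
    return min(free)


def round_robin(free, last, num_regs):
    """Pick the first free register after the last one handed out, wrapping around."""
    for step in range(1, num_regs + 1):
        candidate = (last + step) % num_regs
        if candidate in free:
            return candidate
    raise RuntimeError("no free register")


def allocate_chain(pick, num_values, num_regs):
    """Assign registers to a chain where value i dies once value i+1 is written."""
    free = set(range(num_regs))
    last = -1
    prev = None
    assignment = []
    for _ in range(num_values):
        reg = pick(free, last, num_regs)
        free.discard(reg)
        assignment.append(reg)
        if prev is not None:
            free.add(prev)  # the previous value was just read for the last time
        prev = reg
        last = reg
    return assignment


print(allocate_chain(first_free, 6, 4))   # [0, 1, 0, 1, 0, 1]
print(allocate_chain(round_robin, 6, 4))  # [0, 1, 2, 3, 0, 1]
```

With the lowest-free policy a register is overwritten by the instruction right after the one that last read it, which is the write-after-read pattern the list item refers to. With round-robin the same register is not reused until four instructions later in this example, which also leaves more freedom for any scheduling done after allocation.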