daniel-schuermann / mesa

Mesa 3D graphics library (mirror; no pull requests here please)
http://mesa3d.org
GFX10 and GFX10.3 (Navi, RDNA) support in ACO #136

Open Venemo opened 4 years ago

Venemo commented 4 years ago

This issue is for tracking ACO's progress on Navi.

What works, what doesn't

All shader stages should work. Every Vulkan game should work.

If you find issues, please file a bug in the upstream Mesa bug tracker.

Tested hardware

Not tested with unreleased Navi cards as we don't have those. If you test with hardware that is not on the list yet, please let us know.

How to test

We suggest using the latest stable mesa, where ACO is the default compiler of the RADV Vulkan driver.

ACO has been in mesa since version 19.3, but on older mesa releases the RADV_PERFTEST=aco environment variable was needed to enable it.

New hardware features support in Navi 1x

New hardware features support in Navi 2x

Possible optimizations

SR-dude commented 4 years ago

So, that's what the ACO developers have been doing for the past month.

shmerl commented 4 years ago

Just tested aco-navi branch with The Witcher 3 in Wine-esync+dxvk (Sapphire Pulse RX 5700 XT). It causes a GPU hang with this in dmesg (computer is accessible remotely through ssh):

[   52.097894] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[   57.207014] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=2248, emitted seq=2251
[   57.207080] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process witcher3.exe pid 2612 thread witcher3.exe pid 2687
[   57.207083] [drm] GPU recovery disabled.

Venemo commented 4 years ago

@shmerl Basically most GPU hangs look like that in dmesg, so that message doesn't bring us closer to finding the problem. Can you try to identify which kind of shader causes the hang? As a first step, can you try disabling CS in ACO? In radv_pipeline.c you can edit radv_aco_supported_stage, just comment out the CS support from there.
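For readers following along, here is a rough sketch of what "commenting out the CS support" could look like. This is not the actual code from radv_pipeline.c: the enum and the `sketch_` names are hypothetical stand-ins, and the real radv_aco_supported_stage may take different parameters and check more conditions. The assumption is that any stage the predicate rejects is compiled by the LLVM backend instead.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for Mesa's shader stage enum. */
enum sketch_stage { SKETCH_VERTEX, SKETCH_FRAGMENT, SKETCH_COMPUTE };

/* Illustrative predicate: which stages are handed to ACO.
 * Assumption: a stage that returns false here uses the LLVM backend. */
static bool sketch_aco_supported_stage(enum sketch_stage stage)
{
    assert(stage <= SKETCH_COMPUTE);
    return stage == SKETCH_VERTEX ||
           stage == SKETCH_FRAGMENT
           /* || stage == SKETCH_COMPUTE */; /* CS temporarily disabled */
}
```

If the hang goes away with compute disabled like this, the problem is likely in a compute shader; if not, the same trick can be repeated for the other stages.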

Venemo commented 4 years ago

Also please compile mesa in debug mode, just to see if it hits any assertions and such.

Venemo commented 4 years ago

Some fixes since yesterday:

shmerl commented 4 years ago

Compiled with debug and captured some output. Without disabling CS:

info:  Game: witcher3.exe
info:  DXVK: v1.3.4
warn:  OpenVR: Failed to locate module
info:  Enabled instance extensions:
info:    VK_KHR_get_physical_device_properties2
info:    VK_KHR_surface
info:    VK_KHR_win32_surface
WARNING: Experimental compiler backend enabled. Here be dragons! Incorrect rendering, GPU hangs and/or resets are likely
WARNING: radv is not a conformant vulkan implementation, testing use only.
info:  AMD RADV/ACO NAVI10 (LLVM 10.0.0):
info:    Driver: 19.2.99
info:    Vulkan: 1.1.107
info:    Memory Heap[0]: 
info:      Size: 7920 MiB
info:      Flags: 0x1
info:      Memory Type[0]: Property Flags = 0x1
info:    Memory Heap[1]: 
info:      Size: 256 MiB
info:      Flags: 0x1
info:      Memory Type[2]: Property Flags = 0x7
info:    Memory Heap[2]: 
info:      Size: 8176 MiB
info:      Flags: 0x0
info:      Memory Type[1]: Property Flags = 0x6
info:      Memory Type[3]: Property Flags = 0xe
info:  D3D11CoreCreateDevice: Probing D3D_FEATURE_LEVEL_11_0
info:  D3D11CoreCreateDevice: Using feature level D3D_FEATURE_LEVEL_11_0
info:  Device properties:
info:    Device name:     : AMD RADV/ACO NAVI10 (LLVM 10.0.0)
info:    Driver version   : 19.2.99
info:  Enabled device extensions:
info:    VK_EXT_conditional_rendering
info:    VK_EXT_depth_clip_enable
info:    VK_EXT_host_query_reset
info:    VK_EXT_memory_priority
info:    VK_EXT_shader_demote_to_helper_invocation
info:    VK_EXT_shader_stencil_export
info:    VK_EXT_shader_viewport_index_layer
info:    VK_EXT_transform_feedback
info:    VK_EXT_vertex_attribute_divisor
info:    VK_KHR_create_renderpass2
info:    VK_KHR_dedicated_allocation
info:    VK_KHR_depth_stencil_resolve
info:    VK_KHR_descriptor_update_template
info:    VK_KHR_draw_indirect_count
info:    VK_KHR_driver_properties
info:    VK_KHR_get_memory_requirements2
info:    VK_KHR_image_format_list
info:    VK_KHR_maintenance1
info:    VK_KHR_maintenance2
info:    VK_KHR_sampler_mirror_clamp_to_edge
info:    VK_KHR_shader_draw_parameters
info:    VK_KHR_swapchain
info:  Device features:
info:    robustBufferAccess                     : 1
info:    fullDrawIndexUint32                    : 1
info:    imageCubeArray                         : 1
info:    independentBlend                       : 1
info:    geometryShader                         : 1
info:    tessellationShader                     : 1
info:    sampleRateShading                      : 1
info:    dualSrcBlend                           : 1
info:    logicOp                                : 1
info:    multiDrawIndirect                      : 1
info:    drawIndirectFirstInstance              : 1
info:    depthClamp                             : 1
info:    depthBiasClamp                         : 1
info:    fillModeNonSolid                       : 1
info:    depthBounds                            : 1
info:    multiViewport                          : 1
info:    samplerAnisotropy                      : 1
info:    textureCompressionBC                   : 1
info:    occlusionQueryPrecise                  : 1
info:    pipelineStatisticsQuery                : 1
info:    vertexPipelineStoresAndAtomics         : 0
info:    fragmentStoresAndAtomics               : 1
info:    shaderImageGatherExtended              : 1
info:    shaderStorageImageExtendedFormats      : 1
info:    shaderStorageImageReadWithoutFormat    : 0
info:    shaderStorageImageWriteWithoutFormat   : 1
info:    shaderClipDistance                     : 1
info:    shaderCullDistance                     : 1
info:    shaderFloat64                          : 1
info:    shaderInt64                            : 1
info:    variableMultisampleRate                : 1
info:  VK_EXT_conditional_rendering
info:    conditionalRendering                   : 1
info:  VK_EXT_depth_clip_enable
info:    depthClipEnable                        : 1
info:  VK_EXT_host_query_reset
info:    hostQueryReset                         : 1
info:  VK_EXT_memory_priority
info:    memoryPriority                         : 1
info:  VK_EXT_shader_demote_to_helper_invocation
info:    shaderDemoteToHelperInvocation         : 1
info:  VK_EXT_transform_feedback
info:    transformFeedback                      : 1
info:    geometryStreams                        : 1
info:  VK_EXT_vertex_attribute_divisor
info:    vertexAttributeInstanceRateDivisor     : 1
info:    vertexAttributeInstanceRateZeroDivisor : 1
info:  Queue families:
info:    Graphics : 0
info:    Transfer : 0
info:  DXVK: Read 471 valid state cache entries
info:  DXVK: Using 16 compiler threads
warn:  DXGI: VK_FORMAT_D24_UNORM_S8_UINT -> VK_FORMAT_D32_SFLOAT_S8_UINT

Hangs after that.

shmerl commented 4 years ago

When disabling CS, it hangs too fast, and capturing the output just produces an empty log for me.

Let me know if you need the game to test it; maybe the developers can provide a key.

aqxa1 commented 4 years ago

All DXVK/D9VK games that I've tested gpu hang on launch for me, so it might not be just a Witcher 3 issue.

Here's the list: GTA4, GTA5, The Witcher 3, Mirror's Edge.

And, assuming I'm commenting out the correct line (stage == MESA_SHADER_COMPUTE), disabling CS has no effect.

I'm also using a custom card (MSI Evoke 5700xt) so maybe that's related.

Venemo commented 4 years ago

Thanks guys for your testing. I haven't tested any DXVK games yet so it's entirely possible that their shaders do something that aco-navi isn't prepared for. I'm currently working on implementing subgroup shuffles, but I promise I'll look into what is going on with DXVK.

aqxa1 commented 4 years ago

@Venemo Actually, even vkcube causes a GPU hang for me.

With the following error:

amdgpu: radv_amdgpu_cs_query_fence_status failed.
amdgpu: The CS has been rejected, see dmesg for more information.
vk: error: failed to submit CS 0

Does that mean I failed to disable ACO's CS, or that Mesa's CS is hanging as well? It doesn't hang with normal LLVM Mesa, for the record.

Venemo commented 4 years ago

@aqxa1 @shmerl Are you guys sure that you disabled NGG during your testing? I think I mentioned in the original post that NGG is not implemented in ACO yet, but just to clarify I now edited the post and added a few env vars to "How to test".

This works for me without hanging:

RADV_DEBUG=nongg,nocache RADV_PERFTEST=aco vkcube

aqxa1 commented 4 years ago

@Venemo That fixes the issue, thanks. I had assumed it just wasn't used rather than it needing to be explicitly disabled.

aqxa1 commented 4 years ago

Just ran some quick tests. Other than some random GPU hangs, I also get the following error with TW3 and GTA5: ../src/amd/vulkan/radv_descriptor_set.c:496: VK_ERROR_OUT_OF_POOL_MEMORY

It's accompanied by missing/flickering models in GTA5 at least (I didn't test TW3 extensively).

pendingchaos commented 4 years ago

The VK_ERROR_OUT_OF_POOL_MEMORY error happens sometimes for me, I don't think it's an actual issue or related to ACO

aqxa1 commented 4 years ago

> The VK_ERROR_OUT_OF_POOL_MEMORY error happens sometimes for me, I don't think it's an actual issue or related to ACO

I should add it doesn't occur with the same games under regular Mesa LLVM, and I get hundreds of them almost immediately when getting in game.

pendingchaos commented 4 years ago

Were you using a debug build? I think the error still happens with release builds but is only actually printed with debug builds.

If I'm remembering how DXVK handles descriptor pools correctly, this error is expected and DXVK handles it fine.

EDIT: I might not be remembering correctly.

EDIT2: I'm remembering correctly. DXVK allocates a new descriptor pool when the current one is out of memory.
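To illustrate why the error is harmless in that scheme, here is a minimal sketch of the "allocate a new pool when the current one is full" pattern. It is a toy model, not DXVK's code and not the Vulkan API: every `sketch_` name is hypothetical, and the pool is just a counter with a fixed capacity.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical result codes, loosely mirroring VK_SUCCESS and
 * VK_ERROR_OUT_OF_POOL_MEMORY. This is not the Vulkan API. */
enum sketch_result { SKETCH_OK, SKETCH_OUT_OF_POOL_MEMORY };

/* Toy descriptor pool holding a fixed number of sets. */
struct sketch_pool {
    int capacity;
    int used;
    struct sketch_pool *prev; /* exhausted pools stay alive until cleanup */
};

static struct sketch_pool *sketch_pool_create(int capacity, struct sketch_pool *prev)
{
    struct sketch_pool *pool = malloc(sizeof(*pool));
    assert(pool != NULL && capacity > 0);
    pool->capacity = capacity;
    pool->used = 0;
    pool->prev = prev;
    return pool;
}

/* Fails with SKETCH_OUT_OF_POOL_MEMORY once the pool is full. */
static enum sketch_result sketch_pool_alloc(struct sketch_pool *pool, int *set_index)
{
    if (pool->used == pool->capacity)
        return SKETCH_OUT_OF_POOL_MEMORY;
    *set_index = pool->used++;
    return SKETCH_OK;
}

/* Allocate a set; on exhaustion, switch to a fresh pool and retry once.
 * Returns how many out-of-pool errors were absorbed (0 or 1). */
static int sketch_alloc_set(struct sketch_pool **current, int *set_index)
{
    if (sketch_pool_alloc(*current, set_index) == SKETCH_OK)
        return 0;
    *current = sketch_pool_create((*current)->capacity, *current);
    enum sketch_result r = sketch_pool_alloc(*current, set_index);
    assert(r == SKETCH_OK);
    (void)r;
    return 1;
}

/* Free the current pool and every exhausted pool behind it. */
static void sketch_pool_destroy_all(struct sketch_pool *pool)
{
    while (pool != NULL) {
        struct sketch_pool *prev = pool->prev;
        free(pool);
        pool = prev;
    }
}
```

In this model the driver-side failure is an expected signal rather than a bug: each time a pool fills up, one error is reported and then absorbed by the retry, which would match seeing many such messages in a debug build while the application keeps running.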

Venemo commented 4 years ago

@aqxa1 @shmerl If you guys still experience hangs or other problems with nongg then please give us a bit more details on what issues there are and how to reproduce those.

aqxa1 commented 4 years ago

@pendingchaos You're right, it was just the debug build that caused the error messages to be printed.

@Venemo Would you prefer to add issues here, or open separate bugs?

Venemo commented 4 years ago

I think separate issues would be best. Just mention in those issues that they are about aco-navi.

shmerl commented 4 years ago

Thanks for the hint. Using RADV_DEBUG=nongg,nocache I was able to start TW3, so that hang was related. It still hangs sometime later, but at least it starts. I'll try to capture some output from the other hang later.

shmerl commented 4 years ago

Also, now that aco is merged into upstream, do we still need to use the external repo for testing (including for aco-navi)?

pendingchaos commented 4 years ago

This github repo will include some optimizations that haven't yet been upstreamed for a while. The contents of the aco-navi branch are not upstreamed, so you will need it to test ACO with Navi.

Venemo commented 4 years ago

@shmerl I really appreciate your enthusiasm, but please keep in mind that aco-navi is still under heavy development. At this point I'm happy that it can run a bunch of example programs. (When I started, it could not even run the simplest triangle example.) This sort of work is not always as straightforward as it seems. I do plan to install The Witcher 3 on my computer and use that for testing, but only after fixing the issues that I currently know about. So no need to capture it.

shmerl commented 4 years ago

Great, thanks for the effort! I'll test it periodically then, to see how the progress goes.

Venemo commented 4 years ago

With regards to your question: at some point I'm gonna rebase my branch on upstream (after all the NIR stuff is merged, possibly), but I don't want to send aco-navi upstream until I'm satisfied that at least most of the popular games work well.

wherron01 commented 4 years ago

Pardon me for being a bit clueless, but I'm currently running a 5700XT and am very interested in helping test aco-navi. I tried the mesa-aco-git package from lcarlier's mesa-git repository, but obviously aco-navi isn't on your master branch yet, and I got a GPU hang on boot. I had to hard power cycle and boot straight into a tty to fix it, which was very messy. How would I go about installing aco-navi? Is there an Arch/AUR/third-party repo package I can install? I don't really get how meson works, and something about screwing with my graphics drivers seems a bit terrifying.

Venemo commented 4 years ago

@wherron01 The short answer is that it's not there yet. There is no easy way to install it unless you are comfortable compiling mesa on your own. Please give us a couple of weeks more to get it right. :)

Venemo commented 4 years ago

The aco-navi branch has gone through a major cleanup and is now rebased on top of latest mesa master (as of yesterday). There are still some issues, but overall it works much better for me. I haven't installed The Witcher 3 yet, but I did some testing with Dota 2.

shmerl commented 4 years ago

Thanks! I'll give it a go a bit later and will post results!

shmerl commented 4 years ago

Just tested The Witcher 3, it's still hanging after loading a saved game.

aqxa1 commented 4 years ago

Updated tests with revision 51cab8b6990 e641024:

Working

Hangs

Glitches

pendingchaos commented 4 years ago

> * Mirror's Edge - this was working in older revisions, I just had forgot to set AMD_DEBUG=nongg

Did you mean RADV_DEBUG=nongg? Because AMD_DEBUG is for radeonsi, which ACO doesn't support

aqxa1 commented 4 years ago

> * Mirror's Edge - this was working in older revisions, I just had forgot to set AMD_DEBUG=nongg
>
> Did you mean RADV_DEBUG=nongg? Because AMD_DEBUG is for radeonsi, which ACO doesn't support

Just a typo in the comment, which I have now fixed, thanks.

Venemo commented 4 years ago

Just a quick update on my progress here: I've spent the past few days on getting my patches in good shape so we can upstream them. While there are still issues, it can already run some games like Talos and Dota 2. But because there are issues, the plan is to keep ACO disabled on Navi upstream. I will continue working on it.

Currently I'm installing The Witcher 3.

shmerl commented 4 years ago

@Venemo: Did TW3 work for you, or is it still hanging?

Venemo commented 4 years ago

@shmerl We've fixed a handful of issues, but TW3 still hangs. Trust me, you will hear about it when it works.

Venemo commented 4 years ago

Progress update: I'm now at the point where aco-navi can pass the CTS (Vulkan conformance test suite) almost the same as radv with llvm. There are currently two open merge requests: https://gitlab.freedesktop.org/mesa/mesa/merge_requests/2343 which fixes a couple of things that broke a lot of tests, and https://gitlab.freedesktop.org/mesa/mesa/merge_requests/2318 which adds subgroups support. I've also started looking into The Witcher 3 and found a problematic shader. Currently working on fixing that.

Venemo commented 4 years ago

@shmerl I believe I've got a fix for The Witcher 3, at least it doesn't hang for me anymore after loading the game. I'd appreciate it if you could give my latest aco-navi branch a try.

shmerl commented 4 years ago

Thanks! I'm going to test it shortly.

> Due to a bug in GFX10 hardware, s_nop instructions must be added if a branch is at 0x3f. We already do this, but forgot to also update the constant addresses that come after this instruction.

Do these workarounds have any impact on performance?

pendingchaos commented 4 years ago

The 0x3f workaround shouldn't, it's needed infrequently and the s_nop should be very cheap
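For anyone curious what such a fix-up involves, below is a rough sketch under stated assumptions; it is not ACO's assembler code. It assumes "at 0x3f" refers to a branch whose encoded offset is 0x3f, that offsets are counted in dwords from the instruction after the branch, and that 0xbf800000 encodes `s_nop 0`. All `sketch_` names are hypothetical. The point it shows is the one from the quoted commit message: once a NOP is inserted, every position recorded after it (branch targets and constant addresses) has to move by one dword too.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_DWORDS 256
#define SKETCH_S_NOP 0xbf800000u /* assumed encoding of "s_nop 0" */

/* Toy model of assembled code with one branch and one constant address. */
struct sketch_code {
    uint32_t dwords[SKETCH_MAX_DWORDS];
    size_t count;          /* dwords in use */
    size_t branch_pos;     /* dword index of the branch */
    size_t branch_target;  /* dword index the branch jumps to */
    size_t const_addr_pos; /* dword index of a recorded constant address */
};

/* Insert one dword at index `at`, shifting the tail up by one. */
static void sketch_insert_dword(struct sketch_code *c, size_t at, uint32_t value)
{
    assert(at <= c->count && c->count < SKETCH_MAX_DWORDS);
    memmove(&c->dwords[at + 1], &c->dwords[at],
            (c->count - at) * sizeof(uint32_t));
    c->dwords[at] = value;
    c->count++;
}

/* Assumption: the offset is relative to the dword after the branch. */
static long sketch_branch_offset(const struct sketch_code *c)
{
    return (long)c->branch_target - (long)c->branch_pos - 1;
}

/* Returns 1 if a NOP had to be inserted, 0 otherwise. */
static int sketch_fix_branch(struct sketch_code *c)
{
    if (sketch_branch_offset(c) != 0x3f)
        return 0;

    size_t at = c->branch_pos + 1;
    sketch_insert_dword(c, at, SKETCH_S_NOP);

    /* Everything recorded at or after the insertion point moved by one. */
    if (c->branch_target >= at)
        c->branch_target++;
    if (c->const_addr_pos >= at)
        c->const_addr_pos++;
    return 1;
}
```

With several branches, shifting the code can push another branch onto the problematic offset, so a real implementation would need to re-check until nothing changes. The cost at runtime is a single extra NOP in the rare case where the offset matches.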

shmerl commented 4 years ago

I built it, but for some reason ACO is not showing up in the version anymore. Do I need to enable it with RADV_PERFTEST explicitly now?

UPDATE:

Yep, I had to use RADV_PERFTEST=aco.

Retesting now.

shmerl commented 4 years ago

Just tested it, and it's still causing a hang. From vulkaninfo for aco-navi build:

driverInfo = Mesa 19.3.0-devel (git-51cab8b699) (LLVM 10.0.0)

Save that's causing the hang: tw3_hang_save.zip

GPU: Sapphire Pulse RX 5700 XT.

aqxa1 commented 4 years ago

I updated my comment with new tests for the current aco-navi revision. Looking good with more games working properly, although I didn't test more than 15-30 minutes for each game.

Venemo commented 4 years ago

@aqxa1 can you please make a note in your comment about which of those are native and which are using DXVK or something else?

Venemo commented 4 years ago

@shmerl Thanks for trying. I had to upgrade my Witcher 3, but then I could load your savegame without issue. However, it then hung within a minute as I was looking and walking around in the game. Is this what you've experienced?

shmerl commented 4 years ago

@Venemo: It hung pretty much right away; it doesn't even let me walk around.

Game graphics settings are on max (hairworks off). Ambient occlusion: HBAO+.

Venemo commented 4 years ago

@shmerl But does the scene load up before it hangs? Yesterday, what I saw was that it basically hung on the loading screen. Right now it loads the game and hangs some time after that. In my case, I'm running it at 3840x2160 but with the default settings.

EDIT: If I run it without debugging options, then it indeed hangs much sooner. If I run with RADV_DEBUG=shaders,syncshaders then it lasts a bit longer.

shmerl commented 4 years ago

It hung right away after the loading screen, the scene didn't even show up. I built Mesa without debug though, so maybe that's the reason.

Also, different settings can affect what shaders are used I suppose, so maybe try max settings (hairworks off) / HBAO+.

Venemo commented 4 years ago

@shmerl I added a couple of patches that fix some so-called "hazards" on Navi. I haven't yet fixed every hazard that LLVM knows about, but still a good number of them. I compiled mesa in release mode, ran the Witcher 3 with RADV_PERFTEST=aco, set all graphics settings to ultra (except hairworks like you suggested), and then I set all postprocessing to maximum, and set HBAO+ too. Then I loaded your savegame and was able to walk around in the game for a couple of minutes without a hang. Can you confirm if these latest patches help you at all?

shmerl commented 4 years ago

Just built it and I was able to load the save without a hang! I didn't test it for long after that, but so far it looks good! Thanks for working on those.

One thing I noticed, though, is that aco performance looks just a tiny bit lower than llvm in that case:

llvm:

tw3_llvm

aco:

tw3_aco

I hope it's not a consequence of those hazard mitigations.

Why are there such "hazards" on Navi in general? Is it due to the need to support backwards compatibility with GCN ISA in RDNA?