GPUOpen-Drivers / AMDVLK

AMD Open Source Driver For Vulkan
MIT License

Dota 2 crashes on a call to vkCreateSwapchainWSI #165

Closed SamHSmith closed 4 years ago

SamHSmith commented 4 years ago

crash_20200614235124_1.zip

Dota 2 crashes when I start it with amdvlk. Radv works but amdvlk is broken. Log attached.

Flakebi commented 4 years ago

Hi, can you provide more information about your system please? Dota 2 works fine for me with radv and amdvlk.

Interesting information would be:

  1. Operating system
  2. Window system: X11 or wayland (Dota 2 uses SDL1, which I think is incompatible with wayland, so it only starts with SDL_VIDEODRIVER=x11)
  3. Driver version
  4. Did you compile amdvlk yourself or do you use the prebuilt package?
  5. Kernel version

Eitetsu0 commented 4 years ago

Similar problem here. Shadow of the Tomb Raider and Rise of the Tomb Raider crashed with signal 6 and the message: vkCreateSwapChainKHR failed: -13. Dota 2 also crashed, but without any message.

I'm using Arch Linux and its official amdvlk package, version 2020.Q2.4-1. The kernel is 5.7.4-zen1-1-zen. I'm using sway, a Wayland WM. SDL_VIDEODRIVER is not set, but the games used to work fine.

Flakebi commented 4 years ago

If I read things correctly, this bug has been fixed and should be in version 2020.Q2.5. Can you try the new version please? I’m also using sway, not on arch though, so no idea why it works for me. I had SDL_VIDEODRIVER=wayland set globally, that explains why I needed to set it explicitly to x11.

Eitetsu0 commented 4 years ago

Thank you for your reply.

I've just tried versions 2020.Q2.5 down to 2020.Q2.2, but the problem is still the same. I also tried some older versions. With 2020.Q2.1 down to 2019.Q4.5 the game still crashed, but the crash message changed to: Game crashed with signal 6: vkCreateSwapchainKHR failed: -1. The game used to work well with some of the old versions mentioned above, though. So I tried changing my kernel from 5.7.5.zen1 to 5.4.48 (the LTS version) and reinstalled Steam, but the problem remains. The game does work with RADV, though with terrible performance.

So I'm really confused. Is this a driver-related problem?

Eitetsu0 commented 4 years ago

vulkaninfo of my system: vulkaninfo.txt

JacobHeAMD commented 4 years ago

Could you please check if DRI3 is enabled? There should be a line like "AMDGPU(0): DRI3 enabled" in Xorg.0.log. Another suspicious point is that both radv and amdvlk are loaded on your system. Could you please try to disable RADV? You just need to rename /usr/share/vulkan/icd.d/radeon_icd.x86_64.json to /usr/share/vulkan/icd.d/radeon_icd.x86_64.json.bak.

Eitetsu0 commented 4 years ago

I'm using Wayland, but yes, there is a line "AMDGPU(0): DRI3 enabled" in Xorg.0.log when I launch openbox. I'm not sure if it's enabled with Xwayland. How can I check that? I tried removing RADV, but amdvlk still has the same problem.

JacobHeAMD commented 4 years ago

Looks like it's a two-GPU system; possibly the rendering device is not the same as the present device. Can you unplug one GPU from your system? BTW, please help debug Dri3WindowSystem::Init a little to find out whether there is any failure. You can try vkcube to see if it has the same issue, and debug with it if so.

Flakebi commented 4 years ago

It is only one GPU. It appears twice because both amdvlk and radv are installed, so GPU 0 means amdvlk and GPU 1 means radv.

JacobHeAMD commented 4 years ago

Oops, yes. My bad, it's listed for different surface types in "Presentable Surfaces". Please help debug and find out where it fails in the code (Pal::Amdgpu::SwapChain::Init).

Eitetsu0 commented 4 years ago

No I have only one GPU. I just removed RADV so vulkaninfo contains less information: vulkaninfo.txt

I'm happy to help, but I know little about Vulkan development. Can you give me more guidance? vkcube just crashes like this: [1] 15626 segmentation fault (core dumped) vkcube

JacobHeAMD commented 4 years ago

  1. Build amdvlk according to https://github.com/GPUOpen-Drivers/AMDVLK#build-instructions
  2. gdb ./vkcube
  3. r
  4. bt

Then post the call stack here. You can also set a breakpoint on Pal::Amdgpu::SwapChain::Init to check whether the result is "Success".

Eitetsu0 commented 4 years ago

Sorry it took so long: bad network conditions and several failed compilations (ran out of memory 😱). Below is the output after step 4. Pal::Amdgpu::SwapChain::Init returned Util::Result::ErrorInvalidPointer.

(gdb) r
Starting program: /usr/bin/vkcube 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
AMD-PAL: Warn: Unconditional Alert | Reason: Failed to get function pointer for: amdgpu_cs_create_sem (/home/z/amdvlk-dev/vulkandriver/drivers/pal/inc/util/palLibrary.h:84:GetFunction)
AMD-PAL: Warn: Unconditional Alert | Reason: Failed to get function pointer for: amdgpu_cs_signal_sem (/home/z/amdvlk-dev/vulkandriver/drivers/pal/inc/util/palLibrary.h:84:GetFunction)
AMD-PAL: Warn: Unconditional Alert | Reason: Failed to get function pointer for: amdgpu_cs_wait_sem (/home/z/amdvlk-dev/vulkandriver/drivers/pal/inc/util/palLibrary.h:84:GetFunction)
AMD-PAL: Warn: Unconditional Alert | Reason: Failed to get function pointer for: amdgpu_cs_export_sem (/home/z/amdvlk-dev/vulkandriver/drivers/pal/inc/util/palLibrary.h:84:GetFunction)
AMD-PAL: Warn: Unconditional Alert | Reason: Failed to get function pointer for: amdgpu_cs_import_sem (/home/z/amdvlk-dev/vulkandriver/drivers/pal/inc/util/palLibrary.h:84:GetFunction)
AMD-PAL: Warn: Unconditional Alert | Reason: Failed to get function pointer for: amdgpu_cs_destroy_sem (/home/z/amdvlk-dev/vulkandriver/drivers/pal/inc/util/palLibrary.h:84:GetFunction)
AMD-PAL: Warn: Unconditional Alert | Reason: Failed to get function pointer for: amdgpu_create_bo_from_phys_mem (/home/z/amdvlk-dev/vulkandriver/drivers/pal/inc/util/palLibrary.h:84:GetFunction)
AMD-PAL: Warn: Unconditional Alert | Reason: Failed to get function pointer for: amdgpu_bo_remap_secure (/home/z/amdvlk-dev/vulkandriver/drivers/pal/inc/util/palLibrary.h:84:GetFunction)
AMD-PAL: Warn: Unconditional Alert | Reason: Failed to get function pointer for: amdgpu_query_private_aperture (/home/z/amdvlk-dev/vulkandriver/drivers/pal/inc/util/palLibrary.h:84:GetFunction)
AMD-PAL: Warn: Unconditional Alert | Reason: Failed to get function pointer for: amdgpu_query_shared_aperture (/home/z/amdvlk-dev/vulkandriver/drivers/pal/inc/util/palLibrary.h:84:GetFunction)
AMD-PAL: Warn: Unconditional Alert | Reason: Failed to get function pointer for: amdgpu_bo_get_phys_address (/home/z/amdvlk-dev/vulkandriver/drivers/pal/inc/util/palLibrary.h:84:GetFunction)
AMD-PAL: Warn: Unconditional Alert | Reason: Failed to get function pointer for: amdgpu_cs_reserved_vmid (/home/z/amdvlk-dev/vulkandriver/drivers/pal/inc/util/palLibrary.h:84:GetFunction)
AMD-PAL: Warn: Unconditional Alert | Reason: Failed to get function pointer for: amdgpu_cs_unreserved_vmid (/home/z/amdvlk-dev/vulkandriver/drivers/pal/inc/util/palLibrary.h:84:GetFunction)
AMD-PAL: Warn: Unconditional Alert | Reason: Failed to get function pointer for: amdgpu_cs_ctx_create3 (/home/z/amdvlk-dev/vulkandriver/drivers/pal/inc/util/palLibrary.h:84:GetFunction)
AMD-PAL: Warn: Unconditional Alert | Reason: Unknown (/home/z/amdvlk-dev/vulkandriver/drivers/pal/src/core/os/amdgpu/amdgpuDevice.cpp:656:EarlyInit)
AMD-PAL: Warn: Unconditional Alert | Reason: Unknown (/home/z/amdvlk-dev/vulkandriver/drivers/pal/src/core/os/amdgpu/amdgpuDevice.cpp:656:EarlyInit)
AMD-PAL: Info: Vulkan error: VK_ERROR_UNKNOWN(-13), from Pal error: Pal::Result::ErrorUnknown(-1) (/home/z/amdvlk-dev/vulkandriver/drivers/xgl/icd/api/vk_conv.cpp:1016:PalToVkError)

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff1b89b19 in vk::SwapChain::GetSwapchainImagesKHR (this=0x0, pCount=0x7fffffffde10, pSwapchainImages=0x0)
    at /home/z/amdvlk-dev/vulkandriver/drivers/xgl/icd/api/vk_swapchain.cpp:700
700         *pCount = m_properties.imageCount;
(gdb) bt
#0  0x00007ffff1b89b19 in vk::SwapChain::GetSwapchainImagesKHR (this=0x0, pCount=0x7fffffffde10, pSwapchainImages=0x0)
    at /home/z/amdvlk-dev/vulkandriver/drivers/xgl/icd/api/vk_swapchain.cpp:700
#1  0x00007ffff1b8cd6a in vk::entry::vkGetSwapchainImagesKHR (device=0x555555a11e90, swapchain=0x0, 
    pSwapchainImageCount=0x7fffffffde10, pSwapchainImages=0x0)
    at /home/z/amdvlk-dev/vulkandriver/drivers/xgl/icd/api/vk_swapchain.cpp:1667
#2  0x000055555555bfcc in ?? ()
#3  0x0000555555558726 in ?? ()
#4  0x00007ffff7c25002 in __libc_start_main () from /usr/lib/libc.so.6
#5  0x000055555555987e in ?? ()
(gdb)

JacobHeAMD commented 4 years ago

This segfault is caused by the failure of vkCreateSwapchainKHR: vkcube goes on to call vkGetSwapchainImagesKHR with a null swapchain (this=0x0 in the backtrace). Could you please trace Pal::Amdgpu::Dri3WindowSystem::Init() step by step to check if there is any failure there?
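As a rough illustration of why a failed create call turns into a segfault one call later, here is a minimal sketch. Every name in it (`sketch_swapchain`, `sketch_create_swapchain`, `sketch_get_image_count`, `sketch_query_images`, the `SKETCH_*` codes) is a hypothetical stand-in, not the Vulkan API or AMDVLK code. It only shows the application-side pattern of checking the result before using the handle.

```c
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; these are NOT the Vulkan types. */
typedef struct sketch_swapchain {
    unsigned image_count;
} sketch_swapchain;

enum {
    SKETCH_SUCCESS           = 0,
    SKETCH_ERROR_UNKNOWN     = -13,   /* mirrors the -13 seen in the log above */
    SKETCH_ERROR_NULL_HANDLE = -1000  /* made-up code for this sketch */
};

/* Mimics a create call that can fail: on failure the handle stays NULL. */
int sketch_create_swapchain(int simulate_failure, sketch_swapchain **out)
{
    static sketch_swapchain sc = { 3 };

    *out = NULL;
    if (simulate_failure)
        return SKETCH_ERROR_UNKNOWN;
    *out = &sc;
    return SKETCH_SUCCESS;
}

/* Defensive query: refuses a null handle instead of dereferencing it. */
int sketch_get_image_count(const sketch_swapchain *sc, unsigned *count)
{
    if (sc == NULL)
        return SKETCH_ERROR_NULL_HANDLE;
    *count = sc->image_count;
    return SKETCH_SUCCESS;
}

/* Application-side pattern: stop as soon as creation fails. */
int sketch_query_images(int simulate_failure, unsigned *count)
{
    sketch_swapchain *sc = NULL;
    int res = sketch_create_swapchain(simulate_failure, &sc);

    if (res != SKETCH_SUCCESS)
        return res;  /* never pass the null handle on */
    return sketch_get_image_count(sc, count);
}
```

Calling `sketch_query_images(1, &n)` returns the creation error and leaves `n` untouched, whereas the backtrace above shows the unchecked path, where the null handle is dereferenced.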

Eitetsu0 commented 4 years ago

No, it returned SUCCESS. But by tracing further I found the failure in pal/src/core/os/amdgpu/dri3/dri3WindowSystem.cpp, line 629. The condition pError != nullptr is true here.

        xcb_generic_error_t*const pError = m_dri3Procs.pfnXcbRequestCheck(m_pConnection, cookie);

        if (pError != nullptr)
        {
            free(pError);

            // On error, the id will be wasted because Xlib/xcb doesn't provide an interface to reclaim the id.
            result = Result::ErrorUnknown;
        }

Here is the call stack :

amdvlk64.so!Pal::Amdgpu::Dri3WindowSystem::CreatePresentableImage(Pal::Amdgpu::Dri3WindowSystem * const this, Pal::Amdgpu::SwapChain * pSwapChain, Pal::Amdgpu::Image * pImage, Pal::int32 sharedBufferFd) (/home/z/amdvlk-dev/vulkandriver/drivers/pal/src/core/os/amdgpu/dri3/dri3WindowSystem.cpp:646)
amdvlk64.so!Pal::Amdgpu::Image::UpdateExternalImageInfo(Pal::Amdgpu::Device * pDevice,  createInfo, Pal::GpuMemory * pGpuMemory, Pal::Image * pImage) (/home/z/amdvlk-dev/vulkandriver/drivers/pal/src/core/os/amdgpu/amdgpuImage.cpp:250)
amdvlk64.so!Pal::Amdgpu::Device::UpdateExternalImageInfo(Pal::Amdgpu::Device * const this,  createInfo, Pal::GpuMemory * pGpuMemory, Pal::Image * pImage) (/home/z/amdvlk-dev/vulkandriver/drivers/pal/src/core/os/amdgpu/amdgpuDevice.cpp:3306)
amdvlk64.so!Pal::Amdgpu::Image::CreatePresentableImage(Pal::Amdgpu::Device * pDevice, const Pal::PresentableImageCreateInfo & createInfo, void * pImagePlacementAddr, void * pGpuMemoryPlacementAddr, Pal::IImage ** ppImage, Pal::IGpuMemory ** ppGpuMemory) (/home/z/amdvlk-dev/vulkandriver/drivers/pal/src/core/os/amdgpu/amdgpuImage.cpp:182)
amdvlk64.so!Pal::Amdgpu::Device::CreatePresentableImage(Pal::Amdgpu::Device * const this,  createInfo, void * pImagePlacementAddr, void * pGpuMemoryPlacementAddr, Pal::IImage ** ppImage, Pal::IGpuMemory ** ppGpuMemory) (/home/z/amdvlk-dev/vulkandriver/drivers/pal/src/core/os/amdgpu/amdgpuDevice.cpp:1899)
amdvlk64.so!vk::Image::CreatePresentableImage(vk::Device * pDevice, const Pal::PresentableImageCreateInfo * pCreateInfo, const VkAllocationCallbacks * pAllocator, VkImageUsageFlags imageUsageFlags, Pal::PresentMode presentMode, VkImage * pImage, VkFormat imageFormat, VkSharingMode sharingMode, uint32_t queueFamilyIndexCount, const uint32_t * pQueueFamilyIndices, VkDeviceMemory * pDeviceMemory) (/home/z/amdvlk-dev/vulkandriver/drivers/xgl/icd/api/vk_image.cpp:1092)
amdvlk64.so!vk::SwapChain::Create(vk::Device * pDevice, const VkSwapchainCreateInfoKHR * pCreateInfo, const VkAllocationCallbacks * pAllocator, VkSwapchainKHR * pSwapChain) (/home/z/amdvlk-dev/vulkandriver/drivers/xgl/icd/api/vk_swapchain.cpp:414)
amdvlk64.so!vk::Device::CreateSwapchain(vk::Device * const this, const VkSwapchainCreateInfoKHR * pCreateInfo, const VkAllocationCallbacks * pAllocator, VkSwapchainKHR * pSwapChain) (/home/z/amdvlk-dev/vulkandriver/drivers/xgl/icd/api/vk_device.cpp:2751)
amdvlk64.so!vk::entry::vkCreateSwapchainKHR(VkDevice device, const VkSwapchainCreateInfoKHR * pCreateInfo, const VkAllocationCallbacks * pAllocator, VkSwapchainKHR * pSwapchain) (/home/z/amdvlk-dev/vulkandriver/drivers/xgl/icd/api/vk_device.cpp:3550)
libvulkan.so.1![Unknown/Just-In-Time compiled code] (Unknown:0)
libc.so.6!__libc_start_main (Unknown:0)
[Unknown/Just-In-Time compiled code] (Unknown:0)

JacobHeAMD commented 4 years ago

Did you install the latest libdrm from a PPA? If so, could you please try uninstalling the PPA libdrm?

Eitetsu0 commented 4 years ago

I have libdrm 2.4.102 installed from the Arch Linux official repo.

I checked the log and tried downgrading it to version 2.4.101 (which I had before June 3rd) and version 2.4.100 (before April 11th; games worked fine with amdvlk at that time), but vkcube still crashes as above.

Eitetsu0 commented 4 years ago

It worked! I just installed openbox and tried vkcube and some games with X11, and they didn't crash. 🤣 So it seems to be Xwayland? But I still don't know why RADV works.

Eitetsu0 commented 4 years ago

Yeah I could try 2.4.92 later. Everything can be too new in Arch Linux official repositories .😂

JacobHeAMD commented 4 years ago

This issue was caught occasionally by our CQE, and it is not seen after removing the PPA. Can you try it with ppa-purge? I thought it was caused by a libdrm mismatch, since it's a failure to create a pixmap from an fd (the buffer needs to be imported from the fd on the server end). But per your test it looks like that's not it. There are some other libraries in this PPA; please try removing it.

Flakebi commented 4 years ago

For reference, I’m using 2.4.100 and didn’t notice problems so far. (Also tried to get arch working in a vm but even vulkaninfo refuses to run with amdvlk, radv works though.)

Eitetsu0 commented 4 years ago

Uh, sorry, I'm not using a deb-based distro. Arch Linux doesn't use PPAs, but it usually provides the latest software versions from its official repository.

And yes I used 2.4.100 before , it used to work fine.

JacobHeAMD commented 4 years ago

@qiaojbao found that the issue is gone after libglx-mesa-dri is restored to the old version. It's possibly a compatibility issue between amdvlk and the new Mesa OpenGL driver. @Eitetsu0, please try an older Mesa.

Eitetsu0 commented 4 years ago

It seems mesa older than 20.0.7-3 works for me. 20.1.1-1 and 20.1.2-1 will cause the issue above.

SamHSmith commented 4 years ago

Woah, I'm happy others have had this issue. I'm running amdvlk-2020.Q2.4_1 and swapchain creation is failing for all vulkan applications I have tested.

Flakebi commented 4 years ago

The issue is there since mesa a3dc7fffbb7be0f1b2ac478b16d3acc5662dff66 – ac/surface: don't compute DCC if it's unsupported by DCN on gfx9+

qiaojbao commented 4 years ago

For the Mesa 20.1 and 20.2 code: in si_texture.c,

    si_check_resource_capability()
    {
        ...
        if (bind & PIPE_BIND_SCANOUT && !tex->surface.is_displayable)
            return false;
        ...
    }

tex->surface.is_displayable is 0, so gbm_dri_bo_import() returns NULL and Xwayland fails to create the image in glamor_pixmap_from_fds().

The surface.is_displayable value is initialized in gfx9_compute_surface() by is_dcc_supported_by_DCN(), but is_dcc_supported_by_DCN() returns false at the beginning of its code:

    static bool is_dcc_supported_by_DCN(const struct radeon_info *info,
                                        const struct ac_surf_config *config,
                                        const struct radeon_surf *surf,
                                        bool rb_aligned, bool pipe_aligned)
    {
        if (!info->use_display_dcc_unaligned &&
            !info->use_display_dcc_with_retile_blit)
            return false;
        ...

because "use_display_dcc_unaligned" and "use_display_dcc_with_retile_blit" are both false for a Navi10 card.

According to the initialization code for these two variables in ac_gpu_info.c:

    if ((info->drm_minor >= 31 &&
         (info->family == CHIP_RAVEN ||
          info->family == CHIP_RAVEN2 ||
          info->family == CHIP_RENOIR)) ||
        (info->drm_minor >= 34 &&
         (info->family == CHIP_NAVI12 ||
          info->family == CHIP_NAVI14))) {
        if (info->num_render_backends == 1)
            info->use_display_dcc_unaligned = true;
        else
            info->use_display_dcc_with_retile_blit = true;
    }

Nothing is done for the Navi10 chip, so both flags stay false.
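To make the gating easier to follow, here is a minimal, self-contained sketch of the same decision. The enum, struct and function names (`sketch_family`, `sketch_info`, `sketch_init_display_dcc`, `sketch_dcn_early_out`) are hypothetical, trimmed-down stand-ins for Mesa's `radeon_info` and chip enum, and the kernel minor version and render-backend counts used with them are arbitrary assumptions. Only the shape of the condition is taken from the snippet quoted above.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; Mesa's real enum and struct are much larger. */
enum sketch_family {
    SKETCH_RAVEN,
    SKETCH_RAVEN2,
    SKETCH_RENOIR,
    SKETCH_NAVI10,
    SKETCH_NAVI12,
    SKETCH_NAVI14
};

struct sketch_info {
    enum sketch_family family;
    unsigned drm_minor;
    unsigned num_render_backends;
    bool use_display_dcc_unaligned;
    bool use_display_dcc_with_retile_blit;
};

/* Same shape as the quoted condition: Navi10 matches neither group. */
void sketch_init_display_dcc(struct sketch_info *info)
{
    info->use_display_dcc_unaligned = false;
    info->use_display_dcc_with_retile_blit = false;

    if ((info->drm_minor >= 31 &&
         (info->family == SKETCH_RAVEN ||
          info->family == SKETCH_RAVEN2 ||
          info->family == SKETCH_RENOIR)) ||
        (info->drm_minor >= 34 &&
         (info->family == SKETCH_NAVI12 ||
          info->family == SKETCH_NAVI14))) {
        if (info->num_render_backends == 1)
            info->use_display_dcc_unaligned = true;
        else
            info->use_display_dcc_with_retile_blit = true;
    }
}

/* Models only the early return at the top of is_dcc_supported_by_DCN(). */
bool sketch_dcn_early_out(const struct sketch_info *info)
{
    return !info->use_display_dcc_unaligned &&
           !info->use_display_dcc_with_retile_blit;
}
```

With a Navi10 entry, `sketch_dcn_early_out()` is true regardless of the kernel minor version, which matches the observation that the check bails out immediately on that card.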

I tried adding CHIP_NAVI10 to the code and it works, but Navi10 does not support display DCC.

    if ((info->drm_minor >= 31 &&
         (info->family == CHIP_RAVEN ||
          info->family == CHIP_RAVEN2 ||
          info->family == CHIP_RENOIR)) ||
        (info->drm_minor >= 34 &&
         (info->family == CHIP_NAVI10 ||
          info->family == CHIP_NAVI12 ||
          info->family == CHIP_NAVI14))) {
        if (info->num_render_backends == 1)
            info->use_display_dcc_unaligned = true;
        else
            info->use_display_dcc_with_retile_blit = true;
    }

qiaojbao commented 4 years ago

Back to function gfx9_compute_surface(),

    if (surf->num_dcc_levels &&
        !is_dcc_supported_by_DCN(info, config, surf,
                     surf->u.gfx9.dcc.rb_aligned,
                     surf->u.gfx9.dcc.pipe_aligned))
        displayable = false;

If Navi10 does not support display DCC, then "surf->num_dcc_levels" needs to be 0 (false). "surf->num_dcc_levels" is initialized to 0 and is set in gfx9_compute_miptree().

    if (info->has_graphics &&
        !(surf->flags & RADEON_SURF_DISABLE_DCC) &&
        !compressed &&
        is_dcc_supported_by_CB(info, in->swizzleMode) &&
        (!in->flags.display ||
         is_dcc_supported_by_DCN(info, config, surf,
                     !in->flags.metaRbUnaligned,
                     !in->flags.metaPipeUnaligned))) {
                ...
        surf->num_dcc_levels = in->numMipLevels;

We noticed that flags.display should be true here, but it gets a false value in gfx9_compute_surface().

    AddrSurfInfoIn.flags.display = get_display_flag(config, surf);

    static bool get_display_flag(const struct ac_surf_config *config,
                                 const struct radeon_surf *surf)
    {
        unsigned num_channels = config->info.num_channels;
        unsigned bpe = surf->bpe;

        if (!config->is_3d &&
            !config->is_cube &&
            !(surf->flags & RADEON_SURF_Z_OR_SBUFFER) &&
            surf->flags & RADEON_SURF_SCANOUT &&
            config->info.samples <= 1 &&
            surf->blk_w <= 2 && surf->blk_h == 1) {
            ...

The reason is that surf->flags does not contain RADEON_SURF_SCANOUT. Go to the surf->flags initialization code in si_init_surface().

    if (is_scanout) {
        /* This should catch bugs in gallium users setting incorrect flags. */
        assert(ptex->nr_samples <= 1 && ptex->array_size == 1 &&
               ptex->depth0 == 1 && ptex->last_level == 0 &&
               !(flags & RADEON_SURF_Z_OR_SBUFFER));

        flags |= RADEON_SURF_SCANOUT;
    }

"is_scanout" is false here, which causes the wrong flags. It comes from si_texture_from_winsys_buffer():

    sscreen->ws->buffer_get_metadata(buf, &metadata);
    si_get_display_metadata(sscreen, &surface, &metadata, &array_mode, &is_scanout);

    si_get_display_metadata()
    {
        ...
        *is_scanout = metadata->u.gfx9.scanout;
        ...
    }
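The chain traced so far (buffer metadata, then is_scanout, then the surface flags, then the display flag) can be condensed into a small sketch. The flag values and function names below (`SKETCH_SURF_*`, `sketch_init_surface_flags`, `sketch_display_flag`, `sketch_display_from_metadata`) are hypothetical and do not match Mesa's definitions. The sketch models only the two flag tests from the quoted condition and leaves out the other checks (3D, cube, samples, block size).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values for illustration; Mesa's real definitions differ. */
#define SKETCH_SURF_SCANOUT      (1u << 0)
#define SKETCH_SURF_Z_OR_SBUFFER (1u << 1)

/* Stand-in for the flag setup in si_init_surface(). */
uint32_t sketch_init_surface_flags(bool is_scanout)
{
    uint32_t flags = 0;

    if (is_scanout)
        flags |= SKETCH_SURF_SCANOUT;
    return flags;
}

/* Models only the two flag tests from the quoted get_display_flag() condition. */
bool sketch_display_flag(uint32_t flags)
{
    return !(flags & SKETCH_SURF_Z_OR_SBUFFER) &&
           (flags & SKETCH_SURF_SCANOUT) != 0;
}

/* Whatever the exporting driver wrote into the metadata decides the outcome. */
bool sketch_display_from_metadata(bool metadata_scanout)
{
    bool is_scanout = metadata_scanout;

    return sketch_display_flag(sketch_init_surface_flags(is_scanout));
}
```

If the metadata carries no scanout bit, `sketch_display_from_metadata(false)` yields false, which is the situation described in this comment.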

So this should be about the metadata that AMDVLK sends. In the radv_amdgpu_winsys_bo_get_metadata() function:

    md->u.gfx9.scanout = AMDGPU_TILING_GET(tiling_flags, SCANOUT);

    #define AMDGPU_TILING_SCANOUT_SHIFT 63
    #define AMDGPU_TILING_SCANOUT_MASK 1
    #define AMDGPU_TILING_GET(value, field) \
        (((__u64)(value) >> AMDGPU_TILING_##field##_SHIFT) & AMDGPU_TILING_##field##_MASK)

Back in the AMDVLK code, in UpdateMetaData(), for >= gfx9 the tiling_flags max value is 33, so the SCANOUT field at bit 63 is never set and Mesa cannot get the correct scanout value. This issue will be fixed in the AMDVLK driver.
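A quick way to see the effect is to apply the quoted macro to a value that has no bit 63 set. The sketch below copies the three macros from the snippet above, swapping `__u64` for `uint64_t` so it builds outside the kernel headers. The helper names `sketch_get_scanout` and `sketch_set_scanout` are made up for this example and are not AMDVLK or Mesa functions.

```c
#include <stdint.h>

/* Macros as quoted above, with uint64_t in place of the kernel's __u64. */
#define AMDGPU_TILING_SCANOUT_SHIFT 63
#define AMDGPU_TILING_SCANOUT_MASK  1
#define AMDGPU_TILING_GET(value, field) \
    (((uint64_t)(value) >> AMDGPU_TILING_##field##_SHIFT) & AMDGPU_TILING_##field##_MASK)

/* Hypothetical helper: what the importing side reads back. */
unsigned sketch_get_scanout(uint64_t tiling_flags)
{
    return (unsigned)AMDGPU_TILING_GET(tiling_flags, SCANOUT);
}

/* Hypothetical helper: what an exporter would need to do to advertise scanout. */
uint64_t sketch_set_scanout(uint64_t tiling_flags)
{
    return tiling_flags |
           ((uint64_t)AMDGPU_TILING_SCANOUT_MASK << AMDGPU_TILING_SCANOUT_SHIFT);
}
```

For a small value such as 33, `sketch_get_scanout(33)` is 0, while `sketch_get_scanout(sketch_set_scanout(33))` is 1 and the low bits are unchanged.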

SamHSmith commented 4 years ago

Nice, does that mean installing a future version of AMDVLK will solve the crashes?

JacobHeAMD commented 4 years ago

Right.

SamHSmith commented 4 years ago

It works on my system now! @JacobHeAMD thanks for helping out.