GPUOpen-Drivers / AMDVLK

AMD Open Source Driver For Vulkan
MIT License

segfault in Pal::Device::HwlEarlyInit on r9 270x #50

Closed Francesco149 closed 3 years ago

Francesco149 commented 6 years ago

the kernel has amdgpu.si_support=1 amdgpu.cik_support=1 and the regular amdgpu driver works fine with vulkan

I have verified that my linux-firmware is up to date (20180825)

backtrace of vulkaninfo with amdvlk built in debug mode

==========
VULKANINFO
==========

Vulkan Instance Version: 1.1.82

AMD-PAL: Error: Unconditional Assert | Reason: Unknown (/home/loli/aur/amdvlk-git/src/pal/src/core/hw/gfxip/gfx6/gfx6Device.cpp:2544:DetermineIpLevel)
AMD-PAL: Warn: Unconditional Alert | Reason: Unknown (/home/loli/aur/amdvlk-git/src/pal/inc/util/palMemTrackerImpl.h:76:~MemTracker)
AMD-PAL: Warn: ================ List of Leaked Blocks ================ (/home/loli/aur/amdvlk-git/src/pal/inc/util/palMemTrackerImpl.h:338:MemoryReport)
AMD-PAL: Warn: ClientMem = 0x0x5555556185b0, AllocSize =     1424, MemBlkType = New, File = /home/loli/aur/amdvlk-git/src/pal/src/core/os/lnx/lnxVamMgr.cpp, LineNumber =      431, AllocNum =        1 (/home/loli/aur/amdvlk-git/src/pal/inc/util/palMemTrackerImpl.h:349:MemoryReport)
AMD-PAL: Warn: ================ End of List =========================== (/home/loli/aur/amdvlk-git/src/pal/inc/util/palMemTrackerImpl.h:352:MemoryReport)
AMD-PAL: Error: Assertion failed: (m_force32BitVaSpace || (usableVaRangeBitLimit >= MinVaRangeNumBits)) && (usableVaRangeBitLimit <= m_chipProperties.gfxip.vaRangeNumBits) | Reason: Unknown (/home/loli/aur/amdvlk-git/src/pal/src/core/device.cpp:917:FixupUsableGpuVirtualAddressRange)
AMD-PAL: Warn: Unconditional Alert | Reason: Unknown (/home/loli/aur/amdvlk-git/src/pal/src/core/os/lnx/lnxDevice.cpp:604:EarlyInit)

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff5375866 in Pal::Device::HwlEarlyInit (
    this=this@entry=0x55555561ffc0)
    at /home/loli/aur/amdvlk-git/src/pal/src/core/device.cpp:485
485         pfnTable.pfnCreateFmaskViewSrds      = &DefaultCreateFmaskViewSrds;
(gdb) bt
#0  0x00007ffff5375866 in Pal::Device::HwlEarlyInit (
    this=this@entry=0x55555561ffc0)
    at /home/loli/aur/amdvlk-git/src/pal/src/core/device.cpp:485
#1  0x00007ffff5375be8 in Pal::Device::EarlyInit (
    this=this@entry=0x55555561ffc0, ipLevels=...)
    at /home/loli/aur/amdvlk-git/src/pal/src/core/device.cpp:423
#2  0x00007ffff53a6a58 in Pal::Linux::Device::EarlyInit (this=0x55555561ffc0, 
    ipLevels=...)
    at /home/loli/aur/amdvlk-git/src/pal/src/core/os/lnx/lnxDevice.cpp:617
#3  0x00007ffff53a623d in Pal::Linux::Device::Create (
    pPlatform=pPlatform@entry=0x5555556170e8, 
    pSettingsPath=pSettingsPath@entry=0x555555617244 "/etc/amd", 
    pBusId=pBusId@entry=0x7fffffffd4f0 "pci:0000:02:00.0", 
    pPrimaryNode=0x55555561fe78 "/dev/dri/card0", 
    pRenderNode=0x55555561fea8 "/dev/dri/renderD128", pciBusInfo=..., 
    deviceIndex=0, ppDeviceOut=0x7fffffffd468)
    at /home/loli/aur/amdvlk-git/src/pal/src/core/os/lnx/lnxDevice.cpp:218
#4  0x00007ffff525f58b in Pal::Linux::Platform::ReQueryDevices (
    this=0x5555556170e8)
    at /home/loli/aur/amdvlk-git/src/pal/src/core/os/lnx/lnxPlatform.cpp:201
#5  0x00007ffff5258409 in Pal::Platform::ReEnumerateDevices (
    this=this@entry=0x5555556170e8)
    at /home/loli/aur/amdvlk-git/src/pal/src/core/platform.cpp:599
#6  0x00007ffff5258e0d in Pal::Platform::Init (this=0x5555556170e8)
    at /home/loli/aur/amdvlk-git/src/pal/src/core/platform.cpp:332
#7  0x00007ffff52580b5 in Pal::Platform::Create (createInfo=..., allocCb=..., 
    pPlacementAddr=<optimized out>, ppPlatform=ppPlatform@entry=0x7fffffffd6e0)
    at /home/loli/aur/amdvlk-git/src/pal/src/core/platform.cpp:165
#8  0x00007ffff5256951 in Pal::CreatePlatform (createInfo=..., 
    pPlacementAddr=<optimized out>, pPlacementAddr@entry=0x555555616020, 
    ppPlatform=ppPlatform@entry=0x555555608e80)
    at /home/loli/aur/amdvlk-git/src/pal/src/core/libInit.cpp:165
#9  0x00007ffff46406f2 in vk::Instance::Init (this=this@entry=0x555555608e80, 
    pAppInfo=pAppInfo@entry=0x7fffffffe440)
    at /home/loli/aur/amdvlk-git/src/xgl/icd/api/vk_instance.cpp:315
#10 0x00007ffff4641195 in vk::Instance::Create (pCreateInfo=<optimized out>, 
    pAllocator=<optimized out>, pInstance=0x5555555a88b8)
    at /home/loli/aur/amdvlk-git/src/xgl/icd/api/vk_instance.cpp:198
#11 0x00007ffff7d5011e in ?? () from /usr/lib/libvulkan.so.1
#12 0x00007ffff7d53d19 in ?? () from /usr/lib/libvulkan.so.1
#13 0x00007ffff7d57dfe in vkCreateInstance () from /usr/lib/libvulkan.so.1
#14 0x0000555555556522 in ?? ()
Zakhrov commented 6 years ago

Kernel 4.18 does not have the new SI firmware paths patched into it AFAIK. You still need either the radeon firmware from AMDGPU-PRO or you need to copy the pitcairn/GFX6 firmware files from /lib/firmware/amdgpu to /lib/firmware/radeon. Linux 4.19 has the new SI paths and so doesn't need any modification
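
The copy step described above could be scripted roughly as follows. This is only a sketch: the helper name `copy_si_firmware` and the `pitcairn_*.bin` file name pattern are assumptions made for illustration, and the default is a dry run that only reports what would be copied.

```python
import shutil
from pathlib import Path


def copy_si_firmware(src_dir, dst_dir, chip="pitcairn", dry_run=True):
    """Plan (and optionally perform) copying <chip>_*.bin from src_dir to dst_dir.

    Hypothetical helper: the "<chip>_*.bin" naming is an assumption about how
    the firmware files are named, not something confirmed in this thread.
    Returns the list of destination paths.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    planned = []
    for fw in sorted(src.glob(f"{chip}_*.bin")):
        target = dst / fw.name
        planned.append(target)
        if not dry_run:
            dst.mkdir(parents=True, exist_ok=True)
            shutil.copy2(fw, target)  # keeps timestamps and permissions
    return planned
```

Calling `copy_si_firmware("/lib/firmware/amdgpu", "/lib/firmware/radeon")` only lists the targets; passing `dry_run=False` performs the copy and would need root privileges on a real system.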

Francesco149 commented 6 years ago

hmm, i just tried both the 4.19-rc2 release and the latest git kernel (rc2 should have the paths fix you mention though) and I still get the segfault but it's slightly different.

if i run vulkaninfo with gdb I get what looks to be the same segfault as before:

Starting program: /usr/bin/env VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/amd_icd64.json vulkaninfo
process 783 is executing new program: /usr/bin/vulkaninfo
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
==========
VULKANINFO
==========

Vulkan Instance Version: 1.1.82

(gdb) AMD-PAL: Warn: Unconditional Alert | Reason: Unknown (/home/loli/aur/amdvlk-git/src/pal/inc/util/palMemTrackerImpl.h:76:~MemTracker)
AMD-PAL: Warn: ================ List of Leaked Blocks ================ (/home/loli/aur/amdvlk-git/src/pal/inc/util/palMemTrackerImpl.h:338:MemoryReport)
AMD-PAL: Warn: ClientMem = 0x0x555555619ec0, AllocSize =     1424, MemBlkType = New, File = /home/loli/aur/amdvlk-git/src/pal/src/core/os/lnx/lnxVamMgr.cpp, LineNumber =      431, AllocNum =        1 (/home/loli/aur/amdvlk-git/src/pal/inc/util/palMemTrackerImpl.h:349:MemoryReport)
AMD-PAL: Warn: ================ End of List =========================== (/home/loli/aur/amdvlk-git/src/pal/inc/util/palMemTrackerImpl.h:352:MemoryReport)
AMD-PAL: Warn: Unconditional Alert | Reason: Unknown (/home/loli/aur/amdvlk-git/src/pal/src/core/os/lnx/lnxDevice.cpp:604:EarlyInit)

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff5372866 in Pal::Device::HwlEarlyInit (
    this=this@entry=0x5555556219a0)
    at /home/loli/aur/amdvlk-git/src/pal/src/core/device.cpp:485
485         pfnTable.pfnCreateFmaskViewSrds      = &DefaultCreateFmaskViewSrds;
(gdb) bt
#0  0x00007ffff5372866 in Pal::Device::HwlEarlyInit (
    this=this@entry=0x5555556219a0)
    at /home/loli/aur/amdvlk-git/src/pal/src/core/device.cpp:485
#1  0x00007ffff5372be8 in Pal::Device::EarlyInit (
    this=this@entry=0x5555556219a0, ipLevels=...)
    at /home/loli/aur/amdvlk-git/src/pal/src/core/device.cpp:423
#2  0x00007ffff53a3a58 in Pal::Linux::Device::EarlyInit (this=0x5555556219a0, 
    ipLevels=...)
    at /home/loli/aur/amdvlk-git/src/pal/src/core/os/lnx/lnxDevice.cpp:617
#3  0x00007ffff53a323d in Pal::Linux::Device::Create (
    pPlatform=pPlatform@entry=0x555555615f38, 
    pSettingsPath=pSettingsPath@entry=0x555555616094 "/etc/amd", 
    pBusId=pBusId@entry=0x7fffffffd4d0 "pci:0000:02:00.0", 
    pPrimaryNode=0x555555621768 "/dev/dri/card0", 
    pRenderNode=0x555555621798 "/dev/dri/renderD128", pciBusInfo=..., 
    deviceIndex=0, ppDeviceOut=0x7fffffffd448)
    at /home/loli/aur/amdvlk-git/src/pal/src/core/os/lnx/lnxDevice.cpp:218
#4  0x00007ffff525c58b in Pal::Linux::Platform::ReQueryDevices (
    this=0x555555615f38)
    at /home/loli/aur/amdvlk-git/src/pal/src/core/os/lnx/lnxPlatform.cpp:201
#5  0x00007ffff5255409 in Pal::Platform::ReEnumerateDevices (
    this=this@entry=0x555555615f38)
    at /home/loli/aur/amdvlk-git/src/pal/src/core/platform.cpp:599
#6  0x00007ffff5255e0d in Pal::Platform::Init (this=0x555555615f38)
    at /home/loli/aur/amdvlk-git/src/pal/src/core/platform.cpp:332
#7  0x00007ffff52550b5 in Pal::Platform::Create (createInfo=..., allocCb=..., 
    pPlacementAddr=<optimized out>, ppPlatform=ppPlatform@entry=0x7fffffffd6c0)
    at /home/loli/aur/amdvlk-git/src/pal/src/core/platform.cpp:165
#8  0x00007ffff5253951 in Pal::CreatePlatform (createInfo=..., 
    pPlacementAddr=<optimized out>, pPlacementAddr@entry=0x555555614e70, 
    ppPlatform=ppPlatform@entry=0x555555607cd0)
    at /home/loli/aur/amdvlk-git/src/pal/src/core/libInit.cpp:165
#9  0x00007ffff463d6f2 in vk::Instance::Init (this=this@entry=0x555555607cd0, 
    pAppInfo=pAppInfo@entry=0x7fffffffe420)
    at /home/loli/aur/amdvlk-git/src/xgl/icd/api/vk_instance.cpp:315
#10 0x00007ffff463e195 in vk::Instance::Create (pCreateInfo=<optimized out>, 
    pAllocator=<optimized out>, pInstance=0x5555555a7708)
    at /home/loli/aur/amdvlk-git/src/xgl/icd/api/vk_instance.cpp:198

if i run it without gdb, around 50% of the time it manages to print some info before segfaulting https://gist.github.com/b4618f0fa6d60b25af8f1306a8be9420

if i try to run vkquake (which works fine on the regular amdgpu driver) I get this segfault:

Starting program: /usr/bin/env VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/amd_icd64.json vkquake -basedir /home/loli/.steam/steam/steamapps/common/Quake
process 1398 is executing new program: /usr/bin/vkquake
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
Command line: vkquake -basedir /home/loli/.steam/steam/steamapps/common/Quake
Found SDL version 2.0.8
Detected 8 CPUs.
Quake 1.09 (c) id Software
GLQuake 1.00 (c) id Software
FitzQuake 0.85 (c) John Fitzgibbons
FitzQuake SDL port (c) SleepwalkR, Baker
QuakeSpasm 0.93.0 (c) Ozkan Sezer, Eric Wasylishen & others
vkQuake 1.00.0 (c) Axel Gneiting & others
Host_Init
Playing registered version.
Console initialized.
UDP Initialized
Server using protocol 666 (FitzQuake)
Exe: 13:31:04 Sep  1 2018
256.0 megabyte heap
[New Thread 0x7fffd9b3d700 (LWP 1402)]
[New Thread 0x7fffd91fb700 (LWP 1403)]
[New Thread 0x7fffd89fa700 (LWP 1404)]
[New Thread 0x7fffcbfff700 (LWP 1405)]
[New Thread 0x7fffcb7fe700 (LWP 1406)]
[New Thread 0x7fffcaffd700 (LWP 1407)]
[New Thread 0x7fffca7fc700 (LWP 1408)]
[New Thread 0x7fffc9ffb700 (LWP 1409)]
[New Thread 0x7fffc97fa700 (LWP 1410)]
[New Thread 0x7fffc8ff9700 (LWP 1411)]
[New Thread 0x7fffabfff700 (LWP 1412)]
[New Thread 0x7fffab7fe700 (LWP 1413)]
[New Thread 0x7fffaaffd700 (LWP 1414)]
[Thread 0x7fffaaffd700 (LWP 1414) exited]
[New Thread 0x7fffaaffd700 (LWP 1415)]
[Thread 0x7fffaaffd700 (LWP 1415) exited]

Vulkan Initialization
AMD-PAL: Warn: Unconditional Alert | Reason: Unknown (/home/loli/aur/amdvlk-git/src/pal/inc/util/palMemTrackerImpl.h:76:~MemTracker)
AMD-PAL: Warn: ================ List of Leaked Blocks ================ (/home/loli/aur/amdvlk-git/src/pal/inc/util/palMemTrackerImpl.h:338:MemoryReport)
AMD-PAL: Warn: ClientMem = 0x0x555557b65930, AllocSize =     1424, MemBlkType = New, File = /home/loli/aur/amdvlk-git/src/pal/src/core/os/lnx/lnxVamMgr.cpp, LineNumber =      431, AllocNum =        1 (/home/loli/aur/amdvlk-git/src/pal/inc/util/palMemTrackerImpl.h:349:MemoryReport)
AMD-PAL: Warn: ================ End of List =========================== (/home/loli/aur/amdvlk-git/src/pal/inc/util/palMemTrackerImpl.h:352:MemoryReport)
AMD-PAL: Warn: Unconditional Alert | Reason: Unknown (/home/loli/aur/amdvlk-git/src/pal/src/core/os/lnx/lnxDevice.cpp:604:EarlyInit)
AMD-PAL: Warn: Unconditional Alert | Reason: Unknown (/home/loli/aur/amdvlk-git/src/pal/src/core/os/lnx/lnxDevice.cpp:604:EarlyInit)
Vendor: AMD
Device: AMD Radeon(TM) HD 8800 Series
Using VK_KHR_DEDICATED_ALLOCATION
Using A2B10G10R10 color buffer format
Using D32 depth buffer format
Creating command buffers

Thread 1 "vkquake" received signal SIGSEGV, Segmentation fault.
0x00007fff99dbe4e9 in Pal::ICmdBuffer::ICmdBuffer (this=<optimized out>)
    at /home/loli/aur/amdvlk-git/src/pal/src/core/cmdBuffer.cpp:104
104     CmdBuffer::CmdBuffer(
(gdb) bt
#0  0x00007fff99dbe4e9 in Pal::ICmdBuffer::ICmdBuffer (this=<optimized out>)
    at /home/loli/aur/amdvlk-git/src/pal/src/core/cmdBuffer.cpp:104
#1  Pal::CmdBuffer::CmdBuffer (this=0x55555b6ffcd0, device=..., createInfo=...)
    at /home/loli/aur/amdvlk-git/src/pal/src/core/cmdBuffer.cpp:124
#2  0x00007fff99e1276b in Pal::GfxCmdBuffer::GfxCmdBuffer (
    this=0x55555b6ffcd0, device=..., createInfo=..., 
    pPrefetchMgr=0x55555b7021c8)
    at /home/loli/aur/amdvlk-git/src/pal/src/core/hw/gfxip/gfxDevice.h:481
#3  0x00007fff99e1dab1 in Pal::UniversalCmdBuffer::UniversalCmdBuffer (
    this=0x55555b6ffcd0, device=..., createInfo=..., 
    pPrefetchMgr=<optimized out>, pDeCmdStream=<optimized out>, 
    pCeCmdStream=0x55555b702858, blendOptEnable=true)
    at /home/loli/aur/amdvlk-git/src/pal/src/core/hw/gfxip/universalCmdBuffer.cpp:46
#4  0x00007fff99ce6619 in Pal::Gfx6::UniversalCmdBuffer::UniversalCmdBuffer (
    this=0x55555b6ffcd0, device=..., createInfo=...)
    at /home/loli/aur/amdvlk-git/src/pal/src/core/hw/gfxip/gfx6/gfx6SettingsLoader.h:52
#5  0x00007fff99cb4fb4 in Pal::Gfx6::Device::CreateCmdBuffer (
    this=0x555557b63880, createInfo=..., pPlacementAddr=<optimized out>, 
    ppCmdBuffer=0x7fffffffe240)
    at /home/loli/aur/amdvlk-git/src/pal/src/core/hw/gfxip/gfx6/gfx6Device.cpp:1253
#6  0x00007fff99dc9a0b in Pal::Device::ConstructCmdBuffer (
    this=0x555557b5bec0, createInfo=..., pPlacementAddr=0x55555b6ffcd0, 
    ppCmdBuffer=ppCmdBuffer@entry=0x7fffffffe2c0)
    at /home/loli/aur/amdvlk-git/src/pal/src/core/device.cpp:2447
#7  0x00007fff99dc9b20 in Pal::Device::CreateCmdBuffer (this=<optimized out>, 
    createInfo=..., pPlacementAddr=<optimized out>, ppCmdBuffer=0x55555b6fda70)
    at /home/loli/aur/amdvlk-git/src/pal/src/core/device.cpp:2478
#8  0x00007fff99041147 in vk::CmdBuffer::Initialize (
    this=this@entry=0x55555b6fda48, pPalMem=pPalMem@entry=0x55555b6ffcd0, 
    pVbMem=pVbMem@entry=0x55555b7031c0, createInfo=...)
    at /home/loli/aur/amdvlk-git/src/pal/inc/util/palInlineFuncs.h:82
#9  0x00007fff9904890a in vk::CmdBuffer::Create (pDevice=0x5555582a5878, 
    pAllocateInfo=<optimized out>, pCommandBuffers=0x5555556446e0)
    at /home/loli/aur/amdvlk-git/src/xgl/icd/api/vk_cmdbuffer.cpp:483
#10 0x00007ffff7df7b95 in vkAllocateCommandBuffers ()
   from /usr/lib/libvulkan.so.1
random2324 commented 6 years ago

I think I have exactly the same problem with another SI card: Radeon 7970 (Tahiti). Radv is working fine. All vulkan apps cause a segmentation fault.

Arch Linux, LLVM 8.0.0 svn, Mesa 18.3 git, Linux-Firmware git, amdvlk-git

FichteFoll commented 6 years ago

4.19 has been released on the Arch repos. I just tested with 4.19.1 and am not experiencing segfaults anymore (previously: https://github.com/mpv-player/mpv/issues/6084; also occurred with minimal testing apps but didn't have debug symbols).

My GPU is a 7950 Boost (tahiti).

jinjianrong commented 6 years ago

@FichteFoll thanks for the update.

Francesco149 commented 5 years ago

vulkaninfo still seems to be crashing for me, but with a different stacktrace. it's a general protection fault in strstr

==========
VULKANINFO
==========

Vulkan Instance Version: 1.1.85

==12369== 
==12369== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==12369==  General Protection Fault
==12369==    at 0x78E17E1: strstr (string.h:324)
==12369==    by 0x78E17E1: Pal::Linux::Device::Create(Pal::Linux::Platform*, char const*, char const*, char const*, char const*, _drmPciBusInfo const&, unsigned int, Pal::Linux::Device**) (lnxDevice.cpp:202)
==12369==    by 0x77944EA: Pal::Linux::Platform::ReQueryDevices() (lnxPlatform.cpp:202)
==12369==    by 0x778C308: Pal::Platform::ReEnumerateDevices() (platform.cpp:640)
==12369==    by 0x778CD02: Pal::Platform::Init() (platform.cpp:340)
==12369==    by 0x778C789: Pal::Platform::Create(Pal::PlatformCreateInfo const&, Util::AllocCallbacks const&, void*, Pal::Platform**) (platform.cpp:169)
==12369==    by 0x778A800: Pal::CreatePlatform(Pal::PlatformCreateInfo const&, void*, Pal::IPlatform**) (libInit.cpp:165)
==12369==    by 0x6B0C6E1: vk::Instance::Init(VkApplicationInfo const*) (vk_instance.cpp:319)
==12369==    by 0x6B0D18F: vk::Instance::Create(VkInstanceCreateInfo const*, VkAllocationCallbacks const*, VkInstance_T**) (vk_instance.cpp:201)
==12369==    by 0x48BDADD: ??? (in /usr/lib/libvulkan.so.1.1.85)
==12369==    by 0x48C16D8: ??? (in /usr/lib/libvulkan.so.1.1.85)
==12369==    by 0x48C57CD: vkCreateInstance (in /usr/lib/libvulkan.so.1.1.85)
==12369==    by 0x10A521: ??? (in /usr/bin/vulkaninfo)

vkquake is also still crashing with what seems to be the same error as before

Using VK_KHR_DEDICATED_ALLOCATION
Using A2B10G10R10 color buffer format
Using D32 depth buffer format
Creating command buffers
==12483== 
==12483== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==12483==  General Protection Fault
==12483==    at 0x28963DC9: ICmdBuffer (palCmdBuffer.h:3114)
==12483==    by 0x28963DC9: Pal::CmdBuffer::CmdBuffer(Pal::Device const&, Pal::CmdBufferCreateInfo const&) (cmdBuffer.cpp:130)
==12483==    by 0x289BFAAA: Pal::GfxCmdBuffer::GfxCmdBuffer(Pal::GfxDevice const&, Pal::CmdBufferCreateInfo const&, Pal::PrefetchMgr*) (gfxCmdBuffer.cpp:69)
==12483==    by 0x289C9C70: Pal::UniversalCmdBuffer::UniversalCmdBuffer(Pal::GfxDevice const&, Pal::CmdBufferCreateInfo const&, Pal::PrefetchMgr*, Pal::GfxCmdStream*, Pal::GfxCmdStream*, bool) (universalCmdBuffer.cpp:60)
==12483==    by 0x28885668: Pal::Gfx6::UniversalCmdBuffer::UniversalCmdBuffer(Pal::Gfx6::Device const&, Pal::CmdBufferCreateInfo const&) (gfx6UniversalCmdBuffer.cpp:222)
==12483==    by 0x288551E3: Pal::Gfx6::Device::CreateCmdBuffer(Pal::CmdBufferCreateInfo const&, void*, Pal::CmdBuffer**) (gfx6Device.cpp:1262)
==12483==    by 0x2896F92A: Pal::Device::ConstructCmdBuffer(Pal::CmdBufferCreateInfo const&, void*, Pal::CmdBuffer**) const (device.cpp:2360)
==12483==    by 0x2896FA3F: Pal::Device::CreateCmdBuffer(Pal::CmdBufferCreateInfo const&, void*, Pal::ICmdBuffer**) (device.cpp:2391)
==12483==    by 0x27B74F2D: vk::CmdBuffer::Initialize(void*, void*, Pal::CmdBufferCreateInfo const&) (vk_cmdbuffer.cpp:534)
==12483==    by 0x27B7C5A1: vk::CmdBuffer::Create(vk::Device*, VkCommandBufferAllocateInfo const*, VkCommandBuffer_T**) (vk_cmdbuffer.cpp:483)
==12483==    by 0x4A1F554: vkAllocateCommandBuffers (in /usr/lib/libvulkan.so.1.1.85)
==12483==    by 0x1284A5: ??? (in /usr/bin/vkquake)
==12483==    by 0x16D6AE: ??? (in /usr/bin/vkquake)

mpv is crashing too

AMD-PAL: Warn: Unconditional Alert | Reason: Unknown (/home/loli/aur/amdvlk-git/src/pal/inc/util/palMemTrackerImpl.h:76:~MemTracker)
AMD-PAL: Warn: ================ List of Leaked Blocks ================ (/home/loli/aur/amdvlk-git/src/pal/inc/util/palMemTrackerImpl.h:338:MemoryReport)
AMD-PAL: Warn: ClientMem = 0x0x1a46f400, AllocSize =     1424, MemBlkType = New, File = /home/loli/aur/amdvlk-git/src/pal/src/core/os/lnx/lnxVamMgr.cpp, LineNumber =      440, AllocNum =        1 (/home/loli/aur/amdvlk-git/src/pal/inc/util/palMemTrackerImpl.h:349:MemoryReport)
AMD-PAL: Warn: ================ End of List =========================== (/home/loli/aur/amdvlk-git/src/pal/inc/util/palMemTrackerImpl.h:352:MemoryReport)
AMD-PAL: Warn: Unconditional Alert | Reason: Unknown (/home/loli/aur/amdvlk-git/src/pal/src/core/os/lnx/lnxDevice.cpp:617:EarlyInit)
AMD-PAL: Warn: Unconditional Alert | Reason: Unknown (/home/loli/aur/amdvlk-git/src/pal/src/core/os/lnx/lnxDevice.cpp:617:EarlyInit)
==12666== 
==12666== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==12666==  General Protection Fault
==12666==    at 0x1C4C3D17: Pal::CmdAllocator::FreeAllChunks() (cmdAllocator.cpp:264)
==12666==    by 0x1C4C4294: Pal::CmdAllocator::~CmdAllocator() (cmdAllocator.cpp:207)
==12666==    by 0x1C4C3C59: Destroy (cmdAllocator.h:59)
==12666==    by 0x1C4C3C59: Pal::CmdAllocator::DestroyInternal() (cmdAllocator.cpp:383)
==12666==    by 0x1C3BC01B: Pal::Device::Cleanup() (device.cpp:349)
==12666==    by 0x1C3E7F07: Pal::Linux::Device::Cleanup() (lnxDevice.cpp:430)
==12666==    by 0x1C292271: Pal::Platform::TearDownDevices() (platform.cpp:302)
==12666==    by 0x1C299E0C: Pal::Linux::Platform::Destroy() (lnxPlatform.cpp:73)
==12666==    by 0x1C38E7BD: Destroy (decorators.h:239)
==12666==    by 0x1C38E7BD: Pal::InterfaceLogger::Platform::Destroy() (interfaceLoggerPlatform.cpp:806)
==12666==    by 0x1B61403F: vk::Instance::Destroy() (vk_instance.cpp:586)
==12666==    by 0x7A26146: ??? (in /usr/lib/libvulkan.so.1.1.85)
==12666==    by 0x7A2FC90: vkDestroyInstance (in /usr/lib/libvulkan.so.1.1.85)
==12666==    by 0x23A446: ??? (in /usr/bin/mpv)

tested on 4.19.2-arch1-1-ARCH, 4.19.1-zen2-2-zen and 4.20.0-rc2-mainline, same result on all these kernels

commits:

Francesco149 commented 5 years ago

i tried debugging this a bit and it seems that the PAL_MALLOC_BASE call in Device::Create does something bad because trying to do anything in the pMemory != nullptr block (even just printing a log message) results in a general protection fault

random2324 commented 5 years ago

The crash still happens with my 7970 as well. Just checked out current master.

amingriyue commented 5 years ago

I tried to reproduce your issue but failed on my platform, with the same version as yours:

testing application: vulkaninfo and cube
cards: Polaris10 and Vega10
ubuntu18.04
kernel: original ubuntu 18.04 kernel 4.15.0 and kernel 4.18.5
llpc: e4edfd4ff45eed666825494c966955963e908ae2
llvm: 678b8d52b91af51de5839f44144701432df30a00
pal: 2e0f13d76846e9623cd84141ab30aaa28560a348
xgl: 4730177e34e414e233cddfbe923ef64b7aac5f83

Also, I tried both the latest master and dev, and everything works well.

I'm not sure what the problem is on your system; amdvlk itself seems to have no problem. In addition, our Jenkins automation runs master as well.

amingriyue commented 5 years ago

Ah, I reproduced your issue when I used a Tahiti card. As Zakhrov mentioned in the second comment, it is indeed a firmware update problem. You not only need to update the kernel to 4.19, but also need to copy the newest firmware to /lib/firmware/radeon. You can get the newest firmware from the amdgpu-pro driver as follows:

  1. Go to https://www.amd.com/en/support/graphics/amd-radeon-hd/amd-radeon-hd-7000-series/amd-radeon-hd-7770-ghz-edition and select Ubuntu x86 64-Bit to download, e.g. amdgpu-pro-18.40-676022-ubuntu-18.04.tar.xz
  2. xz -d amdgpu-pro-18.40-676022-ubuntu-18.04.tar.xz && tar xvf amdgpu-pro-18.40-676022-ubuntu-18.04.tar
  3. cd amdgpu-pro-18.40-676022-ubuntu-18.04
  4. dpkg -x amdgpu-dkms_18.40-676022_all.deb dkms
  5. The firmware files are located under dkms/usr/src/amdgpu-18.40-676022/firmware/amdgpu; copy them to /lib/firmware/amdgpu
amingriyue commented 5 years ago

https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/amdgpu You can also get the right firmware from here; copy the files to /lib/firmware/amdgpu/. You should also use kernel 4.19.

Cheers.

amingriyue commented 5 years ago

@Francesco149, what was the result after updating the firmware? Is there anything I can help with?

Francesco149 commented 5 years ago

i was already using linux-firmware-git, but i copied the firmware from that deb and rebooted and it doesn't seem to help either

JacobHeAMD commented 5 years ago

did you run "update-initramfs -u" after copying the firmware?

Francesco149 commented 5 years ago

hm i re-ran mkinitcpio -p linux just to make sure and rebooted, but same result

is there any way to check for sure if i'm running the correct firmware?

amingriyue commented 5 years ago

Yes, you can run "sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info". The ME feature version should be >= 25; otherwise amdvlk will treat it as an unknown card.

If this debug file does not exist, then I guess the amdgpu kernel driver is not loaded. In that case, could you confirm which driver is loaded in the kernel? Did you add radeon to the blacklist? (Add 'blacklist radeon' to the end of /etc/modprobe.d/blacklist.conf.)
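
The version check described above could be automated along these lines. This is a sketch only: the function names are hypothetical, and the parsing assumes the debugfs file contains a line of the form "ME feature version: N, firmware version: ...".

```python
import re


def me_feature_version(firmware_info):
    """Return the ME feature version found in amdgpu_firmware_info text, or None.

    The pattern requires a space after "ME", so the separate
    "MEC feature version" line is not matched by mistake.
    """
    match = re.search(r"^ME feature version:\s*(\d+)", firmware_info, re.MULTILINE)
    return int(match.group(1)) if match else None


def firmware_new_enough(firmware_info, minimum=25):
    """True if the ME feature version is present and at least `minimum`."""
    version = me_feature_version(firmware_info)
    return version is not None and version >= minimum
```

On a real system the text would come from reading /sys/kernel/debug/dri/0/amdgpu_firmware_info as root; the threshold of 25 is the value quoted in this comment.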

Francesco149 commented 5 years ago

it does look like the correct firmware

$ sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00a47500
ME feature version: 29, firmware version: 0x00000091
PFP feature version: 29, firmware version: 0x00000054
CE feature version: 29, firmware version: 0x0000003d
RLC feature version: 1, firmware version: 0x00000007
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
MEC feature version: 0, firmware version: 0x00000000
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 0, firmware version: 0x00000000
SMC feature version: 0, firmware version: 0x10020000
SDMA0 feature version: 0, firmware version: 0x00000000
SDMA1 feature version: 0, firmware version: 0x00000000
VCN feature version: 0, firmware version: 0x00000000
VBIOS version: 113-1E27100-O48

and yea I'm sure i'm running amdgpu because I've been playing vulkan games on open-source amdgpu driver (that wouldn't work on the regular radeon driver)

amingriyue commented 5 years ago

Yeah, your firmware is correct. Can you paste your dmesg?

Francesco149 commented 5 years ago
$ dmesg | grep amdgpu
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-linux-zen root=UUID=86d4f775-4e78-4a67-8017-d58293bc5e3d rw quiet radeon.si_support=1 radeon.cik_support=0 amdgpu.si_support=1 amdgpu.cik_support=1 amdgpu.gpu_recovery=1
[    0.121632] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-linux-zen root=UUID=86d4f775-4e78-4a67-8017-d58293bc5e3d rw quiet radeon.si_support=1 radeon.cik_support=0 amdgpu.si_support=1 amdgpu.cik_support=1 amdgpu.gpu_recovery=1
[    1.423156] [drm] amdgpu kernel modesetting enabled.
[    1.423724] fb: switching to amdgpudrmfb from EFI VGA
[    1.431023] amdgpu 0000:02:00.0: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)
[    1.431024] amdgpu 0000:02:00.0: GART: 1024M 0x0000000000000000 - 0x000000003FFFFFFF
[    1.431106] [drm] amdgpu: 2048M of VRAM memory ready
[    1.431107] [drm] amdgpu: 3072M of GTT memory ready.
[    1.431725] amdgpu 0000:02:00.0: PCIE GART of 1024M enabled (table at 0x000000F400300000).
[    1.431816] [drm] amdgpu: dpm initialized
[    1.698571] fbcon: amdgpudrmfb (fb0) is primary device
[    1.880876] amdgpu 0000:02:00.0: fb0: amdgpudrmfb frame buffer device
[    2.189511] [drm] Initialized amdgpu 3.27.0 20150101 for 0000:02:00.0 on minor 0
$ dmesg | grep radeon
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-linux-zen root=UUID=86d4f775-4e78-4a67-8017-d58293bc5e3d rw quiet radeon.si_support=1 radeon.cik_support=0 amdgpu.si_support=1 amdgpu.cik_support=1 amdgpu.gpu_recovery=1
[    0.121632] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-linux-zen root=UUID=86d4f775-4e78-4a67-8017-d58293bc5e3d rw quiet radeon.si_support=1 radeon.cik_support=0 amdgpu.si_support=1 amdgpu.cik_support=1 amdgpu.gpu_recovery=1
[    2.207876] [drm] radeon kernel modesetting enabled.

hmm maybe i should try blacklisting radeon entirely after all? i see it's enabling kernel modesetting for both radeon and amdgpu
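
To spot this situation without reading the log by eye, one could scan the dmesg text for the "kernel modesetting enabled" lines. A minimal sketch, assuming the message format shown in the output above; the function name `kms_drivers` is made up for illustration.

```python
import re


def kms_drivers(dmesg_text):
    """Return the sorted names of DRM drivers that reported KMS as enabled."""
    found = re.findall(r"\[drm\] (\w+) kernel modesetting enabled", dmesg_text)
    return sorted(set(found))
```

If the result contains both "amdgpu" and "radeon", both modules were loaded at boot, which is the case being asked about here.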

amingriyue commented 5 years ago

Actually, I'd like to see the full dmesg output, not a grepped one. And yes, please add radeon to the blacklist.

btw, everything works well on my Tahiti card.

random2324 commented 5 years ago

I don't think it has anything to do with the firmware. I just updated firmware-git and removed everything tahiti-related from /lib/firmware/radeon. Still no go.

https://pastebin.com/77dbvaYR

FichteFoll commented 5 years ago

Please wrap long code segments like this in <details></details>. You may need to surround those with blank lines to allow GitHub to parse ``` as code blocks.

random2324 commented 5 years ago

I've tried a few things now, but still no success. I downloaded the amd 18.4 drivers and extracted the firmware into /lib/firmware/radeon again. I noticed that 18.4 has no tahiti firmware under amdgpu, just under radeon, so I wonder if amdvlk just looks for firmware files under /radeon. My /lib/firmware/radeon folder was kind of a mess: I had a lot of files, with both TAHITI and tahiti names. The TAHITI_ files looked really old, so I deleted them and copied new ones over from the driver package, but that didn't solve the problem.

Thread 1 "vkquake" received signal SIGSEGV, Segmentation fault.
0x00007fff9dede440 in ?? () from /usr/lib/amdvlk64.so
(gdb) bt
#0  0x00007fff9dede440 in ?? () from /usr/lib/amdvlk64.so
#1  0x00007fff9deea0cc in ?? () from /usr/lib/amdvlk64.so
#2  0x00007fff9dceee78 in ?? () from /usr/lib/amdvlk64.so
#3  0x00007fff9dee3ad8 in ?? () from /usr/lib/amdvlk64.so
#4  0x00007ffff7dd04fe in ?? () from /usr/lib/libvulkan.so.1
#5  0x00007ffff7dd43e9 in ?? () from /usr/lib/libvulkan.so.1
#6  0x00007ffff7dd851e in vkCreateInstance () from /usr/lib/libvulkan.so.1
#7  0x000055555557490e in ?? ()
#8  0x00005555555b9274 in ?? ()
#9  0x00005555555612fb in main ()

I see that a lot of arch users have similar problems, so I wonder if it is an arch-specific problem. I'm currently recompiling the package without march=native and O1; let's see if it works this time...

Francesco149 commented 5 years ago

@random2324 I recommend building the driver with debug info so you can get useful stack traces when you crash. if you're using the arch aur package you can do that by changing all Release64 to Debug64 and all Release to Debug in the PKGBUILD, then temporarily remove "strip" from options in /etc/makepkg.conf before running makepkg
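
The PKGBUILD edit described above amounts to a plain text substitution. A rough sketch, with a hypothetical function name; it makes no assumption about the PKGBUILD beyond the two strings mentioned in this comment.

```python
def to_debug_build(pkgbuild_text):
    """Rewrite Release64 -> Debug64 and Release -> Debug in PKGBUILD text.

    "Release64" starts with "Release", so a single replacement covers both.
    """
    return pkgbuild_text.replace("Release", "Debug")
```

The "strip" option in /etc/makepkg.conf is a separate step and is not handled by this sketch.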

also i'll post my full dmesg next reboot, sorry for being so slow

random2324 commented 5 years ago

It really did turn out to be a compiler issue. I compiled without march=native (in my case haswell) and it works. So I tested some other archs, and westmere is the last option that works. Sandybridge also segfaults. So the issue seems to be avx.

Francesco149 commented 5 years ago

very interesting, i'll try that later, thanks for testing

Francesco149 commented 5 years ago

I'm on haswell as well by the way so it might be something specific to this arch

random2324 commented 5 years ago

My last test on this: it really is because of avx. My CFLAGS are now: CFLAGS="-O2 -pipe -march=native -mno-avx -fstack-protector-strong -fno-plt"

I guess AMD didn't intend this to be built with march=native anyway. I wonder if AMD will fix this.
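
For anyone scripting their build flags, the workaround can be expressed as a small helper that makes sure -mno-avx ends up last in the flag string. This is a sketch under the assumption that the compiler lets the last of several conflicting -mavx/-mno-avx options win; the function name is hypothetical.

```python
import shlex


def with_avx_disabled(cflags):
    """Return cflags with any -mavx/-mno-avx dropped and -mno-avx appended.

    Appending at the end keeps the flag after -march=native, which is the
    order used in the CFLAGS quoted above.
    """
    tokens = [t for t in shlex.split(cflags) if t not in ("-mavx", "-mno-avx")]
    tokens.append("-mno-avx")
    return shlex.join(tokens)
```

For example, passing "-O2 -pipe -march=native" yields the same string with " -mno-avx" added at the end.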

amingriyue commented 5 years ago

Thanks @random2324 .

Did you add this option (-march=native) yourself? I can't find it by grep.

Could you please make a patch for this compile issue (-mno-avx) and send it for review?

Francesco149 commented 5 years ago

yep it works great with -mno-avx, nice find @random2324

FichteFoll commented 5 years ago

FWIW, I'm running Haswell as well and just building amdvlk-git from the AUR without changes (except when I disabled stripping to debug the segfault).

random2324 commented 5 years ago

> Thanks @random2324 .
>
> Did you add this option (-march=native) yourself? I can't find it by grep.

-march=native probably won't be used by many people. It is described on the arch wiki page https://wiki.archlinux.org/index.php/Makepkg#Building_optimized_binaries but it's not the default. The arch aur PKGBUILD also disables other flags: https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=amdvlk-git

Maybe AMD could fix this?

> Could you please make a patch for this compile issue (-mno-avx) and send it for review?

Well, it seems a little unclear why this happens for some users and not for others. This needs to be investigated.

@FichteFoll Maybe it's because of GCC? I use GCC 8.2.1 20181127. I haven't tested others.

Francesco149 commented 5 years ago

@FichteFoll is probably not using -march=native

FichteFoll commented 5 years ago

Yeah, I'm using -march=x86-64 -mtune=generic, the default. I missed that changing to -march=native was important here.

If you want me to try building with -march=native and then also with -mno-avx, I can do so, since I seem to have the same hardware generations.

amingriyue commented 5 years ago

Out of curiosity, how did you set your CFLAGS/CXX_FLAGS? @random2324

I failed to reproduce your issue today with "cmake -H. -Bbuilds/dbg64 -DCMAKE_C_FLAGS=-march=native -DCMAKE_CXX_FLAGS=-march=native", is that enough?

Francesco149 commented 5 years ago

this is what I had:

-march=native -mtune=native -O3 -pipe -fstack-protector-strong -fno-plt

you can try specifically enabling avx with something like -mavx to trigger the issue

random2324 commented 5 years ago

The issue is still there with latest code drop.

> Out of curiosity, how did you set your CFLAGS/CXX_FLAGS? @random2324

I can't really help you here. I use Arch Linux and its build system usually does the job of setting the cflags for you.

However there is some discussion going on here: https://stackoverflow.com/questions/10085945/set-cflags-and-cxxflags-options-using-cmake Maybe that helps.

amingriyue commented 5 years ago

I suggest adding -mno-avx to CFLAGS/CXX_FLAGS for now, until we fix it.

RarogCmex commented 4 years ago

Hi! Is it fixed now?

justxi commented 4 years ago

@RarogCmex

> Hi! Is it fixed now?

The results from my current debugging efforts say no.

jinjianrong commented 3 years ago

Please create a new issue if anyone still sees this problem.

Flakebi commented 3 years ago

This issue is at last fixed in 2021.Q3.1. Compiling with avx and avx2 support is working now.