buscher closed this issue 4 years ago
I'm aware of this, but I cannot debug this. I don't even know what the dmesg message means exactly and what could possibly cause it, but it's definitely not a null pointer read in dxvk.
Since apitrace isn't going to help here, I would still ask you to run the game with VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_standard_validation set. Right now I have nothing to work with at all.
Hm ok, I will try to make something happen with VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_standard_validation, but please tell me where, how, or in which file I should see some output for this.
In d3d11.log? in dxgi.log? in proton log? on the console/terminal?
So that my attempts will not be in vain.
EDIT: and for the record, I assumed a null pointer read on the gpu, not in dxvk, only that dxvk passes a null pointer to the gpu somehow. But I admit my Vulkan/etc knowledge is very limited here.
It will write to stdout, so capturing console output should work. It will not appear in the DXVK log files, and I don't know whether the Proton log captures it.
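A minimal sketch of capturing that console output, assuming the game can be started from a shell. The helper name run_with_validation and the launch script are hypothetical placeholders, not something from DXVK or Proton:

```shell
#!/bin/sh
# Hypothetical helper: run a command with the validation layer enabled and
# write both stdout and stderr to a single log file.
run_with_validation() {
    log_file=$1
    shift
    VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_standard_validation "$@" > "$log_file" 2>&1
}

# Usage sketch (./launch_game.sh is a placeholder for however the game is started):
# run_with_validation validation.log ./launch_game.sh
```

Redirecting stderr as well is a precaution, since it is not certain that every message ends up on stdout.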
only that dxvk passes a null pointer to the gpu somehow
GPU pointers are hidden behind abstractions, so no, at least not directly.
Just a brief test already shows some errors like:
VUID-VkRenderPassCreateInfo-pDependencies-00837(ERROR / SPEC): msgNum: 0 - Dependency 1 specifies a source stage mask that contains stages not in the GRAPHICS pipeline as used by the source subpass 0. The Vulkan spec states: For any element of pDependencies, if the srcSubpass is not VK_SUBPASS_EXTERNAL, all stage flags included in the srcStageMask member of that dependency must be a pipeline stage supported by the pipeline identified by the pipelineBindPoint member of the source subpass. (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-VkRenderPassCreateInfo-pDependencies-00837)
Objects: 1
[0] 0x0, type: 0, name: (null)
Validation(ERROR): msg_code: 0: [ VUID-VkRenderPassCreateInfo-pDependencies-00837 ] Object: VK_NULL_HANDLE (Type = 0) | Dependency 1 specifies a source stage mask that contains stages not in the GRAPHICS pipeline as used by the source subpass 0. The Vulkan spec states: For any element of pDependencies, if the srcSubpass is not VK_SUBPASS_EXTERNAL, all stage flags included in the srcStageMask member of that dependency must be a pipeline stage supported by the pipeline identified by the pipelineBindPoint member of the source subpass. (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-VkRenderPassCreateInfo-pDependencies-00837)
VUID-VkSubpassDependency-srcAccessMask-00868(ERROR / SPEC): msgNum: 0 - vkCreateRenderPass(): pDependencies[3].srcAccessMask (0xa000540) is not supported by srcStageMask (0x8000). The Vulkan spec states: Any access flag included in srcAccessMask must be supported by one of the pipeline stages in srcStageMask, as specified in the table of supported access types (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-VkSubpassDependency-srcAccessMask-00868)
Objects: 1
[0] 0x0, type: 0, name: (null)
but NOTE the game did not freeze yet! They might not be related to the freeze itself. MonsterHunterWorld_VK_LAYER_LUNARG_standard_validation.log
Tomorrow I will try to keep it running until the game freezes.
Those are harmless and occur in every game because of some transform feedback issue.
Log with game frozen: MonsterHunterWorld_VK_LAYER_LUNARG_standard_validation_long.log
VUID-vkResetDescriptorPool-descriptorPool-00313(ERROR / SPEC): msgNum: 0 - It is invalid to call vkResetDescriptorPool() with descriptor sets in use by a command buffer. The Vulkan spec states: All uses of descriptorPool (via any allocated descriptor sets) must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkResetDescriptorPool-descriptorPool-00313)
Objects: 1
[0] 0xce03, type: 22, name: (null)
Validation(ERROR): msg_code: 0: [ VUID-vkResetDescriptorPool-descriptorPool-00313 ] Object: 0xce03 (Type = 22) | It is invalid to call vkResetDescriptorPool() with descriptor sets in use by a command buffer. The Vulkan spec states: All uses of descriptorPool (via any allocated descriptor sets) must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkResetDescriptorPool-descriptorPool-00313)
I hope this helps. :)
The rest
VUID-vkDestroyFramebuffer-framebuffer-00892(ERROR / SPEC): msgNum: 0 - Cannot call vkDestroyFramebuffer on Framebuffer 0x4bf461f that is currently in use by a command buffer. The Vulkan spec states: All submitted commands that refer to framebuffer must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkDestroyFramebuffer-framebuffer-00892)
Objects: 1
[0] 0x4bf461f, type: 24, name: (null)
Validation(ERROR): msg_code: 0: [ VUID-vkDestroyFramebuffer-framebuffer-00892 ] Object: 0x4bf461f (Type = 24) | Cannot call vkDestroyFramebuffer on Framebuffer 0x4bf461f that is currently in use by a command buffer. The Vulkan spec states: All submitted commands that refer to framebuffer must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkDestroyFramebuffer-framebuffer-00892)
...
VUID-vkDestroyBufferView-bufferView-00936(ERROR / SPEC): msgNum: 0 - Cannot call vkDestroyBufferView on BufferView 0x4bf419a that is currently in use by a command buffer. The Vulkan spec states: All submitted commands that refer to bufferView must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkDestroyBufferView-bufferView-00936)
Objects: 1
[0] 0x4bf419a, type: 13, name: (null)
Validation(ERROR): msg_code: 0: [ VUID-vkDestroyBufferView-bufferView-00936 ] Object: 0x4bf419a (Type = 13) | Cannot call vkDestroyBufferView on BufferView 0x4bf419a that is currently in use by a command buffer. The Vulkan spec states: All submitted commands that refer to bufferView must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkDestroyBufferView-bufferView-00936)
I guess these appear only because I killed the process (kill -9 pid) after it was stuck for over a minute.
Also new, but does not seem critical
UNASSIGNED-CoreValidation-Shader-InterfaceTypeMismatch(ERROR / SPEC): msgNum: 0 - Decoration mismatch on location 30.0: is per-patch in tessellation control shader stage but per-vertex in tessellation evaluation shader stage
Objects: 1
[0] 0xd2e, type: 15, name: (null)
Validation(ERROR): msg_code: 0: [ UNASSIGNED-CoreValidation-Shader-InterfaceTypeMismatch ] Object: 0xd2e (Type = 15) | Decoration mismatch on location 30.0: is per-patch in tessellation control shader stage but per-vertex in tessellation evaluation shader stage
EDIT: new at the end of dxgi.log:
err: DxvkSubmissionQueue: Failed to sync fence: VK_ERROR_DEVICE_LOST
err: DxvkSubmissionQueue: Failed to sync fence: VK_ERROR_DEVICE_LOST
err: DxvkSubmissionQueue: Failed to sync fence: VK_ERROR_DEVICE_LOST
err: DxvkSubmissionQueue: Failed to sync fence: VK_ERROR_DEVICE_LOST
err: DxvkSubmissionQueue: Failed to sync fence: VK_ERROR_DEVICE_LOST
err: DxvkSubmissionQueue: Failed to sync fence: VK_ERROR_DEVICE_LOST
new at the end of d3d11.log:
err: DxvkDevice: Command buffer submission failed: VK_ERROR_DEVICE_LOST
err: DxvkDevice: Command buffer submission failed: VK_ERROR_DEVICE_LOST
err: DxvkDevice: Command buffer submission failed: VK_ERROR_DEVICE_LOST
When do the VK_ERROR_DEVICE_LOST issues start happening? I would suspect that the following is actually caused by those errors, and not causing them:
VUID-vkResetDescriptorPool-descriptorPool-00313(ERROR / SPEC): msgNum: 0 - It is invalid to call vkResetDescriptorPool() with descriptor sets in use by a command buffer. The Vulkan spec states: All uses of descriptorPool (via any allocated descriptor sets) must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkResetDescriptorPool-descriptorPool-00313)
I honestly can't tell, there are no timestamps in these logs. It happens too quickly, so I guess around the same time.
Is there any way to add timestamps, or to prove it? Or do you want me to test something else?
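One possible way to get timestamps without touching the game or DXVK is to prefix each captured line as it is written. This is only a sketch; add_timestamps is a hypothetical helper name, and the stamp records when the line was read from the pipe, which may lag slightly behind when it was emitted:

```shell
#!/bin/sh
# Hypothetical helper: prefix every line read from stdin with the current
# Unix time in seconds.
add_timestamps() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date +%s)" "$line"
    done
}

# Usage sketch (./launch_game.sh is a placeholder):
# ./launch_game.sh 2>&1 | add_timestamps > validation_timestamped.log
```

With both the validation output and the DXVK log lines stamped this way, the order of the first VK_ERROR_DEVICE_LOST and the first vkResetDescriptorPool error could be compared.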
Further observation: I tried
VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_api_dump
but can not get it working.
EDIT: I managed to get it working, but it produces a 1+ GB file just going to the menu and gives me about 1 fps, really unplayable this time.
EDIT2: VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_vktrace + vktrace -o mhw.vktrace produces a huge output file as well, with slightly better performance, but I can not vkreplay this file.
After the freeze, nvidia-smi reported 0% GPU usage for me (I have seen reports where it is stuck at 100%).
But I found another bug, which might taint some of my reports so far.
The game crashes when going to an F1 terminal. Reproduce:
This might invalidate my "after the freeze nvidia-smi reports 0% GPU usage" comment.
But it also could be caused by the new:
or
@doitsujin Happy to help, let me know how one can contribute. I have a 1080 GTX (nvidia 415.25) and of course I am a victim of the same bug. It could also be a driver bug, quite frankly, but I'm not sure.
Is there a way to trace the API and then try to save the trace file for you? Although the file may be gigantic! :)
I also had this bug, and here is my log (with proton 3.16-6-beta) until the next proton release arrives.
In this session I played for a few hours, then went to take a short nap and left the game running (+ "login") to check whether this bug would occur even while doing nothing, and it did. I can't say whether that log can help you, but there it is. If there is anything I can do to give you a better report, I will.
I will try to replicate the bug on the next proton update, whenever it arrives.
my system spec:
inxi -b
System: Host: linux Kernel: 4.20.6-1-default x86_64 bits: 64 Desktop: KDE Plasma 5.14.5
Distro: openSUSE Tumbleweed 20190209
Machine: Type: Desktop Mobo: ASUSTeK model: Z170 PRO GAMING v: Rev X.0x serial: <root required> UEFI: American Megatrends
v: 3805 date: 05/16/2018
CPU: Quad Core: Intel Core i5-6600K type: MCP speed: 4374 MHz min/max: 800/4400 MHz
Graphics: Device-1: NVIDIA GM204 [GeForce GTX 970] driver: nvidia v: 410.93
Display: x11 server: X.Org 1.20.3 driver: nvidia resolution: 1920x1080~60Hz, 1920x1080~60Hz
OpenGL: renderer: GeForce GTX 970/PCIe/SSE2 v: 4.6.0 NVIDIA 410.93
Network: Device-1: Intel Ethernet I219-V driver: e1000e
Info: Processes: 443 Uptime: 02:47:03 up 6 days 3:15, 3 users, load average: 0.70, 1.07, 1.39 Memory: 15.60 GiB
used: 6.94 GiB (44.5%) Shell: bash inxi: 3.0.30
Just came back to the game, I'll see what I can do to produce some logs, can confirm that the same error occurs on the latest stable proton. Nvidia's website recommends cuda-memcheck or cuda-gdb, but I haven't had any luck getting cuda-gdb to attach properly (granted, that's my first attempt at gdb, so I might be missing something there).
I'll see what I can come up with for VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_api_dump and VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_vktrace tomorrow too.
This is a note on the Xid message. As documented here, Xid 31 is a GPU memory page fault, usually caused by an invalid memory access. In this case, it looks like a null pointer read.
For debugging with cuda-gdb, the setup depends on the number of GPUs. Please see this.
It is probably possible to dump the core as instructed here
BTW, I am not an expert in GPU/DXVK... I am still learning from nvidia's manual. Feel free to correct me if I am wrong.
Managed to get cuda-gdb attached to it, fingers crossed!
To get cuda-gdb attached, you'll need to do the following:
Start MHW and get into the game proper. The early menu screens will crash if you attach early
ps aux | grep MonsterHunterWorld.exe # Note the PID of the actual executable
cuda-gdb
# The rest of these inside the cuda-gdb shell
handle SIGUSR1 nostop noprint
handle SIGQUIT nostop noprint
set cuda api_failures stop
attach <mhw pid from above>
continue
At this point the game will run, and hopefully will give us a nice backtrace when the null pointer deref occurs
@Xaenalt Could you please test disabling the nvapi hack and see if things improve? Put "dxgi.nvapiHack = False" in dxvk.conf, and add DXVK_CONFIG_FILE=/path/to/dxvk.conf to the launcher command line.
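A minimal sketch of that setup, assuming a shell is available. The helper name write_dxvk_conf and the paths are hypothetical; the setting and the DXVK_CONFIG_FILE variable are the ones named in the request above:

```shell
#!/bin/sh
# Hypothetical helper: write the single requested setting to the given path.
write_dxvk_conf() {
    printf '%s\n' 'dxgi.nvapiHack = False' > "$1"
}

# Usage sketch (paths and launch script are placeholders):
# write_dxvk_conf "$HOME/dxvk.conf"
# DXVK_CONFIG_FILE="$HOME/dxvk.conf" ./launch_game.sh
#
# As a Steam launch option the equivalent would look like:
# DXVK_CONFIG_FILE=/path/to/dxvk.conf %command%
```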
Will do!
No change with dxgi.nvapiHack = False, crash still occurs.
Regular log from the crash, while I attempt to get the SDK set up: steam-582010.log
Is this a driver bug? Has a bug report been submitted to Nvidia?
We'll know soon I hope. I got the api_trace working, and a 2TB drive to hold the log. Worst case I'm gonna leave it overnight and hope the error is in there
AHA, gotcha! Log uploading now! :D (it'll be pretty big, I'm gzipping it to try to reduce size, but expect a long gunzip) We lose the GPU on frame 13485 with the VK_ERROR_DEVICE_LOST (-4) error. I have a hotkey to kill -9 the game, which kicks in 2 frames later. I gave it a few minutes of wall clock time, since frames were on average taking maybe half a second
33G to 1G, wow, that compressed really well o.O
It's too big to upload to github directly, but I threw it into my gdrive, let me know if you have any issues pulling it: https://drive.google.com/file/d/1SHxowR6NZlSlUC4sm40o6m5WPk89OoXg/view?usp=sharing
I may be wrong, I'm no expert, but it seems like the semaphore at 0x16d4fc10 might be getting overwritten? It looks like some buffer copies target that area too. That might be expected behavior, idk
Hopefully @doitsujin will be able to take a peep at it? :)
Fingers crossed that the error is in plain sight in there :)
Any way we can help track down what's causing the lost device?
Does this still happen with the fixed shader and the latest driver? There's an admittedly small chance that the bad shader caused these hangs in the first place.
There doesn't really seem to be anything wrong apart from it, and these hangs do seem to be specific to Nvidia.
Testing now with that patch you just provided in the other bug (https://github.com/doitsujin/dxvk/issues/930 if anyone else wants to try as well). Should I grab another API trace if it encounters the hang?
@doitsujin I'm sorry to say, it just happened with the version you sent me
It seems to hang less frequently though, in the past 2 tests, it seems like it took a lot longer for the hang to happen. I'm going to keep testing, might just be my imagination
Still get a few early on, I'll try to get the api trace from one. In the meantime, here's a shader dump https://drive.google.com/file/d/19MIcBdoZp6V8PPWstbbQvnJoXwSw3hR6/view?usp=sharing
Just doing some additional testing: with d3d11.zeroWorkgroupMemory = True the hang no longer locks up the entire system. It still occurs though, and the fps takes a big hit.
Just on a hunch, also trying with dxvk.numCompilerThreads = 1 just in case, will report if it crashes with that
I tried it already. It will crash.
Last time I tried using the validation layers I didn't get anything interesting, this time when it froze I received
VUID-vkDestroyCommandPool-commandPool-00041(ERROR / SPEC): msgNum: 0 - Attempt to destroy command pool with command buffer (0x7d41d750) which is in use. The Vulkan spec states: All VkCommandBuffer objects allocated from commandPool must not be in the pending state. (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkDestroyCommandPool-commandPool-00041)
Objects: 1
[0] 0x7d41d750, type: 6, name: NULL
VUID-vkDestroyFence-fence-01120(ERROR / SPEC): msgNum: 0 - Fence 0x23b804 is in use. The Vulkan spec states: All queue submission commands that refer to fence must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkDestroyFence-fence-01120)
Objects: 1
[0] 0x23b804, type: 7, name: NULL
Followed by a flood of
VUID-vkResetDescriptorPool-descriptorPool-00313(ERROR / SPEC): msgNum: 0 - It is invalid to call vkResetDescriptorPool() with descriptor sets in use by a command buffer. The Vulkan spec states: All uses of descriptorPool (via any allocated descriptor sets) must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkResetDescriptorPool-descriptorPool-00313)
Objects: 1
[0] 0xd3, type: 22, name: NULL
Unsure if this is important, but thought I'd report it anyways.
Yep, does indeed still crash with the same issue. I wonder if there's any extra debugging info we can add to track it down further, I put the API dump in an earlier comment, which it looks like didn't have any clues
@rsw0x that happens because of the error, but doesn't cause it. DXVK still doesn't handle DEVICE_LOST errors properly.
Are device lost errors something recoverable? Or moreover, are they something normal that if handled would allow the game to continue?
Device lost might just mean that the driver crashed. Most games don't handle this well. It is mostly benchmarks that do, because they know their users are either trying unstable overclocks or unstable drivers.
I had this problem previously. Just updated to the latest nvidia-drivers ebuild on Gentoo:
x11-drivers/nvidia-drivers-418.43::gentoo was built with the following: USE="X acpi driver gtk3 kms multilib tools -compat -static-libs -uvm -wayland" ABI_X86="32 (64) (-x32)"
Running this kernel:
Linux localhost 4.20.13-gentoo #2 SMP Thu Feb 28 20:13:14 EST 2019 x86_64 Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz GenuineIntel GNU/Linux
Latest Steam proton beta (3.16-7).
nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.43       Driver Version: 418.43       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 00000000:01:00.0  On |                  N/A |
| 37%   59C    P0   130W / 180W |   3164MiB /  8119MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 970     Off  | 00000000:02:00.0 Off |                  N/A |
|  0%   27C    P8    12W / 201W |      1MiB /  4043MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3273      G   /usr/libexec/Xorg                            291MiB |
|    0      3516      G   /usr/bin/gnome-shell                         171MiB |
|    0      4169      G   ...uest-channel-token=17976434344080092270    51MiB |
|    0     19152      G   ...in/.local/share/Steam/ubuntu12_32/steam    33MiB |
|    0     19161      G   ./steamwebhelper                               3MiB |
|    0     20860    C+G   ...ter Hunter World\MonsterHunterWorld.exe  2495MiB |
|    0     22146      G   ...quest-channel-token=1348227650674135017   113MiB |
+-----------------------------------------------------------------------------+
I've had MHW running for an hour and a half now without any freezes. I'm going to let it run overnight and see if it freezes up, just looking out over the ocean.
If it DOES freeze up eventually, I have two nvidia cards I can test with on identical hardware otherwise. This is running on a GTX 1070, but I also have a GTX 970 I was using previously.
The GTX 970 usually froze up within 45 minutes to an hour of starting.
Looks like it froze just after the 4 hour mark overnight. Same Xid dmesg log as buscher above.
Is there anything I can do to help diagnose further?
latest vulkan driver changelog:
Fixed a bug which could cause the compiler to crash in some Vulkan games
https://developer.nvidia.com/vulkan-driver
Any changes?
Not sure if I'm misunderstanding how nvidia does version numbering, or if I already have that update? My nvidia-smi shows I'm running 418.43, whereas the latest version on that page is 418.42.02. I'll switch to that version and test again overnight. Given the ~4 hour run time until it freezes, it pretty much has to be an overnight test for me.
Same result with 418.42.02. Started at 1551910780.835805, froze at 1551931937.174271, just under 6 hours of runtime.
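For reference, the runtime between the two Unix timestamps above can be worked out with plain shell arithmetic. This is a small sketch; elapsed_hms is a hypothetical helper name, and it drops the fractional seconds, so the result can be off by one second:

```shell
#!/bin/sh
# Hypothetical helper: print the time between two Unix timestamps as
# hours, minutes and seconds. Fractional parts are stripped before subtracting.
elapsed_hms() {
    start=${1%.*}
    end=${2%.*}
    elapsed=$((end - start))
    printf '%dh %dm %ds\n' $((elapsed / 3600)) $((elapsed % 3600 / 60)) $((elapsed % 60))
}

elapsed_hms 1551910780.835805 1551931937.174271
# prints: 5h 52m 37s
```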
I have this problem with Battlefield 1. System freezes randomly, sometimes after many hours, and I have to kill X to get the frozen image away. Happens not only in game, but during loading screens and menus too.
Running KDE Neon 18.04 with nvidia-driver-418-418.43, with Wine 4.3 and DXVK 1.0.
My dmesg messages are:
[ 2980.508578] NVRM: GPU at PCI:0000:01:00: GPU-a919130d-9a04-dbf1-19c0-c827155af29b
[ 2980.508582] NVRM: Xid (PCI:0000:01:00): 31, Ch 00000054, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x1_5d1ee000. Fault is of type FAULT_PTE ACCESS_TYPE_READ
Let me know if there's more info I can provide
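To compare reports like this one across systems, the Xid number and the faulting address can be pulled out of the dmesg text. A rough sketch, assuming the message layout shown above; parse_xid is a hypothetical helper name and the pattern is not an official NVIDIA log format definition:

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and print
# "<xid> <fault address>" for every NVRM Xid line that reports an MMU fault.
parse_xid() {
    sed -n 's/.*Xid ([^)]*): \([0-9]*\),.*faulted @ \(0x[0-9a-fA-F_]*\)\..*/\1 \2/p'
}

# Usage sketch:
# dmesg | parse_xid
```

For the line quoted above this would print the Xid 31 together with 0x1_5d1ee000, which differs from the constant 0x0_00000000 address in the original report.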
Something about the Kulve Taroth fight makes it crash much more frequently @_@
Since I updated drivers to 418 and got a new video card (2080 Ti), I have not experienced this infamous crash once (I used to have a 1080 GTX, and the crash was happening every ~1.5 hours on average).
That's interesting, I'm on the 1080ti, with driver 418.43, I wonder if there's some ray tracing function that is being used that avoids this issue
Can you give us an inxi -b to tell us a bit more about the system?
I don't have any real proof one way or another, but this feels like a memory/handle issue. Is memory fragmentation a thing for GPU memory? Could there be 2 GB of VRAM free but no large contiguous chunk to allocate for a given texture/shader/etc., causing that allocation to fail and return null?
This would explain why more VRAM seems to let the game last longer before freezing, even though VRAM usage doesn't actually seem to leak.
Monster Hunter World (with proton) randomly freezes. This usually happens somewhere between 10 minutes and 4 hours in, so after a long, random time period.
As the DXVK_HUD (with memory) was enabled, I could see that at the time of the freeze about 3.9 GB (assuming this is VRAM) of 6 GB were in use.
Most noticeable is the dmesg output:
Xid 31, the addr 0x0_00000000, intr 10000000 and ACCESS_TYPE_READ are always constant.
To me, it looks like a simple nullptr access, as it is always the 0x0 addr, but I don't know how to investigate this problem further. I can not let the game run with VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_standard_validation or apitrace for hours, as this makes it very unplayable.
PROTON_USE_WINED3D=1 just results in a black screen. Toggling "Allow Flipping" (in nvidia-settings) on or off does not change anything.
Please let me know how to make this report more useful, I am out of ideas.
Software information
System information
Log files
(with DXVK_LOG_LEVEL=debug and DXVK_HUD=devinfo,fps,memory)
EDIT: The game overall runs pretty well, just the random freezes are a pretty frustrating problem.
EDIT2: The screen freezes but the game background music is still running.