Monster Hunter World randomly freezes

buscher commented 5 years ago

Monster Hunter World (with Proton) freezes randomly. This usually happens anywhere between 10 minutes and 4 hours into a session, so the interval is long and unpredictable.

Since DXVK_HUD (with memory) was enabled, I could see that around 3.9 GB of 6 GB (I assume this is VRAM) were in use at the time of the freeze.

The most notable thing is the dmesg output:

NVRM: Xid (PCI:0000:09:00): 31, Ch 0000004b, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_4 faulted @ 0x0_00000000. Fault is of type FAULT_PDE ACCESS_TYPE_READ

Xid 31, the addr 0x0_00000000, intr 10000000 and ACCESS_TYPE_READ are always constant.

To me, it looks like a simple nullptr access, as it is always the 0x0 address, but I don't know how to investigate this problem further. I cannot let the game run with VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_standard_validation or apitrace for hours, as either one makes it close to unplayable.
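
To compare Xid reports across freezes without reading dmesg by eye, a small filter can pull out the fields discussed above. This is only a sketch: the function name xid_fields is made up for illustration, and the pattern is derived solely from the one line format quoted in this report, so other Xid message formats may not match.

```shell
# Hypothetical helper (not part of any driver tooling): read kernel log lines
# on stdin and print the Xid number, fault address, fault type and access
# type of each NVRM "MMU Fault" line. Lines that do not match are dropped.
xid_fields() {
  sed -n -E 's/.*NVRM: Xid \([^)]*\): ([0-9]+),.*faulted @ (0x[0-9a-fA-F_]+)\. Fault is of type ([A-Z_]+) ([A-Z_]+).*/xid=\1 addr=\2 fault=\3 access=\4/p'
}

# Usage sketch (reading dmesg may require root on some systems):
#   dmesg | xid_fields
```

If every freeze prints the same addr and fault values, that supports the observation that the fault is constant; a differing address would suggest a different failure.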

PROTON_USE_WINED3D=1 just results in a black screen. Toggling "Allow flipping" (in nvidia-settings) on or off does not change anything.

Please let me know how to make this report more useful, I am out of ideas.

Software information

Monster Hunter World
vsync: off
30fps lock (getting weird input lag otherwise sometimes)
Steam / Proton 3.16-beta5

System information

GPU: Nvidia Geforce 1060gtx 6gb
Driver: nvidia-drivers-415.23
Wine version: Proton 3.16-beta5 (???)
DXVK version: Proton 3.16-beta5 (dxvk 0.93)
Kernel: 4.19.10
Ram: 16gb
CPU: Ryzen 2700X

Log files

(with DXVK_LOG_LEVEL=debug and DXVK_HUD=devinfo,fps,memory)

d3d11.log: MonsterHunterWorld_d3d11.log
dxgi.log: MonsterHunterWorld_dxgi.log

EDIT: The game overall runs pretty well, just the random freezes are a pretty frustrating problem.

EDIT2: The screen freezes but the game background music is still running.

doitsujin commented 5 years ago

I'm aware of this, but I cannot debug this. I don't even know what the dmesg message means exactly and what could possibly cause it, but it's definitely not a null pointer read in dxvk.

Since apitrace isn't going to help here, I would still ask you to run the game with VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_standard_validation set. Right now I have nothing to work with at all.

buscher commented 5 years ago

Hm ok, I will try to make something happen with VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_standard_validation, but please tell me where I should expect to see its output. In d3d11.log? In dxgi.log? In the Proton log? On the console/terminal? That way my attempts will not be in vain.

EDIT: and for the record, I assumed a null pointer read on the gpu, not in dxvk, only that dxvk passes a null pointer to the gpu somehow. But I admit my Vulkan/etc knowledge is very limited here.

doitsujin commented 5 years ago

It will write to stdout, so capturing console output should work. It will not appear in the DXVK log files, and I don't know whether the Proton log captures it.
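
One way to capture that console output is a small wrapper that sets the layer variable and redirects everything the process prints into a file. This is a minimal sketch: the function name run_with_validation and the log path are made up for illustration, and how the game is actually launched (for example through Steam launch options) is left as an assumption.

```shell
# Hypothetical wrapper: run a command with the validation layer enabled and
# send both stdout and stderr to the given log file.
run_with_validation() {
  local log=$1
  shift
  VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_standard_validation "$@" >"$log" 2>&1
}

# Usage sketch; the part after the log path is whatever starts the game:
#   run_with_validation "$HOME/mhw_validation.log" <game launch command>
```

Redirecting stderr as well is a deliberate choice here, so that nothing is lost if some messages do not go to stdout.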

only that dxvk passes a null pointer to the gpu somehow

GPU pointers are hidden behind abstractions, so no, at least not directly.

buscher commented 5 years ago

Just a brief test already shows some errors like:

VUID-VkRenderPassCreateInfo-pDependencies-00837(ERROR / SPEC): msgNum: 0 - Dependency 1 specifies a source stage mask that contains stages not in the GRAPHICS pipeline as used by the source subpass 0. The Vulkan spec states: For any element of pDependencies, if the srcSubpass is not VK_SUBPASS_EXTERNAL, all stage flags included in the srcStageMask member of that dependency must be a pipeline stage supported by the pipeline identified by the pipelineBindPoint member of the source subpass. (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-VkRenderPassCreateInfo-pDependencies-00837)
    Objects: 1
       [0] 0x0, type: 0, name: (null)
Validation(ERROR): msg_code: 0:  [ VUID-VkRenderPassCreateInfo-pDependencies-00837 ] Object: VK_NULL_HANDLE (Type = 0) | Dependency 1 specifies a source stage mask that contains stages not in the GRAPHICS pipeline as used by the source subpass 0. The Vulkan spec states: For any element of pDependencies, if the srcSubpass is not VK_SUBPASS_EXTERNAL, all stage flags included in the srcStageMask member of that dependency must be a pipeline stage supported by the pipeline identified by the pipelineBindPoint member of the source subpass. (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-VkRenderPassCreateInfo-pDependencies-00837)
VUID-VkSubpassDependency-srcAccessMask-00868(ERROR / SPEC): msgNum: 0 - vkCreateRenderPass(): pDependencies[3].srcAccessMask (0xa000540) is not supported by srcStageMask (0x8000). The Vulkan spec states: Any access flag included in srcAccessMask must be supported by one of the pipeline stages in srcStageMask, as specified in the table of supported access types (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-VkSubpassDependency-srcAccessMask-00868)
    Objects: 1
       [0] 0x0, type: 0, name: (null)

Note, however, that the game had not frozen yet, so these might not be related to the freeze itself. MonsterHunterWorld_VK_LAYER_LUNARG_standard_validation.log

Tomorrow I will try to keep it running until the game freezes.

doitsujin commented 5 years ago

Those are harmless and occur in every game because of some transform feedback issue.

buscher commented 5 years ago

Log with game frozen: MonsterHunterWorld_VK_LAYER_LUNARG_standard_validation_long.log

VUID-vkResetDescriptorPool-descriptorPool-00313(ERROR / SPEC): msgNum: 0 - It is invalid to call vkResetDescriptorPool() with descriptor sets in use by a command buffer. The Vulkan spec states: All uses of descriptorPool (via any allocated descriptor sets) must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkResetDescriptorPool-descriptorPool-00313)
    Objects: 1
       [0] 0xce03, type: 22, name: (null)
Validation(ERROR): msg_code: 0:  [ VUID-vkResetDescriptorPool-descriptorPool-00313 ] Object: 0xce03 (Type = 22) | It is invalid to call vkResetDescriptorPool() with descriptor sets in use by a command buffer. The Vulkan spec states: All uses of descriptorPool (via any allocated descriptor sets) must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkResetDescriptorPool-descriptorPool-00313)

I hope this helps. :)

The rest

VUID-vkDestroyFramebuffer-framebuffer-00892(ERROR / SPEC): msgNum: 0 - Cannot call vkDestroyFramebuffer on Framebuffer 0x4bf461f that is currently in use by a command buffer. The Vulkan spec states: All submitted commands that refer to framebuffer must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkDestroyFramebuffer-framebuffer-00892)
    Objects: 1
       [0] 0x4bf461f, type: 24, name: (null)
Validation(ERROR): msg_code: 0:  [ VUID-vkDestroyFramebuffer-framebuffer-00892 ] Object: 0x4bf461f (Type = 24) | Cannot call vkDestroyFramebuffer on Framebuffer 0x4bf461f that is currently in use by a command buffer. The Vulkan spec states: All submitted commands that refer to framebuffer must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkDestroyFramebuffer-framebuffer-00892)

...

VUID-vkDestroyBufferView-bufferView-00936(ERROR / SPEC): msgNum: 0 - Cannot call vkDestroyBufferView on BufferView 0x4bf419a that is currently in use by a command buffer. The Vulkan spec states: All submitted commands that refer to bufferView must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkDestroyBufferView-bufferView-00936)
    Objects: 1
       [0] 0x4bf419a, type: 13, name: (null)
Validation(ERROR): msg_code: 0:  [ VUID-vkDestroyBufferView-bufferView-00936 ] Object: 0x4bf419a (Type = 13) | Cannot call vkDestroyBufferView on BufferView 0x4bf419a that is currently in use by a command buffer. The Vulkan spec states: All submitted commands that refer to bufferView must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkDestroyBufferView-bufferView-00936)

I guess this is only because I killed the process (kill -9 pid) after it had been stuck for more than a minute.

Also new, but it does not seem critical:

UNASSIGNED-CoreValidation-Shader-InterfaceTypeMismatch(ERROR / SPEC): msgNum: 0 - Decoration mismatch on location 30.0: is per-patch in tessellation control shader stage but per-vertex in tessellation evaluation shader stage
    Objects: 1
       [0] 0xd2e, type: 15, name: (null)
Validation(ERROR): msg_code: 0:  [ UNASSIGNED-CoreValidation-Shader-InterfaceTypeMismatch ] Object: 0xd2e (Type = 15) | Decoration mismatch on location 30.0: is per-patch in tessellation control shader stage but per-vertex in tessellation evaluation shader stage

EDIT: new at the end of dxgi.log:

err:   DxvkSubmissionQueue: Failed to sync fence: VK_ERROR_DEVICE_LOST
err:   DxvkSubmissionQueue: Failed to sync fence: VK_ERROR_DEVICE_LOST
err:   DxvkSubmissionQueue: Failed to sync fence: VK_ERROR_DEVICE_LOST
err:   DxvkSubmissionQueue: Failed to sync fence: VK_ERROR_DEVICE_LOST
err:   DxvkSubmissionQueue: Failed to sync fence: VK_ERROR_DEVICE_LOST
err:   DxvkSubmissionQueue: Failed to sync fence: VK_ERROR_DEVICE_LOST

new at the end of d3d11.log:

err:   DxvkDevice: Command buffer submission failed: VK_ERROR_DEVICE_LOST
err:   DxvkDevice: Command buffer submission failed: VK_ERROR_DEVICE_LOST
err:   DxvkDevice: Command buffer submission failed: VK_ERROR_DEVICE_LOST

doitsujin commented 5 years ago

When do the VK_ERROR_DEVICE_LOST issues start happening? I would suspect that the following is actually caused by those errors, and not causing them:

VUID-vkResetDescriptorPool-descriptorPool-00313(ERROR / SPEC): msgNum: 0 - It is invalid to call vkResetDescriptorPool() with descriptor sets in use by a command buffer. The Vulkan spec states: All uses of descriptorPool (via any allocated descriptor sets) must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkResetDescriptorPool-descriptorPool-00313)

buscher commented 5 years ago

I honestly can't tell; there are no timestamps in these logs. It happens too quickly, so I guess around the same time.

Is there any way to add timestamps, or to prove it? Or do you want me to test something else?
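
One generic way to get timestamps, independent of DXVK itself, is to pipe the console output through a filter that prefixes each line with the current time. The sketch below assumes GNU date (for the %N nanoseconds field); the function name add_timestamps is made up for illustration. The timestamp records when the line was read from the pipe, which may lag slightly behind when the game wrote it if the output is buffered.

```shell
# Hypothetical filter: prefix every line read on stdin with the current
# epoch time (seconds.nanoseconds, GNU date assumed).
add_timestamps() {
  while IFS= read -r line; do
    printf '%s %s\n' "$(date +%s.%N)" "$line"
  done
}

# Usage sketch:
#   <game launch command> 2>&1 | add_timestamps > "$HOME/mhw_timestamped.log"
```

With both the validation messages and any device-lost messages passing through the same filter, their relative order becomes visible in a single file.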

buscher commented 5 years ago

Further observation:

DXVK_STATE_CACHE=0 seems to make it worse: I tried it 3 times and it froze within the first 15 minutes. (Take this with a grain of salt; it might be unrelated.)
Using DXVK_HUD, these were the stats when it froze:
- Geforce GTX 1060 6GB
- Driver: 415.23.0
- Vulkan: 1.1.84
- FPS: 30.0
- min: 9.7 max: 57.0
- Queue submissions: 5
- Draw calls: 1310
- Dispatch calls: 149
- Render passes: 127
- Graphics pipelines: 536
- Compute pipelines: 140
- Memory allocated: 2948 MB
- Memory used: 2779 MB
I also tried VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_api_dump but could not get it working. EDIT: I managed to get it working, but it produces a 1+ GB file just getting to the menu and runs at about 1 fps, which is truly unplayable this time. EDIT2: VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_vktrace + vktrace -o mhw.vktrace produces a huge output file as well, with slightly better performance, but I cannot vkreplay this file.
After the freeze, nvidia-smi reported 0% GPU usage for me (I mention this because I have seen reports where it is stuck at 100%).

buscher commented 5 years ago

I also found another bug, which might taint some of my reports so far.

The game crashes when switching to a virtual terminal with Ctrl + Alt + F1. To reproduce:

Open the game
Wait for it to load until you are in the menu
Press "Ctrl + Alt + F1" -> crash in =>0 0x00007f0d31a42e09 in libnvidia-glcore.so.415.25 (+0x11a0e09) (0x00007f0d2c84b4a0)

steam-582010_dump_v5.log

This might invalidate my "after the freeze nvidia-smi reports 0% GPU usage" comment.

But it also could be caused by the new:

DXVK version: Proton 3.16-beta6 (dxvk 0.94)

or

Driver: nvidia-drivers-415.25

Emanem commented 5 years ago

@doitsujin Happy to help, let me know how one can contribute. I have a 1080 GTX (nvidia 415.25) and of course I am a victim of the same bug. Quite frankly, it could also be a driver bug, but I'm not sure.

Is there a way to trace the API and then try to save the trace file for you? Although the file may be gigantic! :)

ahjolinna commented 5 years ago

I also had this bug; here is my log (with Proton 3.16-6-beta) for now, until the next Proton release arrives.

For this one I played for a few hours, then went to take a short nap and left the game running (+ "login") to check whether the bug would occur even when doing nothing... and it did. I can't say whether that log will help you, but there it is. If there is anything I can do to give you a better report, I will.

I will try to replicate the bug on the next Proton update, whenever it arrives.

my system spec:

inxi -b
System:    Host: linux Kernel: 4.20.6-1-default x86_64 bits: 64 Desktop: KDE Plasma 5.14.5 
           Distro: openSUSE Tumbleweed 20190209 
Machine:   Type: Desktop Mobo: ASUSTeK model: Z170 PRO GAMING v: Rev X.0x serial: <root required> UEFI: American Megatrends 
           v: 3805 date: 05/16/2018 
CPU:       Quad Core: Intel Core i5-6600K type: MCP speed: 4374 MHz min/max: 800/4400 MHz 
Graphics:  Device-1: NVIDIA GM204 [GeForce GTX 970] driver: nvidia v: 410.93 
           Display: x11 server: X.Org 1.20.3 driver: nvidia resolution: 1920x1080~60Hz, 1920x1080~60Hz 
           OpenGL: renderer: GeForce GTX 970/PCIe/SSE2 v: 4.6.0 NVIDIA 410.93 
Network:   Device-1: Intel Ethernet I219-V driver: e1000e 
Info:      Processes: 443 Uptime: 02:47:03  up 6 days  3:15,  3 users,  load average: 0.70, 1.07, 1.39 Memory: 15.60 GiB 
           used: 6.94 GiB (44.5%) Shell: bash inxi: 3.0.30

Xaenalt commented 5 years ago

Just came back to the game, I'll see what I can do to produce some logs, can confirm that the same error occurs on the latest stable proton. Nvidia's website recommends cuda-memcheck or cuda-gdb, but I haven't had any luck getting cuda-gdb to attach properly (granted, that's my first attempt at gdb, so I might be missing something there).

I'll see what I can come up with for VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_api_dump and VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_vktrace tomorrow too

ljn917 commented 5 years ago

This is a note on the Xid message. As documented here, Xid 31 is a GPU memory page fault, usually caused by an invalid memory access. In this case, it looks like a null pointer read.

For debugging with cuda-gdb, it depends on the number of gpus... Please see this.

It is probably possible to dump the core as instructed here

BTW, I am not an expert in GPU/DXVK... I am still learning from nvidia's manual. Feel free to correct me if I am wrong.

Xaenalt commented 5 years ago

Managed to get cuda-gdb attached to it, fingers crossed!

To get cuda-gdb attached, you'll need to do the following:

Start MHW and get into the game proper. The early menu screens will crash if you attach early
ps aux | grep MonsterHunterWorld.exe # Note the PID of the actual executable
cuda-gdb
# The rest of these inside the cuda-gdb shell
handle SIGUSR1 nostop noprint
handle SIGQUIT nostop noprint
set cuda api_failures stop
attach <mhw pid from above>
continue

At this point the game will run, and hopefully it will give us a nice backtrace when the null pointer dereference occurs.

ljn917 commented 5 years ago

@Xaenalt Could you please test disabling the nvapi hack and see if things improve? Put "dxgi.nvapiHack = False" in dxvk.conf, and add DXVK_CONFIG_FILE=/path/to/dxvk.conf to the launcher command line.
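
For anyone else repeating this test, the config file can be created with a one-line helper. This is a minimal sketch: the function name write_dxvk_conf and the file location are made up for illustration, while the option line and the DXVK_CONFIG_FILE variable are the ones named above.

```shell
# Hypothetical helper: write a dxvk.conf containing only the option under test
# to the path given as the first argument (an existing file is overwritten).
write_dxvk_conf() {
  printf '%s\n' 'dxgi.nvapiHack = False' >"$1"
}

# Usage sketch:
#   write_dxvk_conf "$HOME/dxvk.conf"
#   DXVK_CONFIG_FILE="$HOME/dxvk.conf" <game launch command>
```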

Xaenalt commented 5 years ago

Will do!

Xaenalt commented 5 years ago

No change with dxgi.nvapiHack = False, crash still occurs

Xaenalt commented 5 years ago

Here is the regular log from the crash, while I attempt to get the SDK set up: steam-582010.log

HK47196 commented 5 years ago

Is this a driver bug? Has a bug report been submitted to Nvidia?

Xaenalt commented 5 years ago

We'll know soon I hope. I got the api_trace working, and a 2TB drive to hold the log. Worst case I'm gonna leave it overnight and hope the error is in there

Xaenalt commented 5 years ago

AHA, gotcha! Log uploading now! :D (it'll be pretty big, I'm gzipping it to try to reduce size, but expect a long gunzip) We lose the GPU on frame 13485 with the VK_ERROR_DEVICE_LOST (-4) error. I have a hotkey to kill -9 the game, which kicks in 2 frames later. I gave it a few minutes of wall clock time, since frames were on average taking maybe half a second

Xaenalt commented 5 years ago

33G to 1G, wow, that compressed really well o.O

It's too big to upload to github directly, but I threw it into my gdrive, let me know if you have any issues pulling it: https://drive.google.com/file/d/1SHxowR6NZlSlUC4sm40o6m5WPk89OoXg/view?usp=sharing

Xaenalt commented 5 years ago

I may be wrong, I'm no expert, but it seems like the semaphore at 0x16d4fc10 might be getting overwritten? It looks like some buffer copies target that area too. That might be expected behavior, idk

Emanem commented 5 years ago

Hopefully @doitsujin will be able to take a peep at it? :)

Xaenalt commented 5 years ago

Fingers crossed that the error is in plain sight in there :)

Xaenalt commented 5 years ago

Any way we can help track down what's causing the lost device?

doitsujin commented 5 years ago

Does this still happen with the fixed shader and the latest driver? There's an admittedly small chance that the bad shader caused these hangs in the first place.

There doesn't really seem to be anything wrong apart from it, and these hangs do seem to be specific to Nvidia.

Xaenalt commented 5 years ago

Testing now with that patch you just provided in the other bug (https://github.com/doitsujin/dxvk/issues/930 if anyone else wants to try as well). Should I grab another API trace if it encounters the hang?

Xaenalt commented 5 years ago

@doitsujin I'm sorry to say, it just happened with the version you sent me

Xaenalt commented 5 years ago

It seems to hang less frequently though; in the past 2 tests, it took a lot longer for the hang to happen. I'm going to keep testing, as it might just be my imagination.

Xaenalt commented 5 years ago

Still get a few early on, I'll try to get the api trace from one. In the meantime, here's a shader dump https://drive.google.com/file/d/19MIcBdoZp6V8PPWstbbQvnJoXwSw3hR6/view?usp=sharing

Xaenalt commented 5 years ago

Just doing some additional testing: with d3d11.zeroWorkgroupMemory = True, the lockup no longer takes down the entire system. It still occurs, though, and the fps takes a big hit.

Xaenalt commented 5 years ago

Just on a hunch, also trying with dxvk.numCompilerThreads = 1 just in case, will report if it crashes with that

ljn917 commented 5 years ago

I tried it already. It will crash.

On Tue, Feb 26, 2019, 18:50 Sean Pryor notifications@github.com wrote:

Just on a hunch, also trying with dxvk.numCompilerThreads = 1 just in case, will report if it crashes with that


HK47196 commented 5 years ago

Last time I tried using the validation layers I didn't get anything interesting. This time, when it froze, I received:

VUID-vkDestroyCommandPool-commandPool-00041(ERROR / SPEC): msgNum: 0 - Attempt to destroy command pool with command buffer (0x7d41d750) which is in use. The Vulkan spec states: All VkCommandBuffer objects allocated from commandPool must not be in the pending state. (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkDestroyCommandPool-commandPool-00041)
    Objects: 1
       [0] 0x7d41d750, type: 6, name: NULL
VUID-vkDestroyFence-fence-01120(ERROR / SPEC): msgNum: 0 - Fence 0x23b804 is in use. The Vulkan spec states: All queue submission commands that refer to fence must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkDestroyFence-fence-01120)
    Objects: 1
       [0] 0x23b804, type: 7, name: NULL

Followed by a flood of

VUID-vkResetDescriptorPool-descriptorPool-00313(ERROR / SPEC): msgNum: 0 - It is invalid to call vkResetDescriptorPool() with descriptor sets in use by a command buffer. The Vulkan spec states: All uses of descriptorPool (via any allocated descriptor sets) must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkResetDescriptorPool-descriptorPool-00313)
    Objects: 1
       [0] 0xd3, type: 22, name: NULL

Unsure if this is important, but thought I'd report it anyways.
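
To tell the first errors apart from the flood that follows, it can help to list each distinct message ID once, with the line where it first appears and how often it repeats. The sketch below is an illustration only: the function name vuid_summary is made up, and it assumes each validation message starts at the beginning of a line with a VUID- or UNASSIGNED- identifier, as in the logs quoted in this thread.

```shell
# Hypothetical helper: summarize a captured validation log. For every distinct
# message ID at the start of a line, print "<first line> <count> <ID>",
# ordered by first appearance.
vuid_summary() {
  awk 'match($0, /^(VUID|UNASSIGNED)-[A-Za-z0-9_-]+/) {
         id = substr($0, RSTART, RLENGTH)
         if (!(id in first)) first[id] = NR
         count[id]++
       }
       END { for (id in count) printf "%d %d %s\n", first[id], count[id], id }' "$1" | sort -n
}

# Usage sketch (the log path is whatever file the console output was saved to):
#   vuid_summary "$HOME/mhw_validation.log"
```

IDs with a low first-line number and a small count are the ones worth reading first; an ID with a very large count late in the file is more likely a consequence than a cause.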

Xaenalt commented 5 years ago

Yep, it does indeed still crash with the same issue. I wonder if there's any extra debugging info we can add to track it down further. I put the API dump in an earlier comment, but it looks like it didn't contain any clues.

doitsujin commented 5 years ago

@rsw0x that happens because of the error, but doesn't cause it. DXVK still doesn't handle DEVICE_LOST errors properly.

Xaenalt commented 5 years ago

Are device lost errors something recoverable? Or, more to the point, are they something normal that, if handled, would allow the game to continue?

YvanDaSilva commented 5 years ago

Device lost might just mean that the driver crashed. Most games don't handle this well. It is mostly benchmarks that handle it, because they know users may be running an unstable overclock or unstable drivers.

valarnin commented 5 years ago

I had this problem previously. Just updated to the latest nvidia-drivers ebuild on Gentoo: x11-drivers/nvidia-drivers-418.43::gentoo was built with the following: USE="X acpi driver gtk3 kms multilib tools -compat -static-libs -uvm -wayland" ABI_X86="32 (64) (-x32)"

Running this kernel: Linux localhost 4.20.13-gentoo #2 SMP Thu Feb 28 20:13:14 EST 2019 x86_64 Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz GenuineIntel GNU/Linux

Latest Steam proton beta (3.16-7).

nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.43       Driver Version: 418.43       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 00000000:01:00.0  On |                  N/A |
| 37%   59C    P0   130W / 180W |   3164MiB /  8119MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 970     Off  | 00000000:02:00.0 Off |                  N/A |
|  0%   27C    P8    12W / 201W |      1MiB /  4043MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3273      G   /usr/libexec/Xorg                            291MiB |
|    0      3516      G   /usr/bin/gnome-shell                         171MiB |
|    0      4169      G   ...uest-channel-token=17976434344080092270    51MiB |
|    0     19152      G   ...in/.local/share/Steam/ubuntu12_32/steam    33MiB |
|    0     19161      G   ./steamwebhelper                               3MiB |
|    0     20860    C+G   ...ter Hunter World\MonsterHunterWorld.exe  2495MiB |
|    0     22146      G   ...quest-channel-token=1348227650674135017   113MiB |
+-----------------------------------------------------------------------------+

I've had MHW running for an hour and a half now without any freezes. I'm going to let it run overnight and see if it freezes up, just looking out over the ocean.

If it DOES freeze up eventually, I have two nvidia cards I can test with on identical hardware otherwise. This is running on a GTX 1070, but I also have a GTX 970 I was using previously.

The GTX 970 usually froze up within 45 minutes to an hour of starting.

valarnin commented 5 years ago

Looks like it froze just after the 4 hour mark overnight. Same Xid dmesg log as buscher above.

Is there anything I can do to help diagnose further?

HK47196 commented 5 years ago

latest vulkan driver changelog:

Fixed a bug which could cause the compiler to crash in some Vulkan games

https://developer.nvidia.com/vulkan-driver

Any changes?

valarnin commented 5 years ago

latest vulkan driver changelog:

Fixed a bug which could cause the compiler to crash in some Vulkan games

https://developer.nvidia.com/vulkan-driver

Any changes?

Not sure whether I'm misunderstanding the way Nvidia does version numbering, or whether I already have that update. My nvidia-smi shows I'm running 418.43, whereas the latest version on that page is 418.42.02. I'll switch to that version and test again overnight. Given the ~4 hour run time until it freezes, it pretty much has to be an overnight test for me.

valarnin commented 5 years ago

Same result with 418.42.02. Started at 1551910780.835805, froze at 1551931937.174271, just under 6 hours of runtime.
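
As a quick check of the runtime, the two epoch timestamps above can be subtracted directly. This is a small sketch; the function name runtime_hours is made up for illustration.

```shell
# Hypothetical helper: print the elapsed time between two epoch timestamps
# (start, then end) in whole seconds and in hours.
runtime_hours() {
  awk -v t0="$1" -v t1="$2" 'BEGIN { s = t1 - t0; printf "%.0f s = %.2f h\n", s, s / 3600 }'
}

runtime_hours 1551910780.835805 1551931937.174271
# prints: 21156 s = 5.88 h
```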

Shotweb commented 5 years ago

I have this problem with Battlefield 1. The system freezes randomly, sometimes after many hours, and I have to kill X to get rid of the frozen image. It happens not only in game, but during loading screens and menus too. I am running KDE Neon 18.04 with nvidia-driver-418-418.43, Wine 4.3 and DXVK 1.0. My dmesg messages are:

[ 2980.508578] NVRM: GPU at PCI:0000:01:00: GPU-a919130d-9a04-dbf1-19c0-c827155af29b
[ 2980.508582] NVRM: Xid (PCI:0000:01:00): 31, Ch 00000054, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x1_5d1ee000. Fault is of type FAULT_PTE ACCESS_TYPE_READ

Let me know if there's more info I can provide.

Xaenalt commented 5 years ago

Something about the Kulve Taroth fight makes it crash much more frequently @_@

Emanem commented 5 years ago

Something about the Kulve Taroth fight makes it crash much more frequently @_@

Since I updated drivers to 418 and have a new videocard (2080 Ti), I never experienced this infamous crash once (used to have a 1080 GTX and the crash was happening every ~1.5 hours on average).

Xaenalt commented 5 years ago

That's interesting, I'm on the 1080ti, with driver 418.43, I wonder if there's some ray tracing function that is being used that avoids this issue

Can you give us an inxi -b to tell us a bit more about the system?

valarnin commented 5 years ago

I don't have any real proof one way or another, but this feels like a memory/handle issue. Is memory fragmentation a thing for GPU memory? Could there be 2 GB of VRAM free but no contiguous chunk large enough for a given texture, shader, etc., causing that allocation to fail and return null?

This would explain why more VRAM seems to let the game last longer before freezing, even though VRAM usage doesn't actually seem to leak.
