doitsujin / dxvk

Vulkan-based implementation of D3D8, 9, 10 and 11 for Linux / Wine
zlib License

DxvkMemoryAllocator: Memory allocation failed #747

Closed kakra closed 5 years ago

kakra commented 6 years ago

Software information

The Witcher 3, all settings maxed out, full HD, Nvidia Hairworks all characters + AA4

System information

Log files

After loading a saved game, the game freezes just milliseconds after starting to fade in the screen. Since the game still fades in, everything is dark but it looks like everything is rendered correctly - no models or textures missing, NV hairs are also working. This happens only sometimes.

Looking at the logs I see

terminate called after throwing an instance of 'dxvk::DxvkError'

Turning on full debug logging of DXVK eliminates the issue; it's no longer reproducible.

The frozen game can be successfully and instantly killed with SIGKILL.

doitsujin commented 6 years ago

Please apply the following patch to DXVK to get more descriptive error output: dxvk-error.patch.txt

I've never seen this problem or anything like it, and I test Witcher 3 a lot. With the current set of information I won't be able to do anything, though.

kakra commented 6 years ago

Thanks, I'll try it in the next few days. I never saw this behavior in the v0.80 series of DXVK.

SveSop commented 6 years ago

I haven't noticed this with the "Beta 3.16" Proton version tho. Afaik that uses dxvk-0.90...

Atm 3.16-4 Beta, that I would guess is the release called proton-3.16beta-20181031

kakra commented 6 years ago

@SveSop I'm currently working with bleeding edge builds here... Proton rebased to 3.19 including some code to optimize the process scheduler priorities to reduce priority inversion effects, and bleeding edge dxvk from git built as winelib. This boosts SOTTR performance from 19 to 33 fps for me here (even 35 fps with latest wine-3.19). And it reduces stutter and fps dips in TW3 and PoE. Also, intermittent freezes in SOTTR are fixed. I'm also working on some avrt patches so that native xaudio can properly gain realtime priority (currently, only built-in xaudio does that, and only with staging patchset). I'm going to push these updates to my repository soon but I'm currently not satisfied with them, and I want to test quality a little more. Also, wine had some commits lately breaking compatibility with esync and d3d related patches from Proton which I need to iron out (I think I fixed most by now).

I don't think that the wine version has anything to do with it, or if it has, it's something that'll show up here as soon as Proton would be officially based off a newer wine version.

kakra commented 6 years ago

@doitsujin Is it possible that the patch you've attached just displays a bunch of newlines? I currently cannot reproduce it in Witcher 3 but it now occurs in SOTTR.

doitsujin commented 6 years ago

Ah yeah, sorry. This one should work: dxvk-error.patch.txt

Again, SOTTR works fine on my end.

kakra commented 6 years ago

SOTTR also radically dropped performance for me during one of my last rebases, from 30 fps to 10 fps (with vsync+triple buffer). But I don't know if this is due to code changes in wine-master or in DXVK. I'm currently trying to figure out if my wine-master rebase went wrong. There are currently many conflicting changes going on and I'm reintegrating patches from their updated sources now. I already reverted my own code changes as a first step but that didn't help. So there seems nothing wrong with those. Ah well... sigh

doitsujin commented 6 years ago

Can you just test things with a clean wine-tkg setup (if you're on arch) or something similar to rule out issues with your wine build?

kakra commented 6 years ago

@doitsujin Okay, something strange is going on. Out of desperation, I zapped the shader cache from $STEAMAPPS/shadercache/$GAMEID (both DXVK and Nvidia) and the crash in SOTTR is gone, plus it's back to normal performance (the perceived performance even looks smoother now). The first benchmark run was clearly full of stutters as expected. Subsequent runs are fine now. Also, graphic distortions in SOTTR are gone (like Lara missing her clothes or hair).

Does this make sense to you? I wonder if TW3 benefits from a cache clear, too. Let me try...

PS: Don't try to reproduce Lara missing clothes and expecting some fun, the developers seem to have thought of this. :-)

doitsujin commented 6 years ago

That's weird and should probably not happen, but yeah, might be worth trying for TW3 as well.

kakra commented 6 years ago

Is the cache depending on the DXVK version somehow? And are there safeguards against broken shader caches?

Or: s/shader cache/state cache/

kakra commented 6 years ago

Okay, I already found that there's a safeguard using sha1 sums of each state cache entry, and a version header. So how did it break for me?

doitsujin commented 6 years ago

Not sure. Did you manage to confirm whether it was DXVK's state cache or the Nvidia driver cache that was causing issues?

kakra commented 6 years ago

I nuked both and only then discovered that this wasn't the best idea to find which one actually caused the problem. :-(

kakra commented 6 years ago

Okay, I got TW3 to crash again, this time logging worked (that logging patch should be in mainline, shouldn't it?):

0029:err:clipboard:convert_selection Timed out waiting for SelectionNotify event
0029:err:clipboard:convert_selection Timed out waiting for SelectionNotify event
DxvkMemoryAllocator: Memory allocation failed
terminate called after throwing an instance of 'dxvk::DxvkError'
004d:fixme:seh:dwarf_get_ptr unsupported encoding 9b
004d:fixme:seh:dwarf_get_ptr unsupported encoding c4
004d:fixme:seh:dwarf_get_ptr unsupported encoding 7d
004d:fixme:seh:dwarf_get_ptr unsupported encoding 9b
004d:fixme:seh:dwarf_get_ptr unsupported encoding c4
004d:fixme:seh:dwarf_get_ptr unsupported encoding 7d
004d:fixme:seh:dwarf_get_ptr unsupported encoding 9b
004d:fixme:seh:dwarf_get_ptr unsupported encoding 4a
004d:fixme:seh:dwarf_get_ptr unsupported encoding a9
004d:fixme:seh:dwarf_get_ptr unsupported encoding 9b
004d:fixme:seh:dwarf_get_ptr unsupported encoding 4a
004d:fixme:seh:dwarf_get_ptr unsupported encoding a9
004d:fixme:seh:dwarf_get_ptr unsupported encoding 9b
004d:fixme:seh:dwarf_get_ptr unsupported encoding 4a
004d:fixme:seh:dwarf_get_ptr unsupported encoding a9
004d:fixme:seh:dwarf_get_ptr unsupported encoding 9b
004d:fixme:seh:dwarf_get_ptr unsupported encoding 4a
004d:fixme:seh:dwarf_get_ptr unsupported encoding a9
004d:err:seh:call_stack_handlers invalid frame 3986f519 (0x39672000-0x39870000)
004d:err:seh:NtRaiseException Exception frame is not in stack limits => unable to dispatch exception.

kakra commented 6 years ago

Looking at the code, it seems like I should somehow manage to reproduce this error even with DXVK logging turned on...

doitsujin commented 6 years ago

DxvkMemoryAllocator: Memory allocation failed indicates that you're running out of memory (not necessarily VRAM).

kakra commented 6 years ago

Okay, I renamed the issue title to reflect the original problem. I think the "cache corruption" in SOTTR is really a different issue and should be reported separately by me if it occurs again.

It's strange that this can happen even very early after starting the game, read: When I just loaded a saved game the first time after starting The Witcher 3. I'll report back with new findings.

Actually, my system was loaded with some development applications which like to take a good amount of RAM while this issue occurred the last time. But it still had plenty of RAM left, around 8 GB. After all, TW3 is usually not THAT memory hungry (being an older game).

kakra commented 6 years ago

Here's an update:

err:   DxvkMemoryAllocator: Memory allocation failed
  Size:      134217728
  Alignment: 256
  Mem flags: 0x7
  Mem types: 0x681
DxvkMemoryAllocator: Memory allocation failed
terminate called after throwing an instance of 'dxvk::DxvkError'
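
For readers who want to interpret the numbers in this error output, here is a minimal Python sketch. It assumes that "Mem flags" is a Vulkan memory property bitmask and that "Mem types" is a bitmask of acceptable memory type indices; the thread does not show the patch that prints them, so both readings are assumptions. The flag values themselves come from the Vulkan specification, and the helper names are made up for illustration.

```python
# Vulkan memory property flag bits (values as defined in the Vulkan spec).
MEMORY_PROPERTY_BITS = {
    0x01: "DEVICE_LOCAL",
    0x02: "HOST_VISIBLE",
    0x04: "HOST_COHERENT",
    0x08: "HOST_CACHED",
    0x10: "LAZILY_ALLOCATED",
}


def decode_flags(flags):
    """Return the names of all property bits set in `flags`."""
    return [name for bit, name in MEMORY_PROPERTY_BITS.items() if flags & bit]


def set_bits(mask):
    """Return the indices of all bits set in a 32-bit mask."""
    return [i for i in range(32) if mask & (1 << i)]


# Values copied from the log above.
size = 134217728
mem_flags = 0x7
mem_types = 0x681

print(size // (1024 * 1024), "MiB")   # size of the failed request
print(decode_flags(mem_flags))        # requested properties (assumed meaning)
print(set_bits(mem_types))            # candidate memory type indices (assumed meaning)
```

Under these assumptions the failed request is a 128 MiB allocation asking for device-local, host-visible, host-coherent memory.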

# free -m
              total        used        free      shared  buff/cache   available
Mem:          15931        8063        1415         184        6452        7092
Swap:         67583        1304       66279

# nvidia-smi after the crash
Fri Nov  9 21:03:35 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.54.09              Driver Version: 396.54.09                 |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  Off  | 00000000:01:00.0  On |                  N/A |
| 54%   43C    P5    N/A /  75W |   1697MiB /  4006MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1040      G   /usr/libexec/Xorg                           1120MiB |
|    0      2537      G   /usr/bin/kwin_x11                             47MiB |
|    0      2547      G   /usr/bin/krunner                               1MiB |
|    0      2549      G   /usr/bin/plasmashell                         241MiB |
|    0      4762      G   ...quest-channel-token=6856907186180940681   256MiB |
|    0      6145      G   ...ra/.local/share/Steam/ubuntu12_32/steam    20MiB |
|    0      6153      G   ./steamwebhelper                               1MiB |
|    0      6172      G   ./steamwebhelper                               4MiB |
+-----------------------------------------------------------------------------+

kakra commented 6 years ago

I played this game for extended hours (sometimes 12 in a row, yes I'm an addict of this game) with previous versions of DXVK. So I wonder why I see this now.

doitsujin commented 6 years ago

Does it still work on older versions?

I don't see why allocating a 128MB buffer in system memory would suddenly fail when it previously didn't, especially since the memory allocator hasn't been touched in a long time.

kakra commented 6 years ago

After trying some games, I see that multiple games are affected... Skyrim SE freezes on loading screens or in the middle of the game, looking at the logs I also see DXVK complain that very moment about memory.

It looks like Chrome hogs a lot of GPU memory: Xorg was holding almost 3 GB of GPU memory. Restarting Chrome fixes that, and stopping Chrome gets rid of the issues in Skyrim SE. I don't think it has anything to do with the DXVK version; the common factor seems to be other processes occupying GPU memory. Shouldn't such memory be swapped out to system memory? Maybe something changed in the NVIDIA driver?
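
To see which processes hold GPU memory, the process table from nvidia-smi can be summed up with a few lines of Python. This is only a sketch: the regular expression is written against the table layout posted earlier in this thread (driver 396.54.09) and may not match other driver versions, and the function name is made up for illustration.

```python
import re

# Process rows copied from the nvidia-smi output posted earlier in this thread.
SAMPLE = """\
|    0      1040      G   /usr/libexec/Xorg                           1120MiB |
|    0      2537      G   /usr/bin/kwin_x11                             47MiB |
|    0      2547      G   /usr/bin/krunner                               1MiB |
|    0      2549      G   /usr/bin/plasmashell                         241MiB |
|    0      4762      G   ...quest-channel-token=6856907186180940681   256MiB |
|    0      6145      G   ...ra/.local/share/Steam/ubuntu12_32/steam    20MiB |
|    0      6153      G   ./steamwebhelper                               1MiB |
|    0      6172      G   ./steamwebhelper                               4MiB |
"""

# Columns: GPU index, PID, type, process name, usage in MiB.
ROW = re.compile(r"^\|\s+(\d+)\s+(\d+)\s+(\S+)\s+(.+?)\s+(\d+)MiB\s*\|$")


def parse_processes(text):
    """Return (pid, name, mib) tuples for every process row in `text`."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            _gpu, pid, _kind, name, mib = match.groups()
            rows.append((int(pid), name, int(mib)))
    return rows


rows = parse_processes(SAMPLE)
total = sum(mib for _, _, mib in rows)
biggest = max(rows, key=lambda row: row[2])
print(f"{len(rows)} processes, {total} MiB in total")
print(f"largest: {biggest[1]} (pid {biggest[0]}) with {biggest[2]} MiB")
```

For the table posted earlier this adds up to 1690 MiB, close to the 1697MiB shown in the summary row, with Xorg as the largest consumer at 1120 MiB.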

doitsujin commented 6 years ago

System memory that needs to be made visible to the GPU cannot be swapped out as far as I'm aware, and the Nvidia driver might have further limitations (probably for a good reason). fwiw I've seen similar issues on amdgpu under low-memory conditions, although not directly related to DXVK.

In any case, if you consider this issue resolved by closing third-party applications, please close the issue.

kakra commented 6 years ago

There's definitely a bug somewhere leaking memory... If I play long enough, VRAM eventually fills up to 3.7 of 4 GB and then games either freeze, crash or behave strangely (like flickering or missing textures/models). Something changed, I'm just not sure what. I've been using the same NV driver version for some time now, so it's not too likely that the graphics driver changed something. I followed DXVK master closely. Maybe something new in DXVK triggers such a bug?

It's quite possible that the bug was there earlier but something triggers it much earlier now. I've seen similar problems on very rare occasions before but only after very long gaming sessions.

doitsujin commented 6 years ago

You can monitor DXVK's memory consumption (both VRAM and mapped system RAM) with DXVK_HUD=memory. I haven't seen any behaviour that would indicate a leak.

dlshinobi commented 6 years ago

I've also noticed (DXVK_HUD=memory) that some games like Dark Souls 3 are unplayable on my old lady GTX 960 with 2 GB VRAM. Memory consumption is about 1.8-2.0 GB (just like in Windows), but just try to sit at the bonfire, die or teleport to another area and oh boy, it spikes to ~2.6-3.4 GB. And now I'm playing Dark Souls 3 in bullet time (slow motion). Fallout 4, Skyrim SE and ReCore DE also eat way above my VRAM limit. But games like Dark Souls Remastered, Deus Ex HR, Divinity OS 2 DE, GTA V, Hard Reset Redux, Shadow Warrior 2 and Witcher 3 work perfectly.

lieff commented 6 years ago

I have the same issue with 4 GB of GPU memory and SOTTR, but I can work around it by using medium textures. It looks like the game does not free memory immediately but uses some garbage collector/reuse mechanism (or it just leaks and that's not so noticeable on Windows). On Windows some GPU defragmenter is at work and also keeps the hottest memory resident on the GPU, while in DXVK, if a texture is allocated in host memory, then it stays on the host forever.

kakra commented 6 years ago

I changed the title because it is not game specific. It's visible in different games. I'm playing SOTTR even with low textures and it happens after 3-4 hours (sometimes earlier). In SOTTR another effect probably resulting from this is the second benchmark always runs slower than the first, going into a game level, then returning to the benchmark, it's even slower now. So there's overhead accumulating somewhere.

SOTTR: Results in graphical glitches first, then freezes, finally crashes to desktop after some time of thrashing the harddisk

TW3: Just freezes, often just after the initial saved game load, on rare occasions it crashes to desktop

SkyrimSE: Just freezes, either during a loading screen (endless loading screen) or in the middle of the game, sound continues to play

Anyone of you guys know what changed lately in your systems? This effect was much less visible some time ago (or even didn't exist, hard to say).

kakra commented 6 years ago

Would it help to use dxvk.allowMemoryOvercommit = True?

doitsujin commented 6 years ago

You could try, but that only has an effect when you actually run out of VRAM. With that option disabled, DXVK falls back to a system memory allocation.

But you won't ever run out of VRAM in Witcher 3, for example.

kakra commented 6 years ago

This needs further testing but it seems to help: SOTTR runs a lot slower if Xorg already occupies a lot of VRAM but it didn't crash in a quick test. Also the other games affected seem to no longer crash (and, in contrast, see no slowdown). But I need to load the system a little more.

kakra commented 6 years ago

@doitsujin Okay, after testing for a while, available VRAM makes a huge difference for performance. All games crash if I do not enable overcommitting. Crashing is a bad experience, so we probably need something more intelligent here. Meanwhile, I patched my version of DXVK to enable overcommitting by default which gets rid of all the crashes I experienced lately. I'd rather reach a savepoint with bad performance than have a crash which forces me to repeat parts of the game. There's always the option to gracefully quit the game and clean up VRAM somehow.

Games are affected in different ways, I've tested two so far:

  1. In SOTTR, under low VRAM, performance degrades vastly, by a factor of at least 2, sometimes 3 or 4. This has a very big impact.
  2. In TW3 the frame dips become much more visible: With almost all VRAM available, the frame dips are hardly noticeable, I managed to run TW3 with only ~800MB VRAM available and frame dips became very noticeable. But using the mod "HD Reworked Project" with its optimized configuration helped a lot: While it adds very high quality textures to the game (which should even worsen the situation), it also includes a configuration with optimized texture streaming parameters and that helps a lot: TW3 feels much smoother.
  3. I didn't test the third so much yet but SkyrimSE doesn't seem to have noticeable performance problems with low VRAM even though I'm using HD and immersion mods which make it use 2-3 GB of VRAM.

So first: Maybe this issue should be tagged "performance".

Second: Is there anything intelligent that DXVK could do about memory management? In retrospect to the issue reporting degrading performance with high quality textures in SOTTR, is it possible for DXVK to somehow prioritize what goes to VRAM and what goes to system RAM? Is it possible that DXVK could discard allocations from VRAM, or swap them between system memory and VRAM based on usage patterns?

I wonder how Windows manages this... It either has ways to manage and swap VRAM with system RAM, or the games behave differently there and can actually manage this on their own. Then the question is, why can't they do it when running under wine/DXVK?

SveSop commented 6 years ago

I must be doing something wrong i guess. Replaced Proton 3.16 dxvk files with 1724d51 and played about 2 hours. Never went past 2.8GB allocated mem.

GTX970 4GB w/396.54.9 driver. 1080p all on "Ultra" in TW3. No huge "dips" i guess, but fps is not stellar (around 50'ish fps) Perhaps you use 4K? Should i opt to try to use that "HD Reworked Project"?

EDIT: Oh, i realized, you do perhaps load vram near 4GB to see if stuff starts lagging too much? So, just burn some vram to see?

kakra commented 6 years ago

@SveSop Yes, I burned some GPU VRAM by opening a lot of Chrome tabs, Spotify and some other Chrome-based apps. Even Steam itself is Chrome-based (at least the webview component). I don't know why Chrome-processes eat so much VRAM but it can become an issue.

You could try the HD Reworked Project to see if it reduces fps dips for you. It felt smoother here.

BTW: After unpacking the mod and extracting the data files and the config folder, go into the game settings and activate the new texture quality level to actually use the configuration.

K0bin commented 6 years ago

When DXVK fails to allocate a resource in VRAM, it allocates it in RAM which of course comes with a pretty big performance penalty. That shouldn't crash though. If anything it should crash only if you allow overcommitting.

Windows does the same but it is a bit smarter and also moves existing resources out of VRAM to make room for more important ones.

kakra commented 6 years ago

@K0bin I think it's quite the other way: Overcommitting allows memory to be allocated from system RAM even if the game didn't ask for it, otherwise it fails which crashes the game (because DXVK refuses to continue).

I wonder if it is possible to make DXVK similarly smart and let it move resources out of the way. But I guess this has to be fixed lower down the graphic stack layers, i.e. the driver itself or Xorg must be willing to give up resources when requirements are coming in...

K0bin commented 6 years ago

> @K0bin I think it's quite the other way: Overcommitting allows memory to be allocated from system RAM even if the game didn't ask for it, otherwise it fails which crashes the game (because DXVK refuses to continue).

No, it's not the other way, see #527.

> I wonder if it is possible to make DXVK similarly smart and let it move resources out of the way. But I guess this has to be fixed lower down the graphic stack layers, i.e. the driver itself or Xorg must be willing to give up resources when requirements are coming in...

It's possible but it's very hard and a lot of work. This has to be done inside DXVK though, Vulkan explicitly leaves memory management to the application. That's probably not going to happen any time soon, so your best option is to just lower your graphics settings.

kakra commented 6 years ago

@K0bin Interesting... DXVK definitely terminates here when the error occurs. Revisiting the allocation code, it shouldn't do that. It should just return DxvkDeviceMemory() instead of result. So I conclude something is going wrong just a little bit later? Like accessing null pointers? Ah no, it throws from here: https://github.com/doitsujin/dxvk/blob/4db5c21ec5b983334431e9e8f21b9cbaa2ac7d2a/src/dxvk/dxvk_memory.cpp#L197

doitsujin commented 6 years ago

It only throws the error when neither the VRAM nor the System RAM allocations succeed. Once that happens, it's too late to continue in any meaningful way anyway.
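
The behaviour described here can be modelled in a few lines of Python. This is a simplified sketch and not DXVK's actual code: `FakeDevice`, `allocate` and `AllocationError` are hypothetical names, the two heaps are reduced to plain byte counters, and the overcommit option is not modelled at all.

```python
DEVICE_LOCAL = 0x1  # Vulkan's VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT


class AllocationError(RuntimeError):
    """Stands in for the DxvkError seen in the logs (hypothetical name)."""


class FakeDevice:
    """Toy driver with two heaps that are tracked as free byte counts."""

    def __init__(self, vram_free, sysmem_free):
        self.free = {"vram": vram_free, "sysmem": sysmem_free}

    def try_alloc(self, size, flags):
        heap = "vram" if flags & DEVICE_LOCAL else "sysmem"
        if self.free[heap] < size:
            return None  # the driver refuses the request
        self.free[heap] -= size
        return (heap, size)


def allocate(device, size, flags):
    """Try device-local memory first, then fall back to system memory."""
    memory = device.try_alloc(size, flags)
    if memory is None and flags & DEVICE_LOCAL:
        memory = device.try_alloc(size, flags & ~DEVICE_LOCAL)
    if memory is None:
        # Both attempts failed: nothing sensible is left to do.
        raise AllocationError("Memory allocation failed")
    return memory


MIB = 1024 * 1024
crowded = FakeDevice(vram_free=64 * MIB, sysmem_free=1024 * MIB)
print(allocate(crowded, 128 * MIB, 0x7))  # falls back to the "sysmem" heap

exhausted = FakeDevice(vram_free=64 * MIB, sysmem_free=64 * MIB)
try:
    allocate(exhausted, 128 * MIB, 0x7)
except AllocationError as error:
    print("fatal:", error)
```

In this model a full VRAM heap alone only costs performance; the fatal error appears only when the system memory attempt is refused as well.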

SveSop commented 6 years ago

I think it's not a good test for me with GTX970 to use over 3.5GB vram for comparing performance, cos of the 970 memory configuration thingy. https://hexus.net/tech/news/graphics/79925-nvidia-explains-geforce-gtx-970s-memory-problems/

So this "bug" is mostly crashes due to OTHER apps using up gpu memory, and dxvk not pushing this (useless) memory usage out of the way to up performance? :)

kakra commented 6 years ago

@doitsujin Yes this is what the code says but I'm sure there's still sysmem available, or the system could just swap stuff out to disk to make some small allocation of 128M available. Could this be a driver bug? After all you're not allocating through standard C/C++ functions but through vulkan functions.

doitsujin commented 6 years ago

I don't know what the issue is, but it seems that something eats unusual amounts of memory on your end. Have you tried running those games on a simple WM (like fluxbox) without any applications running in the background?

SveSop commented 6 years ago

Stupid question from my end: Is the problem here that there is a memory leak eating more and more vram until the game crashes in eg. TW3? If so, i dont really see the same problem on my end, as those 2.8GB allocated vram shown happened in the first 2-3 minutes of playing TW3 yesterday, and did not increase over the course of 2 hours of me playing, loading/saving games several times without change. In use (committed? dont remember the wording) hovered around 2.2GB - 2.5GB mostly.

I did not try to crash the game on purpose by overloading vram in some other manner tho.

Just trying to troubleshoot on a different system than yours to weed out any possible non-dxvk issues.

SveSop commented 6 years ago

Did a wee bit of testing back and forth, and can't really say i am able to make something eat so much vram.. Opening 10 chrome windows did not chunk out a huge deal of vram either tbh, but for all i know you could be running 200+ windows while editing a 4K movie in the background :)

What i DID notice however (no difference between Proton 3.16 w/dxvk 0.90 vs. building my own dxvk from git) was the "Memory Allocated" from the DXVK hud only increased and never went down even if i loaded a saved game with less "Memory used".

That might be intended, and tbh SHOULD not be an issue as long as it is <4GB i guess? Eg.

nVidia-SMI: Witcher3: 1488
nVidia-SMI: (Total): 1880
DXVK Memory Allocated: 2030
DXVK Memory Used:      1837

Loading save games from different spots + running around and so on would push the "Memory Allocated" upwards, even tho "Memory used" goes up/down as needed. Not really sure what explains the discrepancy between nVidia-SMI (which i would deem to be "accurate" in usage directly from the driver) and "Memory used". nVidia-SMI "Total" memory was 1880, and somewhat more in line with "DXVK Memory" i guess, but the nVidia one includes Xorg, gnome-shell and stuff like that, so i would not think DXVK would be able to "read" that?

I did not test hours upon hours of gameplay, but from the 2 hours i played yesterday mentioned above, i had 2.8GB "Memory allocated", so i guess it MIGHT be something that just grows and grows until it gets a problem? Is the mem allocation something that SOMETIMES gets cleared out? (Or rather SHOULD).

kakra commented 6 years ago

DXVK uses a chunk allocator, thus it usually doesn't clean up because some bit of information will always be left in a chunk. Chunks are probably allocated in 64 MB blocks. Within each chunk you have a free list of blocks, and DXVK allocates from the biggest free block available (unless a free block matches the requested size exactly), provided the memory type of the request matches the chunk type. It's similar to how btrfs manages its device space. If a chunk becomes completely free, it could be de-allocated, but that really doesn't make much sense because you would probably request a new chunk of memory just moments later. If no free block can be found, a new chunk is allocated from the device. Thus, it's normal that the memory usage only increases until it peaks at some value.

A chunking allocator is pretty much the best thing you can do if you need to handle different and incompatible types of allocations. You just need to properly tune the chunk size so you can fit all types of allocations without too much overhead and without too much wasted space. The "allocated" counter is probably what's been allocated as chunks, the "used" counter is what's actually used across all chunks. The difference is wasted space which wasn't used or couldn't be used due to incompatible memory type flags.

As far as I understood, chunks are allocated from the driver or the vulkan layer which in turn decides if it allocates from the device or from system memory (depending on the flags given). Within each chunk, memory is managed by DXVK itself by keeping lists of free blocks (pairs of offset/size).
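
To illustrate why "allocated" only grows while "used" goes up and down, here is a toy chunk allocator in Python. It is a sketch of the scheme as described above, not DXVK's implementation: the 64 MiB chunk size is the guess from this comment, memory types and alignment are ignored, and all class and method names are made up.

```python
MIB = 1024 * 1024
CHUNK_SIZE = 64 * MIB  # assumed chunk size, taken from the guess above


class Chunk:
    def __init__(self, size):
        self.size = size
        self.free = [(0, size)]  # free list of (offset, length) pairs

    def alloc(self, size):
        if not self.free:
            return None
        # Prefer an exact fit, otherwise take the biggest free block.
        exact = [block for block in self.free if block[1] == size]
        block = exact[0] if exact else max(self.free, key=lambda b: b[1])
        offset, length = block
        if length < size:
            return None
        self.free.remove(block)
        if length > size:
            self.free.append((offset + size, length - size))
        return offset

    def release(self, offset, size):
        # Put the block back and merge neighbouring free blocks.
        self.free.append((offset, size))
        self.free.sort()
        merged = [self.free[0]]
        for off, length in self.free[1:]:
            last_off, last_len = merged[-1]
            if last_off + last_len == off:
                merged[-1] = (last_off, last_len + length)
            else:
                merged.append((off, length))
        self.free = merged


class ChunkAllocator:
    def __init__(self):
        self.chunks = []
        self.used = 0  # bytes handed out to callers

    @property
    def allocated(self):
        """Bytes requested from the device; chunks are never given back."""
        return sum(chunk.size for chunk in self.chunks)

    def alloc(self, size):
        for index, chunk in enumerate(self.chunks):
            offset = chunk.alloc(size)
            if offset is not None:
                self.used += size
                return (index, offset, size)
        # No existing chunk has room: request a new one from the device.
        chunk = Chunk(max(CHUNK_SIZE, size))
        self.chunks.append(chunk)
        self.used += size
        return (len(self.chunks) - 1, chunk.alloc(size), size)

    def free(self, handle):
        index, offset, size = handle
        self.chunks[index].release(offset, size)
        self.used -= size


allocator = ChunkAllocator()
first = allocator.alloc(40 * MIB)
second = allocator.alloc(40 * MIB)  # does not fit the 24 MiB left in chunk 0
allocator.free(first)
print(allocator.allocated // MIB, "MiB allocated,", allocator.used // MIB, "MiB used")
```

After freeing the first block the toy allocator still holds two chunks (128 MiB "allocated") while only 40 MiB are "used", which mirrors the gap between the two HUD counters.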

What happens in my case seems to be: DXVK asks Vulkan for a new chunk of device-local memory, Vulkan says "no", DXVK tries again without the "device-local" flag, which allows non-local memory that is slower because it is accessed over the PCI bus. But Vulkan says "no" again, even though there's plenty of system RAM available to allocate such a chunk. I can only guess why that is. Maybe Vulkan cannot find system memory that would be mappable by the GPU. Not all of your physical address space may be available to the GPU because of chipset limitations, or because other devices already mapped that, i.e. another GPU, or I don't know what.

Overcommitting "solves" this because it lets vulkan pretend that unused chunk memory isn't going to be used any time soon. Thus, such memory is still available to other allocations. Your Linux kernel does a similar thing: Allocated memory only becomes mapped to real memory if something writes to the memory blocks. Otherwise it stays idle. It accounts for the allocated RAM but not the used RAM. It's the "virt" counter you'd see in top: virt is allocated space. But things start crashing if one application now actually wants to use its allocated but yet unused memory: The GPU won't find any space to put that request, it fails, crash. Linux solves this by swapping to disk. The GPU could request the driver to swap to sysmem. But as I understood, vulkan leaves that completely to the application. So DXVK would be in charge of doing so. But DXVK doesn't implement this. It's complicated. It should be avoided as long as you can.

So in turn that means: Overcommitting does not crash for me, thus a lot of VRAM is only allocated but not used. So Chrome (or Xorg) seems to allocate a lot of VRAM just because it can but it never uses it.

To the experts: Does this make sense?

I'm running with two monitors, left one is a full-HD TV (which I actually use for gaming from the couch, with a wireless controller), and the right one is a 4k PC monitor. I do no video editing but some browser tabs may host paused or finished youtube videos (which tend to be streamed in 4k quality). I also have multiple gmail tabs open. At least back in 2014 there was a bug in Chrome where it would slowly eat away your VRAM if you have gmail opened over longer periods of time. But that was fixed since then.

So overall I'm probably running a virtual framebuffer of (1920+3840)x2160 pixels at 32 bit color depth (I think it doesn't use 24 bit buffer representation, but color space is 24 bit). With triple buffering, that's about 142 MB of screen buffer. Probably there's some padding and alignment but nothing to worry about...
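
The estimate can be checked with a quick calculation. The sketch below uses the same assumptions as the paragraph above (one combined framebuffer spanning both monitors, 4 bytes per pixel, three buffers) and ignores padding and alignment.

```python
width = 1920 + 3840      # full-HD TV next to the 4k monitor
height = 2160            # height of the taller monitor
bytes_per_pixel = 4      # 32 bit per pixel
buffers = 3              # triple buffering

total_bytes = width * height * bytes_per_pixel * buffers
total_mib = total_bytes / (1024 * 1024)
print(f"{total_bytes} bytes = {total_mib:.1f} MiB")
```

This gives 149299200 bytes, about 142.4 MiB, so the "about 142 MB" figure holds when counted in MiB.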

Or a little less technical and abstract:

Think of your desktop (the real wooden one where you put your keyboard and mouse on) as your VRAM. Every time you want to do something with the GPU, arrange a piece of coloured paper onto your desktop. Put your information in the paper sheets. Different types of information will use different coloured paper. At some point either your desktop fills up and you can only use the space left on paper, or the space left on paper is enough to work with. If your desktop space fills up, you could start putting paper sheets elsewhere... On the floor... or into some folders. But accessing these is much slower then. Overcommitting is like using scissors to cut parts of paper off and replace those parts with a different color. But if the other application now has to put information there and there's no space left to put the cut-off snippets, things will crash.

SveSop commented 6 years ago

Thanks for a thorough explanation :)

I use 2x1080p monitors, but rarely have i ever seen vram used past 2GB in the cruddy old games i play... save for TW3 (probably old aswell), and have after a while of playing up toward 2.8GB allocated mem. Now.. i dont do 12+ hour gaming sessions without logging off, nor do i have many many chrome tabs open while i game. I DO however sometimes watch a video of some quest, or read some shit WHEN i play, but nowhere near going oom of vram. This COULD ofc be worse if i play for a lot longer, as i said (and to your explanation) the game COULD be allocating chunks until vram is all spent? Dunno.

How long does it take you if you do a clean boot and just load up steam and start TW3 until you get errors? Cos troubleshooting stuff that is in the realm of "Oh.. yeah, you need to do a 12 hour playingsession before that happens" is kinda.. uhm.. Well :)

As i said, chrome seemed hard pressed to really use much vram for me, so i am looking for something different perhaps.. some example code that can be started over until vram is spent perhaps? Found some references to GLSLHacker (GeeXLab) and some 4GB vram test thingy, but was not able to find that anymore. Opening a 4K video on youtube seems to be using a whopping 70MB of vram for me, so i dunno...

doitsujin commented 6 years ago

@kakra

> The GPU could request the driver to swap to sysmem. But as I understood, vulkan leaves that completely to the application. So DXVK would be in charge of doing so.

Actually no, it isn't. Once a memory chunk is allocated, Vulkan apps don't really have to bother with it, residency is managed by the driver. Even for device-local memory types, there is no guarantee that memory allocated from them is actually located in VRAM, it can be paged out if necessary.

kakra commented 6 years ago

@doitsujin So we are back to "that doesn't seem to happen here". It only strengthens the theory this is a driver / graphic stack issue here... Maybe related to configuration or hardware memory layout...

kakra commented 6 years ago

Okay, I managed to let SOTTR allocate more than 4GB of memory now without a crash, also TW3 allocated around 3GB without a crash now - with Chrome and some other windows opened. I have a theory of what was going wrong in my system but I need to test this a little more. The "slow down over time" issue in SOTTR also seems to be gone but since my system does a lot of background activity currently, I'd like to defer the performance testing a little more. Currently, stuttering is a lot more apparent now and I'm not sure if it comes from background activity or switched settings. But the games seem to cope well now with Xorg/Chrome allocating a lot of memory, overall memory footprint of those seems a little bit lower now. I'll report back.

Fun fact: Sometimes it helps to write elaborate texts explaining things to get a clue where a problem is. :-)

kakra commented 5 years ago

Apparently, during testing various kernel configurations I managed to crash my filesystem the hard way. I probably lost some important changes to the wine code, one of which is hard to recreate. Replacement drives are ordered because I want to keep around the broken file system for trying recovery. This throws me back about 1-2 weeks, so I'm going to pause working on this for a few days.

But to recap what I found out so far: Vulkan (or NVIDIA) seems to interact with THP very badly (at least in combination with wine). It wasn't able to allocate more RAM because there was just no mappable memory block left to allocate. This is probably a memory fragmentation issue. Usually, the kernel would defer huge page creation then. But I also noticed that my kernel didn't properly enable IOMMU (which seems to be important for NVIDIA). I was still testing that part when the crash occurred.

Since THP can be a pretty nice, performance enhancing feature, I wanted to work out a proper configuration and document that. First tests showed that it makes a difference in performance. Overall fps was mostly identical but I did notice audio-dropouts every now and then which I didn't notice before.

So I probably take the chance to rebase my work to wine 3.21 then. I was just finished with preparing and cleaning up the 3.20 release when everything went down the virtual drain. :-(

I think I'll be back up and running by next weekend. Thank you, Murphy, that I discovered my daily backup wasn't working that very same day.

Note to myself: Don't mix zswap with some workloads. Push often even if still WIP.

Conclusion: If someone is seeing this issue, too, it may be due to THP being enabled and not being fully and/or correctly configured. Could you check? grep ^ /sys/kernel/mm/transparent_hugepage/*
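
For anyone who prefers a script over the grep one-liner, the following Python sketch reads the same sysfs directory and prints the active value of each setting. The helper names are made up for illustration; the only assumption about the kernel interface is the usual convention that multi-choice files mark the active choice in square brackets, e.g. "always [madvise] never".

```python
import re
from pathlib import Path

THP_DIR = Path("/sys/kernel/mm/transparent_hugepage")


def active_choice(text):
    """Return the bracketed choice, or the plain value if there is none."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else text.strip()


def read_thp_settings(base=THP_DIR):
    """Map each readable file in `base` to its active value."""
    settings = {}
    if not base.is_dir():
        return settings  # no THP support, or not running on Linux
    for entry in sorted(base.iterdir()):
        if entry.is_file():
            try:
                settings[entry.name] = active_choice(entry.read_text())
            except OSError:
                continue  # skip files we are not allowed to read
    return settings


for name, value in read_thp_settings().items():
    print(f"{name}: {value}")
```

On a system without the directory the script simply prints nothing.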