doitsujin / dxvk

Vulkan-based implementation of D3D8, 9, 10 and 11 for Linux / Wine
zlib License

[d3d11] [regression] [dissected] Automobilista 2 (1066890) severe performance regression since v1.9.4 #2502

Closed ZakMcKrack3n closed 2 years ago

ZakMcKrack3n commented 2 years ago

The regression (77 fps -> 30 fps in the same scene) started with the following commit: https://github.com/doitsujin/dxvk/blob/d34bbdb58e547c943ef1eaeac38a88c35f51f52a/src/dxvk/dxvk_memory.cpp#L421

As the master branch has additional changes, I am unable to say how this translates to the current code base.

System information

Commenting out the condition and its code in the 1.9.4 branch restores performance to 1.9.3 levels. Unfortunately, setting log levels completely freezes my game launch, so I cannot provide any logs.

Other games tested with 1.9.4 vs 1.9.3 showed no obvious performance change (Assetto Corsa / Dirt Rally 2.0)

doitsujin commented 2 years ago

What's memory utilization like (DXVK_HUD=memory)?

Could be that frequent allocation and deallocation is causing AMDGPU to not page in certain VRAM allocations or something, especially since you only have 2GB of VRAM, although this is exactly the kind of situation that we're trying to avoid since that commit.

Alternatively, this could be a Nier Replicant situation where you'd want to try out dxvk.apitraceMode = True in dxvk.conf.

doitsujin commented 2 years ago

"Just" installed the game (why exactly is this 80GB?) and even at 1080p lowest settings it requires more than 3GB of VRAM, so yeah, this is not going to run well on a 2GB GPU. Not much we can do here.

ZakMcKrack3n commented 2 years ago

Here are the stats (they are nearly identical) when running at 1280x720 with everything set to low and only 11 cars (more cars = more VRAM gone, so it's certainly playable).

dxvk.apitraceMode = True

This was one of many tweaks that I tested with no effect (I also checked other tweaks listed in config.cpp for other games).

DXVK_HUD stats for v1.9.2 through v1.9.4 stock dlls (master for me is almost identical to v1.9.4):

DXVK v1.9.2 (stock with Proton 6.3):
FPS: 73.3
Queue submissions: 5
Draw calls: 2655
Dispatch calls: 0
Render passes: 62
Graphics pipelines: 1792
Compute pipelines: 1
Vidmem heap 0: 1627MB (90%)
Vidmem heap 1: 211MB (6%)
Vidmem heap 2: 149MB (58%)
GPU: 100%

DXVK v1.9.3 (dropped into Proton 7):
FPS: 77.5
Queue submissions: 5
Draw calls: 2664
Dispatch calls: 0
Render passes: 57
Graphics pipelines: 1792
Compute pipelines: 1
Vidmem heap 0: 1627MB (90%)
Vidmem heap 1: 211MB (6%)
Vidmem heap 2: 149MB (58%)
GPU: 100%

DXVK v1.9.4 (Stock Proton 7):
FPS: 29.2
Queue submissions: 4
Draw calls: 2707
Dispatch calls: 0
Render passes: 57
Graphics pipelines: 1792
Compute pipelines: 1
Vidmem heap 0: 1734MB (96%) 1627MB used
Vidmem heap 1: 224MB (6%) 211MB used
Vidmem heap 2: 176MB (58%) 149MB used
GPU: 100%
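
For reference, the gap between the two figures on each v1.9.4 heap line can be worked out directly. This is a minimal sketch that assumes the first number is the amount allocated from the heap and the trailing "used" number is the part actually occupied by resources; the function name `heap_slack` is made up for this illustration.

```python
import re

# HUD lines copied from the v1.9.4 run above.
HUD_LINES = [
    "Vidmem heap 0: 1734MB (96%) 1627MB used",
    "Vidmem heap 1: 224MB (6%) 211MB used",
    "Vidmem heap 2: 176MB (58%) 149MB used",
]

# Assumption: "<allocated>MB (<percent>%) <used>MB used" is the line layout.
PATTERN = re.compile(r"Vidmem heap (\d+): (\d+)MB \(\d+%\) (\d+)MB used")


def heap_slack(lines):
    """Return {heap index: allocated minus used, in MB} for each HUD line."""
    slack = {}
    for line in lines:
        match = PATTERN.fullmatch(line)
        if match is None:
            raise ValueError(f"unrecognised HUD line: {line!r}")
        heap, allocated, used = (int(group) for group in match.groups())
        slack[heap] = allocated - used
    return slack


if __name__ == "__main__":
    for heap, mb in heap_slack(HUD_LINES).items():
        print(f"heap {heap}: {mb} MB allocated but not in use")
```

On heap 0 that is 107 MB of allocated-but-unused memory on a card where the "used" figure matches the single number that v1.9.2 and v1.9.3 report.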

I understand that I am running this game on a potato, but staying within certain limits I was able to run it just fine. I also use a DE without a compositor for now, which gets me from around 311MB VRAM utilisation with Steam running to just over 100MB.

As I understand it, the code should help with memory pressure. Is there some hidden overhead for the GPU's memory controller or the CPU in certain edge cases? Should I investigate CPU utilization? (Core i5 4690, non-K, running at 3.9 GHz)

I will check the performance on another track with no cars (low VRAM utilisation). If I hit the same performance difference, it's not directly a VRAM issue?

EDIT: Also, do not let the VRAM allocation of the main menu fool you; it seems to load all cars so they can be previewed, I guess. While racing, VRAM utilisation depends on the combination of track and number of cars (also, the game, just like Dirt Rally 2, includes all tracks and all cars regardless of bought DLC, even in the demo).

Now tested without opponents on the same track. Checking with radeontop, I am not hitting VRAM limits (it reports 1757MB):

DXVK v1.9.2 (stock Proton 6.3):
FPS: 68.2
Queue submissions: 5
Draw calls: 1404
Dispatch calls: 0
Render passes: 57
Graphics pipelines: 1814
Compute pipelines: 1
Vidmem heap 0: 1476MB (82%)
Vidmem heap 1: 125MB (4%)
Vidmem heap 2: 148MB (58%)
GPU: 99%

DXVK v1.9.4 (Stock Proton 7):
FPS: 36.3
Queue submissions: 5
Draw calls: 1456
Dispatch calls: 0
Render passes: 57
Graphics pipelines: 1814
Compute pipelines: 1
Vidmem heap 0: 1575MB (87%) 1476MB used
Vidmem heap 1: 144MB (4%) 125MB used
Vidmem heap 2: 176MB (68%) 148MB used
GPU: 100%

doitsujin commented 2 years ago

As I understand it, the code should help with memory pressure. Is there some hidden overhead for the GPU's memory controller or the CPU in certain edge cases?

Not really, and 100% GPU load indicates that the GPU is just busy fetching data from system RAM because there seems to be one allocation that gets paged out into system RAM. Again, this isn't really something we can improve on right now, and I would suggest just using an older DXVK build for this game that just happens to work better. I can't reproduce any performance regression on more powerful hardware either.

ZakMcKrack3n commented 2 years ago

But I already showed that my VRAM is not exhausted, and isn't the 100% GPU load because it renders as many frames as possible? Checking radeontop with v1.9.2 vs v1.9.4 on Proton 7 shows me that v1.9.4 uses LESS VRAM and GTT:

radeontop for v1.9.2:
All values at 100% utilisation except
Vertex Grouper + Tesselator 76%
Texture Addresser 93%
Primitive assembly 76,76%

1987M / 1991 VRAM 99,82%
1903 / 3063 GTT 62,12%

radeontop for v1.9.4:
All pipes pretty much 100% only depth block util. 80%

1761M / 1991 VRAM 88,47%
507 / 3063 GTT 16,54%

So by NOT allocating as much video memory, it somehow saturates the PCIe bus because it accesses system RAM. OK, since I am on PCIe 2.0 and my card is PCIe 3.0 x8, that essentially kills performance.
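
To put rough numbers on that, here is a back-of-the-envelope sketch of the theoretical link bandwidth for an x8 card in a PCIe 2.0 slot versus a PCIe 3.0 slot. The link rates and encodings are the published per-generation figures; real throughput is lower because of protocol overhead, and the helper names are invented for this example.

```python
def pcie_bandwidth_mb_s(gigatransfers_per_s, encoding_efficiency, lanes):
    """Theoretical one-direction payload bandwidth in MB/s (1 MB = 10**6 bytes)."""
    bits_per_s = gigatransfers_per_s * 1e9 * encoding_efficiency * lanes
    return bits_per_s / 8 / 1e6


def per_frame_budget_mb(bandwidth_mb_s, fps):
    """Upper bound on data that can cross the link during one frame."""
    return bandwidth_mb_s / fps


# PCIe 2.0: 5 GT/s per lane with 8b/10b encoding.
GEN2_X8 = pcie_bandwidth_mb_s(5.0, 8 / 10, 8)
# PCIe 3.0: 8 GT/s per lane with 128b/130b encoding.
GEN3_X8 = pcie_bandwidth_mb_s(8.0, 128 / 130, 8)

if __name__ == "__main__":
    print(f"PCIe 2.0 x8: {GEN2_X8:.0f} MB/s")
    print(f"PCIe 3.0 x8: {GEN3_X8:.0f} MB/s")
    # 77 fps is the pre-regression frame rate from the first post.
    print(f"Budget at 77 fps on 2.0 x8: {per_frame_budget_mb(GEN2_X8, 77):.1f} MB/frame")
```

Under these assumptions the 2.0 x8 link tops out at about 4 GB/s, roughly half of what the card could use in a 3.0 slot, so any allocation that the GPU reads from system RAM every frame eats into a budget of about 52 MB per frame at 77 fps.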

I guess you can close it (wontfix :-1: ). I do not have the skill to optimize DXVK code; I just thought that hitting the VRAM limit on 4 GB and later 8 GB cards could impact performance too, but I have no way to test that.

doitsujin commented 2 years ago

But I already showed that my VRAM is not exhausted

This doesn't mean that all our allocations are resident, especially since the driver itself needs some memory too. AMD drivers allow us to oversubscribe VRAM, and the driver can dynamically page out allocations to system RAM and I'm fairly certain that this is exactly what's happening here.

You could try this branch: it reduces the chunk size and disables HVV. If that helps, I might just enable this behaviour by default on cards with 2GB or less. But in general DXVK just won't be a good experience on that kind of hardware; even if our memory management were smarter than it is, we wouldn't have enough control over it to make sure things work well.
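
As a toy illustration of why chunk size matters at all, the sketch below packs a handful of resources into fixed-size chunks and compares how much memory ends up allocated versus actually used. It is not DXVK's allocator: the first-fit policy, the dedicated allocation for oversized resources, and the resource sizes are all assumptions chosen to keep the example small.

```python
def simulate(resource_sizes_mb, chunk_size_mb):
    """First-fit packing into fixed-size chunks.

    Returns (allocated_mb, used_mb). Resources larger than a chunk get a
    dedicated allocation of exactly their own size (an assumption of this
    sketch, not a statement about DXVK).
    """
    chunk_free = []  # free space remaining in each chunk, in MB
    allocated = 0
    for size in resource_sizes_mb:
        if size > chunk_size_mb:
            allocated += size
            continue
        for index, free in enumerate(chunk_free):
            if free >= size:
                chunk_free[index] = free - size
                break
        else:
            # No existing chunk has room, so a whole new chunk is allocated.
            chunk_free.append(chunk_size_mb - size)
            allocated += chunk_size_mb
    return allocated, sum(resource_sizes_mb)


# Hypothetical workload, sizes in MB.
SIZES = [48, 48, 48, 20, 20, 6]

if __name__ == "__main__":
    for chunk in (128, 16):
        allocated, used = simulate(SIZES, chunk)
        print(f"{chunk:>3} MB chunks: {allocated} MB allocated for {used} MB of resources")
```

With this made-up workload, 128 MB chunks leave 66 MB allocated but unused while 16 MB chunks leave 10 MB. On a 2GB card that is already oversubscribed, slack of that order could decide whether the driver has to page something out, although which allocation it picks is outside the application's control.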

ZakMcKrack3n commented 2 years ago

I tried the branch; unfortunately, it exactly matches v1.9.4 performance-wise.

doitsujin commented 2 years ago

Not much we can do then. I'm not going to revert the changes because they genuinely help in other situations.

ZakMcKrack3n commented 2 years ago

OK, I am now in the process of making the chunk size configurable, because it seems this was the right call.

With a chunk size of 16 MB I now get v1.9.2-ish performance (a bit less than stock v1.9.3). I will test various chunk sizes to see what works best.

Right now I am hacking it in as a dxvk option because it is already accessible in dxvk_memory.cpp :+1:

EDIT: Choosing a more demanding scene to have some sort of worst case I got the following results (using current master with new config option):

 size: fps
128MB: 19.1
 64MB: 19.1
 32MB: 41.2
 16MB: 40.2
  8MB: 41.6
  1MB: 44.9
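
The same table expressed relative to the 128MB row makes the cliff between 64MB and 32MB easier to see. This is just arithmetic on the numbers above; the helper name `relative_to` is made up.

```python
# Chunk size in MB -> measured fps, copied from the table above.
RESULTS_FPS = {128: 19.1, 64: 19.1, 32: 41.2, 16: 40.2, 8: 41.6, 1: 44.9}


def relative_to(results, baseline_key):
    """Return each result divided by the result at baseline_key."""
    base = results[baseline_key]
    return {size: fps / base for size, fps in results.items()}


if __name__ == "__main__":
    for size, ratio in relative_to(RESULTS_FPS, 128).items():
        print(f"{size:>3}MB: {ratio:.2f}x")
```

Everything at 32MB and below lands at a bit more than twice the 128MB frame rate, with 1MB the best at roughly 2.35x.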

This also carries over to my initial test, where the lowest chunk size gives over 80 fps and beats v1.9.3's best of 77 fps. If this only affects some games, maybe make it configurable?

I will test with some other games to see if this generally helps my hardware.

ZakMcKrack3n commented 2 years ago

Also tested using more VRAM than available by increasing the texture detail:

Automobilista 2, slightly overcommitting (~110%)

128MB: 25.9
  1MB: 29.5

Automobilista 2, overcommitting (~150%)

128MB: 9.6
  1MB: 7.7

And finally Dirt Rally 2.0 with over 147% mem usage

128MB: 52.6
  1MB: 39.5

So a one-size-fits-nobody situation, I guess :)
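
Put as percentage changes when going from 128MB to 1MB chunks, the three overcommit runs above look like this. Only the fps figures come from the measurements; the scenario labels and the helper name are made up for the example.

```python
# Scenario -> (fps with 128MB chunks, fps with 1MB chunks), from the runs above.
SCENARIOS = {
    "Automobilista 2, ~110%": (25.9, 29.5),
    "Automobilista 2, ~150%": (9.6, 7.7),
    "Dirt Rally 2.0, over 147%": (52.6, 39.5),
}


def percent_change(fps_128mb, fps_1mb):
    """Relative fps change of 1MB chunks versus 128MB chunks, in percent."""
    return (fps_1mb - fps_128mb) / fps_128mb * 100.0


if __name__ == "__main__":
    for name, (large, small) in SCENARIOS.items():
        print(f"{name}: {percent_change(large, small):+.1f}%")
```

Small chunks gain about 14% under mild overcommit but lose about 20% and 25% in the two heavier cases, so no single chunk size wins across these runs.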

doitsujin commented 2 years ago

I'm not interested in making the chunk size configurable, even if 16M works well right now there's no guarantee that it will work in the future since it depends what the driver does behind your back. Not to mention that we already use small chunks for small resources.

Ideally we'd have TTM page out memory allocations that aren't used while keeping all high-priority allocations in VRAM at all times, the information is there, but I don't think the driver does that to the necessary extent at the moment. This probably does already work better on Windows even with DXVK than it does on Linux.

ZakMcKrack3n commented 2 years ago

No problem, I did it only for faster testing. I think everything has been said and done. If there is anything more I should test, say the word; otherwise this issue can be closed.

ZakMcKrack3n commented 2 years ago

Retested with newer DXVK versions:

Sample race, cockpit view from the back of the grid:
v1.9.3 -> 62.2 fps
v1.9.4 -> 27.7 fps
v1.10.0 -> 72.6 fps
v1.10.1 -> 73.1 fps

Sample race menu, looking over the grid:
v1.9.3 -> 39.8 fps
v1.9.4 -> 21.8 fps
v1.10.0 -> 38.9 fps
v1.10.1 -> 38.7 fps

Whatever optimizations took place also fixed the issues I had with v1.9.4. :+1:

ZakMcKrack3n commented 2 years ago

Just for future reference: unfortunately, after a new round of testing, there are still combinations of races with "clear" weather that essentially halve the frame rate with versions newer than 1.9.3 on my RX 460 hardware.

emansom commented 1 year ago

Is there a way to migrate other applications running (e.g. Firefox) from VRAM to system memory (GTT), so that more becomes available for DXVK?

Having GameMode, for example, do that when it is enabled would solve issues on, e.g., an RX 550 that is limited in both PCIe bandwidth (64-bit bus) and VRAM (2 GB).

emansom commented 1 year ago

@yuzhaogoogle MGLRU for GTT next? :smiley:

Blisto91 commented 1 year ago

I do not know of a way, and without any insight I would imagine that would be a job for the kernel or the applications themselves.

I don't know if you know @yuzhaogoogle, but if not then please don't ping random people from random issues. If you have an issue with or a question about dxvk, feel free to open a new one 🙂