felixdoerre / primus_vk

Vulkan GPU-offloading layer
BSD 2-Clause "Simplified" License
Measuring performance #31

Closed jeromegn closed 5 years ago

jeromegn commented 5 years ago

So now that I've been able to use primus_vk and happily run my games, I've noticed there's a big performance difference compared with Windows, or even with my Intel GPU.

On Windows, running Path of Exile with my "Nvidia GTX 1070 with Max-Q Design", I get pretty great FPS at 4K resolution (around 60-70 FPS). My monitor is 60 Hz, so I only care about 60 FPS. I could even do with 30 FPS if it was stable.

On Linux I get around 15-20 FPS at the same resolution. Of course there's Wine and other overhead involved.

I'm not sure what's causing it exactly, but I decided to try and measure the performance outside of wine to reduce variable influences. I settled on vkmark, installed via pacman on Arch.

For reference: I'm using a Razer 15 2018 with the beefiest processor: Intel(R) Core(TM) i7-8750H CPU (goes up to 4 GHz, 12 logical cores).

My integrated GPU is: Intel(R) UHD Graphics 630 (Coffeelake 3x8 GT2)

Here are the results running on the Intel GPU:

$ vkmark -b vertex:duration=5.0:interleave=false -b vertex:duration=5.0:interleave=false
Xlib:  extension "NV-GLX" missing on display ":0".
Xlib:  extension "NV-GLX" missing on display ":0".
=======================================================
    vkmark 2017.08
=======================================================
    Vendor ID:      0x8086
    Device ID:      0x3E9B
    Device Name:    Intel(R) UHD Graphics 630 (Coffeelake 3x8 GT2)
    Driver Version: 75509764
=======================================================
[vertex] duration=5.0:interleave=false: FPS: 5468 FrameTime: 0.183 ms
[vertex] duration=5.0:interleave=false: FPS: 5925 FrameTime: 0.169 ms
=======================================================
                                   vkmark Score: 5696
=======================================================

Running on the Nvidia GPU:

$ ENABLE_PRIMUS_LAYER=1 optirun --no-xorg vkmark -b vertex:duration=5.0:interleave=false -b vertex:duration=5.0:interleave=false
PrimusVK: CreateInstance
PrimusVK: Getting devices
PrimusVK: Searching for display GPU:
PrimusVK: 0x56395b36b380:
PrimusVK: 0x56395b36b380:
PrimusVK: 0x56395b36b380:
PrimusVK: Got integrated gpu!
PrimusVK: Device: Intel(R) UHD Graphics 630 (Coffeelake 3x8 GT2)
PrimusVK:   Type: 1
PrimusVK: Searching for render GPU:
PrimusVK: 0x56395b36b380:
PrimusVK: Got discrete gpu!
PrimusVK: Device: GeForce GTX 1070 with Max-Q Design
PrimusVK:   Type: 2
PrimusVK: in function: creating device
PrimusVK: Extension: 3
PrimusVK: Extension: 48
PrimusVK: Extension: 48
PrimusVK: spawning secondary device creation: 0x56395b4523d8
PrimusVK: After reset:0
PrimusVK: fetching dispatch for 0x56395b56b510
PrimusVK: GetDeviceProcAddr is: 0x7f5dc7082eb0
PrimusVK: Create Swapchain KHR is: 0x7f5dc7094e40
PrimusVK: CreateDevice done
PrimusVK: Thread running
PrimusVK: getting rendering suff: 0x56395b36b380
PrimusVK: Gpus: 2
PrimusVK: phys[1]: 0x7f5da8000be0
PrimusVK: render queues: 1
PrimusVK:  flags: 7
PrimusVK: in function: creating device
PrimusVK: Extension: 3
PrimusVK: Extension: 48
PrimusVK: Extension: 48
PrimusVK: Support: 0x56395b36b380, 1
PrimusVK: joining secondary device creation
PrimusVK: When startup hangs here, you have probably hit the initialization deadlock
PrimusVK: fetching dispatch for 0x7f5da8004d50
PrimusVK: GetDeviceProcAddr is: 0x7f5dc7082eb0
PrimusVK: Create Swapchain KHR is: 0x7f5dc7094e40
PrimusVK: CreateDevice done
PrimusVK: Create Graphics FINISHED!: 0
PrimusVK: Display: 0x7f5da8004d50
PrimusVK: storing as reference to: 0x56395b56b510
PrimusVK: joining succeeded. Luckily initialization deadlock did not occur.
PrimusVK: Application requested 3 images.
PrimusVK: Creating Swapchain for size: 800x600
PrimusVK: MinImageCount: 3
PrimusVK: fetching device for: 0x56395b56b510
PrimusVK: found: 0x7f5da8004d50
PrimusVK: FamilyIndexCount: 1
PrimusVK: Dev: 0x7f5da8004d50
PrimusVK: Swapchainfunc: 0x7f5dc7094e40
PrimusVK: >> Swapchain create done 0;0x56395b627ff0
PrimusVK: Image aquiring: 3
PrimusVK: Selected render mem: 9;7 display: 0
PrimusVK: Creating image: 800x600
PrimusVK: Creating image: 800x600
PrimusVK: Creating image: 800x600
PrimusVK: Creating image: 800x600
PrimusVK: Creating image: 800x600
PrimusVK: Creating image: 800x600
PrimusVK: Creating image: 800x600
PrimusVK: Creating image: 800x600
PrimusVK: Creating image: 800x600
PrimusVK: Creating a Swapchain thread.
PrimusVK: Get Swapchain Images buffer: 0x56395b627e80
PrimusVK: Count: 3
=======================================================
    vkmark 2017.08
=======================================================
    Vendor ID:      0x10DE
    Device ID:      0x1BA1
    Device Name:    GeForce GTX 1070 with Max-Q Design
    Driver Version: 1753923584
=======================================================
[vertex] duration=5.0:interleave=false: FPS: 266 FrameTime: 3.759 ms
[vertex] duration=5.0:interleave=false: FPS: 269 FrameTime: 3.717 ms
=======================================================
                                   vkmark Score: 267
=======================================================
PrimusVK: >> Destroy swapchain: 0x56395b627ff0

I'm using the headless trick to run without Xorg, but I've had similar results with Xorg.

I'm also running nvtop to see what's happening with my GPU. It's barely breaking a sweat when I'm running vkmark. When I'm playing Path of Exile, it barely goes above 25%, sometimes hits up to 50% of both GPU and Memory usage.

I'm far from understanding how any of this works, but I'm hoping benchmarking might help find some performance culprit.

jeromegn commented 5 years ago

Oh, I see from the other issue (#6) that any benchmark will probably be slow due to the copying bottleneck.

felixdoerre commented 5 years ago

I think that depends on what the benchmark is doing. Primus_vk introduces a rather big constant overhead per frame, so I'd say that the benchmark with 5000 FPS on Intel only is still ok with 267 FPS with primus_vk. A benchmark that renders fewer, but more computationally expensive images should run faster with primus_vk.
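The "constant overhead per frame" argument can be checked with a little arithmetic on the vkmark scores above. This is a rough sketch, not a measurement: it assumes the Intel-only run approximates the cost of the benchmark itself and that the overhead adds serially to every frame. It also ignores that the Nvidia GPU renders heavy frames faster than the Intel one. The function names are made up for illustration.

```python
def frame_time_ms(fps: float) -> float:
    """Convert a frame rate into the time spent on one frame, in milliseconds."""
    return 1000.0 / fps


def fps_with_overhead(native_fps: float, overhead_ms: float) -> float:
    """Frame rate after adding a fixed per-frame cost to every frame."""
    return 1000.0 / (frame_time_ms(native_fps) + overhead_ms)


# vkmark scores reported in this thread.
intel_fps = 5696
primus_fps = 267

# Assumption: the whole difference in frame time is offloading overhead.
overhead_ms = frame_time_ms(primus_fps) - frame_time_ms(intel_fps)
print(f"estimated per-frame overhead: {overhead_ms:.2f} ms")

# The same fixed cost matters less and less as frames get more expensive.
for native in (5696, 1000, 120, 60):
    offloaded = fps_with_overhead(native, overhead_ms)
    print(f"{native:>5} FPS without overhead -> {offloaded:6.1f} FPS with it")
```

Under these assumptions the overhead comes out at roughly 3.6 ms per frame, which dominates a 0.18 ms frame but is a modest share of a 16.7 ms one.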

To better understand the performance behaviour of the game you're trying to run, you can enable profiler tracing. Just compile primus_vk with this macro (https://github.com/felixdoerre/primus_vk/blob/master/primus_vk.cpp#L79) uncommented and the line above it commented out. That should give a trace output with enough timing information to understand what is causing the waiting.

A "blind" shot at a possible performance improvement would be to increase the number of images allocated for presentation from 3 to 4, 5, or 6. You can do this by changing the number in https://github.com/felixdoerre/primus_vk/blob/master/primus_vk.cpp#L760. Currently primus_vk uses at least 3 threads to copy images around. When you increase this number, primus_vk can use more CPU cores to copy images, which could lead to better performance.

jeromegn commented 5 years ago

Thanks, I tried 6 and then 12 threads (given I have 12 logical cores). It did seem to help a little.

I also turned on the profiling and got a whole lot of:

PrimusVK-profiling: -1 71801376566 Acquire starting
PrimusVK-profiling: 9 71801517066 got image
PrimusVK-profiling: 9 71801573989 Acquire done
PrimusVK-profiling: 9 71802790419 QueuePresent
PrimusVK-profiling: 9 71805981806 memcpy start
PrimusVK-profiling: -1 71848071485 Acquire starting
PrimusVK-profiling: 10 71848122589 got image
PrimusVK-profiling: 10 71848265314 Acquire done
PrimusVK-profiling: 8 71855621347 memcpy done
PrimusVK-profiling: 8 71855677786 copy queued
PrimusVK-profiling: 8 71855702006 submitting
PrimusVK-profiling: 10 71866551018 QueuePresent
PrimusVK-profiling: 10 71872074797 memcpy start
PrimusVK-profiling: -1 71909883837 Acquire starting
PrimusVK-profiling: 11 71909941349 got image
PrimusVK-profiling: 11 71909995194 Acquire done
PrimusVK-profiling: 11 71929023063 QueuePresent
PrimusVK-profiling: 11 71935012419 memcpy start
PrimusVK-profiling: -1 71967899756 Acquire starting
PrimusVK-profiling: 0 71967946092 got image
PrimusVK-profiling: 0 71968070839 Acquire done
PrimusVK-profiling: 0 71986812283 QueuePresent
PrimusVK-profiling: 0 71990238655 memcpy start
PrimusVK-profiling: 9 71995189528 memcpy done
PrimusVK-profiling: 9 71995348843 copy queued
PrimusVK-profiling: 9 71995364157 submitting
PrimusVK-profiling: -1 72039732284 Acquire starting
PrimusVK-profiling: 1 72039800490 got image
PrimusVK-profiling: 1 72039915793 Acquire done
PrimusVK-profiling: 1 72057408472 QueuePresent
PrimusVK-profiling: 1 72064182292 memcpy start
PrimusVK-profiling: 10 72072176187 memcpy done
PrimusVK-profiling: 10 72072257674 copy queued
PrimusVK-profiling: 10 72072266431 submitting
PrimusVK-profiling: -1 72101441629 Acquire starting
PrimusVK-profiling: 2 72102816732 got image
PrimusVK-profiling: 2 72102962567 Acquire done
PrimusVK-profiling: 2 72119083705 QueuePresent
PrimusVK-profiling: 2 72124162416 memcpy start
PrimusVK-profiling: 11 72148296684 memcpy done
PrimusVK-profiling: 11 72148349457 copy queued
PrimusVK-profiling: 11 72148374080 submitting
PrimusVK-profiling: -1 72172630350 Acquire starting
PrimusVK-profiling: 3 72172753239 got image
PrimusVK-profiling: 3 72174599588 Acquire done
PrimusVK-profiling: 3 72189428685 QueuePresent
PrimusVK-profiling: 3 72194251657 memcpy start
PrimusVK-profiling: 0 72219273143 memcpy done
PrimusVK-profiling: 0 72219462352 copy queued
PrimusVK-profiling: 0 72219476432 submitting
PrimusVK-profiling: -1 72242020682 Acquire starting
PrimusVK-profiling: 4 72242130788 got image
PrimusVK-profiling: 4 72247533966 Acquire done
PrimusVK-profiling: 4 72257107204 QueuePresent
PrimusVK-profiling: 4 72261649132 memcpy start
PrimusVK-profiling: 1 72283227366 memcpy done
PrimusVK-profiling: 1 72283353334 copy queued
PrimusVK-profiling: 1 72283365181 submitting
PrimusVK-profiling: -1 72305905138 Acquire starting
PrimusVK-profiling: 5 72305956055 got image
PrimusVK-profiling: 5 72306066603 Acquire done
PrimusVK-profiling: 5 72324390347 QueuePresent
PrimusVK-profiling: 5 72327753304 memcpy start
PrimusVK-profiling: 2 72338093247 memcpy done
PrimusVK-profiling: 2 72338207821 copy queued
PrimusVK-profiling: 2 72338217658 submitting
PrimusVK-profiling: -1 72377732749 Acquire starting
PrimusVK-profiling: 6 72377797984 got image
PrimusVK-profiling: 6 72377920706 Acquire done
PrimusVK-profiling: 6 72396288602 QueuePresent
PrimusVK-profiling: 3 72399434393 memcpy done
PrimusVK-profiling: 6 72399726108 memcpy start
PrimusVK-profiling: 3 72402095027 copy queued
PrimusVK-profiling: 3 72402105705 submitting
PrimusVK-profiling: -1 72446361413 Acquire starting
PrimusVK-profiling: 7 72446415871 got image
PrimusVK-profiling: 7 72446524377 Acquire done
PrimusVK-profiling: 7 72464491584 QueuePresent
PrimusVK-profiling: 4 72468703965 memcpy done
PrimusVK-profiling: 7 72470047909 memcpy start
PrimusVK-profiling: 4 72471682773 copy queued
PrimusVK-profiling: 4 72471701472 submitting
PrimusVK-profiling: -1 72516057923 Acquire starting
PrimusVK-profiling: 8 72516193263 got image
PrimusVK-profiling: 8 72523450978 Acquire done
PrimusVK-profiling: 5 72533446026 memcpy done
PrimusVK-profiling: 5 72533564037 copy queued
PrimusVK-profiling: 5 72533573109 submitting
PrimusVK-profiling: 8 72534902381 QueuePresent
PrimusVK-profiling: 8 72540720264 memcpy start
PrimusVK-profiling: -1 72579260855 Acquire starting
PrimusVK-profiling: 9 72579425003 got image
PrimusVK-profiling: 9 72590652352 Acquire done
PrimusVK-profiling: 9 72597776064 QueuePresent
PrimusVK-profiling: 9 72601255001 memcpy start
PrimusVK-profiling: 6 72613831357 memcpy done
PrimusVK-profiling: 6 72613983478 copy queued
PrimusVK-profiling: 6 72613995123 submitting
PrimusVK-profiling: -1 72653830951 Acquire starting
PrimusVK-profiling: 10 72653879241 got image
PrimusVK-profiling: 10 72653985116 Acquire done
PrimusVK-profiling: 10 72672710215 QueuePresent
PrimusVK-profiling: 10 72676005736 memcpy start
PrimusVK-profiling: 7 72689437690 memcpy done
PrimusVK-profiling: 7 72689578354 copy queued
PrimusVK-profiling: 7 72689591752 submitting
PrimusVK-profiling: -1 72720682054 Acquire starting
PrimusVK-profiling: 11 72720813566 got image
PrimusVK-profiling: 11 72724583555 Acquire done
PrimusVK-profiling: 11 72739403615 QueuePresent
PrimusVK-profiling: 11 72744978668 memcpy start
PrimusVK-profiling: 8 72755442182 memcpy done
PrimusVK-profiling: 8 72755493765 copy queued
PrimusVK-profiling: 8 72755518137 submitting
PrimusVK-profiling: -1 72784594975 Acquire starting
PrimusVK-profiling: 0 72784704228 got image
PrimusVK-profiling: 0 72785177090 Acquire done
PrimusVK-profiling: 0 72803473191 QueuePresent
PrimusVK-profiling: 0 72809210434 memcpy start
PrimusVK-profiling: 9 72818670007 memcpy done
PrimusVK-profiling: 9 72818733816 copy queued
PrimusVK-profiling: 9 72818767740 submitting
PrimusVK-profiling: -1 72878556821 Acquire starting
PrimusVK-profiling: 1 72878623293 got image
PrimusVK-profiling: 1 72879356058 Acquire done
PrimusVK-profiling: 10 72888003600 memcpy done
PrimusVK-profiling: 10 72888066081 copy queued
PrimusVK-profiling: 10 72888087784 submitting
PrimusVK-profiling: 1 72896032340 QueuePresent
PrimusVK-profiling: 1 72901569342 memcpy start
PrimusVK-profiling: -1 72960987774 Acquire starting
PrimusVK-profiling: 2 72961083441 got image
PrimusVK-profiling: 11 72974874330 memcpy done
PrimusVK-profiling: 11 72975046513 copy queued
PrimusVK-profiling: 11 72975062217 submitting
PrimusVK-profiling: 2 72988826184 Acquire done
PrimusVK-profiling: 2 72989032221 QueuePresent
PrimusVK-profiling: 2 72996339861 memcpy start
PrimusVK-profiling: -1 73068333524 Acquire starting
PrimusVK-profiling: 3 73068390003 got image
PrimusVK-profiling: 0 73069304720 memcpy done
PrimusVK-profiling: 0 73069438045 copy queued
PrimusVK-profiling: 0 73069452308 submitting
PrimusVK-profiling: 3 73087773993 Acquire done
PrimusVK-profiling: 3 73088275889 QueuePresent
PrimusVK-profiling: 3 73095523108 memcpy start
PrimusVK-profiling: -1 73167817223 Acquire starting
PrimusVK-profiling: 4 73167948595 got image
PrimusVK-profiling: 1 73206610348 memcpy done
PrimusVK-profiling: 1 73206685556 copy queued
PrimusVK-profiling: 1 73206694523 submitting
PrimusVK-profiling: 4 73210322692 Acquire done
PrimusVK-profiling: 4 73210602348 QueuePresent
PrimusVK-profiling: 4 73217756515 memcpy start
PrimusVK-profiling: -1 73317105962 Acquire starting
PrimusVK-profiling: 5 73317249510 got image
PrimusVK-profiling: 2 73330331499 memcpy done
PrimusVK-profiling: 2 73330482015 copy queued
PrimusVK-profiling: 2 73330495535 submitting
PrimusVK-profiling: 5 73339819123 Acquire done
PrimusVK-profiling: 5 73340004058 QueuePresent
PrimusVK-profiling: 5 73346782274 memcpy start
PrimusVK-profiling: 3 73411415755 memcpy done
PrimusVK-profiling: 3 73411603917 copy queued
PrimusVK-profiling: 3 73411616903 submitting
PrimusVK-profiling: -1 73421134775 Acquire starting
PrimusVK-profiling: 6 73421202457 got image
PrimusVK-profiling: 6 73464462438 Acquire done
PrimusVK-profiling: 6 73464569302 QueuePresent
PrimusVK-profiling: 6 73470276665 memcpy start
PrimusVK-profiling: -1 73535871578 Acquire starting
PrimusVK-profiling: 7 73536010721 got image
PrimusVK-profiling: 4 73544760458 memcpy done
PrimusVK-profiling: 4 73544951283 copy queued
PrimusVK-profiling: 4 73544965206 submitting
PrimusVK-profiling: 7 73566121062 Acquire done
PrimusVK-profiling: 7 73566232549 QueuePresent
PrimusVK-profiling: 7 73573079983 memcpy start
PrimusVK-profiling: -1 73657253732 Acquire starting
PrimusVK-profiling: 8 73657379326 got image
PrimusVK-profiling: 5 73660469395 memcpy done
PrimusVK-profiling: 5 73660652750 copy queued
PrimusVK-profiling: 5 73660670001 submitting
PrimusVK-profiling: 8 73685601867 Acquire done
PrimusVK-profiling: 8 73685799239 QueuePresent
PrimusVK-profiling: 8 73692303970 memcpy start
PrimusVK-profiling: 6 73784357282 memcpy done
PrimusVK-profiling: 6 73784655628 copy queued
PrimusVK-profiling: 6 73784682394 submitting
PrimusVK-profiling: 7 73877683146 memcpy done
PrimusVK-profiling: 7 73877740276 copy queued
PrimusVK-profiling: 7 73877747920 submitting
PrimusVK-profiling: 8 73979959592 memcpy done
PrimusVK-profiling: 8 73980143914 copy queued
PrimusVK-profiling: 8 73980155913 submitting

I assume that's all normal. If it's really all copying, then more threads processing images should help? Maybe more memory? I only have 16GB on this laptop, but I could upgrade it.

Somehow I'm not sure Steam uses my primus_vk wrappers. It doesn't output anything specific to Primus in the logs. I also compiled a "wrong" libnv wrapper (I'm on Arch and forgot to change the driver path) and it still started and played my game just fine! Lutris wouldn't even open (the app) if I didn't fix the driver path. Lutris outputs PrimusVK logs.

I don't understand how it could run with proton without primus_vk given bumblebee and vulkan not being supported unless I use primus_vk. Very odd.

felixdoerre commented 5 years ago

Please be aware that more threads for image copying increases the number of images in flight and thereby the latency of your application. Let's unpack what the performance trace says. I've subtracted the base time so the lines are easier to read:

PrimusVK-profiling: -1 0ms Acquire starting
PrimusVK-profiling: 10 0.051104ms got image
PrimusVK-profiling: 10 0.193829ms Acquire done
....
PrimusVK-profiling: 10 18.479533ms QueuePresent
PrimusVK-profiling: 10 24.003312ms memcpy start
PrimusVK-profiling: -1 61.812352ms Acquire starting
PrimusVK-profiling: 11 61.869864ms got image

So what's happening here? The application obtains an image to render to from primus_vk. That takes 0.19 ms. The application then takes 18 ms to submit all render commands for the image. That's pretty long. If the application takes 18 ms per frame, the framerate cannot be higher than about 55 FPS (1000 / 18).

After that, primus_vk submits a present job. The present job is not ready until 24 ms, but the application already has control again, because QueuePresent returns immediately. The application then does other work until 61 ms before requesting the next image to render to. At about 61.8 ms per frame, that gives roughly 16 FPS (1000 / 61.8). I cannot tell why the application waits or what it waits for. So the question is: why does the application need so much time between "Acquire done" and "QueuePresent", and between "QueuePresent" and the next "Acquire starting"? I think this trace shows that the game is spending most of the time in its main loop, not in primus_vk.
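The same breakdown can be computed for every frame in a trace rather than by hand. The sketch below assumes the line format seen in this thread (`PrimusVK-profiling: <image index> <timestamp> <event>`), that timestamps are in nanoseconds (consistent with the conversion above), and that the application-side events appear in order in the log. The function names are illustrative and not part of primus_vk.

```python
import re

# Matches lines such as "PrimusVK-profiling: 10 71848122589 got image".
LINE = re.compile(r"PrimusVK-profiling: (-?\d+) (\d+) (.+)")


def parse(trace: str):
    """Yield (image_index, time_ns, event) for each profiling line."""
    for line in trace.splitlines():
        m = LINE.match(line.strip())
        if m:
            yield int(m.group(1)), int(m.group(2)), m.group(3).strip()


def frame_loop(trace: str):
    """Return (render_ms, after_present_ms) per frame.

    render_ms:        "Acquire done"  -> "QueuePresent"
    after_present_ms: "QueuePresent"  -> next "Acquire starting"
    Copy-thread events (memcpy ..., submitting) are ignored.
    """
    frames = []
    acquire_done = present = None
    for _, t, event in parse(trace):
        if event == "Acquire done":
            acquire_done = t
        elif event == "QueuePresent" and acquire_done is not None:
            present = t
        elif event == "Acquire starting" and present is not None:
            frames.append(((present - acquire_done) / 1e6, (t - present) / 1e6))
            acquire_done = present = None
    return frames


# Lines for image 10, copied from the trace posted above.
SAMPLE = """\
PrimusVK-profiling: -1 71848071485 Acquire starting
PrimusVK-profiling: 10 71848122589 got image
PrimusVK-profiling: 10 71848265314 Acquire done
PrimusVK-profiling: 10 71866551018 QueuePresent
PrimusVK-profiling: 10 71872074797 memcpy start
PrimusVK-profiling: -1 71909883837 Acquire starting
"""

for render_ms, after_ms in frame_loop(SAMPLE):
    print(f"render: {render_ms:.2f} ms, after present: {after_ms:.2f} ms")
```

Running it over a full log, instead of the six sample lines, shows whether the long gap after QueuePresent is consistent across frames or only occasional.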

Additionally from the trace we can see that presenting of the previous images takes approximately 3 additional images to be submitted in between (e.g. image 4 finishes while image 7 is submitted), so I'd expect no additional performance gain from increasing copy parallelism.
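The "image 4 finishes while image 7 is submitted" observation can also be counted mechanically. This is a minimal sketch under the same assumptions about the log format and nanosecond timestamps; it also assumes an image index is not presented again before its copy finishes. `frames_in_flight` is a hypothetical helper, not something primus_vk provides.

```python
import re

LINE = re.compile(r"PrimusVK-profiling: (-?\d+) (\d+) (.+)")


def frames_in_flight(trace: str):
    """For each finished copy, return (image, present_to_done_ms, later_presents).

    later_presents counts the QueuePresent events for other images that
    were logged between this image's QueuePresent and its "memcpy done".
    """
    present_times = []   # every QueuePresent timestamp seen so far
    pending = {}         # image index -> its QueuePresent timestamp
    results = []
    for line in trace.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue
        image, t, event = int(m.group(1)), int(m.group(2)), m.group(3).strip()
        if event == "QueuePresent":
            pending[image] = t
            present_times.append(t)
        elif event == "memcpy done" and image in pending:
            start = pending.pop(image)
            later = sum(1 for p in present_times if start < p < t)
            results.append((image, (t - start) / 1e6, later))
    return results


# Lines for images 4 to 7, copied from the trace posted above.
SAMPLE = """\
PrimusVK-profiling: 4 72257107204 QueuePresent
PrimusVK-profiling: 4 72261649132 memcpy start
PrimusVK-profiling: 5 72324390347 QueuePresent
PrimusVK-profiling: 6 72396288602 QueuePresent
PrimusVK-profiling: 7 72464491584 QueuePresent
PrimusVK-profiling: 4 72468703965 memcpy done
"""

for image, ms, later in frames_in_flight(SAMPLE):
    print(f"image {image}: {ms:.1f} ms from present to copy done, "
          f"{later} later presents in between")
```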

As to how proton can run without primus_vk: When you use optirun without primus_vk enabled the game should take the integrated GPU (as the other has no presentable surface). So the game should run perfectly fine and ignore the dedicated GPU (which will not even load if the wrapper driver is not correctly installed).

jeromegn commented 5 years ago

That's very interesting. Thanks for the rundown!

I noticed if I set the threads to 4, I get an FPS of about 24 just idling in "town". When I crank it to 8, I get around 30-32 FPS. I'm not sure how that would be affected by actual gameplay with more going on. It does seem to stutter a bit more with more threads, I can see why that would happen.

I'm pretty sure it's proton eating the logs somehow, because I'm sure my Nvidia GPU is in use:

image

Yet, no logs with $ grep -i "primus" ~/steam-238960.log. Maybe they're outputted somewhere else though? I'm running it with PROTON_LOG=1.

So anyway, it sounds like what you're telling me is that the time spent in the game is what's affecting my FPS the most. Adding the overhead of primus_vk (and wine) is likely what's making the FPS bad enough that it's hard to play on Linux. I played on Windows last night and I noticed my FPS wasn't as good as I thought it was. It hung around at 35-50 when there's some action happening. With that in mind, the overhead from primus_vk isn't that bad, but it's just enough to make this hard to play.

I attempted running the game with nvidia-xrun, but wine gave me page faults and vkmark just segfaulted outright. I couldn't test running this game without the primus_vk variable.

bno1 commented 5 years ago

> I attempted running the game with nvidia-xrun, but wine gave me page faults and vkmark just segfaulted outright. I couldn't test running this game without the primus_vk variable.

You can try booting your system on the Nvidia GPU so you don't have to use primus or nvidia-xrun. You can find a guide at [1]. One thing that is not explained there is that you have to disable the bumblebee service before rebooting.

@felixdoerre if you plan to optimize the image copying, it may be worth looking at what the devs of LookingGlass did in this regard [2].

> Additionally from the trace we can see that presenting of the previous images takes approximately 3 additional images to be submitted in between (e.g. image 4 finishes while image 7 is submitted), so I'd expect no additional performance gain from increasing copy parallelism.

So if I understand this correctly there is a latency of 3 frames between what the rendering GPU outputs and what the presenting GPU outputs? Where is this coming from?

[1] https://wiki.archlinux.org/index.php/NVIDIA_Optimus#Using_nvidia
[2] https://forum.level1techs.com/t/level1-diagnostic-fixing-our-memcpy-troubles-for-looking-glass-level-one-techs/127985

felixdoerre commented 5 years ago

@bno1 Any idea for optimizing the image copying is welcome. However, I currently believe that it is not the copying itself that is slow, but the access to the remotely mapped memory from the dedicated GPU. So I'm not sure if anything from the LookingGlass link is applicable to primus_vk.

Regarding latency: yes, primus_vk increases latency. The added latency is simply the time it takes to copy the target frame to main memory, where the integrated GPU can pick it up for display. However, I don't have exact measurements, and a latency of 3 frames is normal for Vulkan applications, I guess. When an application has requested 3 swapchain images, it does not know when vkQueuePresentKHR will really show the image. So with 3 images it's just one image more than regular double buffering. Increasing the number of copy threads increases this latency, of course: not measured in time, but measured in frames (as the frames will now get displayed faster).
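The distinction between latency in time and latency in frames can be made concrete with a small calculation. The 100 ms copy latency below is a hypothetical placeholder, not a measured value; the two frame rates are the ones reported earlier in the thread for 4 and 8 copy threads.

```python
def latency_in_frames(latency_ms: float, fps: float) -> float:
    """Express a fixed wall-clock latency as a number of frame intervals."""
    return latency_ms * fps / 1000.0


# Hypothetical value for illustration only; the real copy latency was not measured.
COPY_LATENCY_MS = 100.0

# Same latency in time, but more frames are produced while one is in transit.
for fps in (24, 32):
    frames = latency_in_frames(COPY_LATENCY_MS, fps)
    print(f"{fps} FPS: {COPY_LATENCY_MS:.0f} ms latency = {frames:.1f} frames behind")
```

If the copy time per image stays the same while the frame rate rises, the delay in milliseconds is unchanged but more frames are queued behind each displayed one.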

jeromegn commented 5 years ago

Quick update: The game is actually playable now. Getting around 30-40 FPS in most cases and since this isn't a first person shooter, it's fine.

I noticed I was getting very bad performance as soon as my CPU was doing something else (like Firefox tabs doing too much). If I close Firefox or other heavier apps while I play, it's much smoother. I didn't have that issue on Windows because there weren't as many CPU-intensive tasks competing with the graphics processing. With primus_vk, the CPU needs to be available for the copying and processing of images.

All in all, I'd say this is working great. I get a framerate similar to Windows, if I'm careful.