Debug Performance Regressions on 0.20 Vulkan

guusw commented 5 months ago

Hey, I recently tried upgrading from 0.19.1 to 0.20 and noticed I get half the performance using vulkan as I did previously.

I did some profiling and noticed all application code is the same except for one vkQueueSubmit taking up 11ms I suspect something is waiting for an operation to complete which did not happen previously.

It's worth nothing that I perform a readback from a texture, although it happens asynchronously using wgpuBufferMapAsync.

Screenshot 2024-05-30 144610

Below I have provided two renderdoc captures (before/after the upgrade) The rendered items are identical, the only difference I could notice is some fences at the start of the 0.19 capture Screenshot 2024-05-30 150928

Let me know if I can provide any other information to help.

renderdoc.7z.zip

AdrianEddy commented 5 months ago

FWIW, I noticed 2x Vulkan performance decrease but not related to wgpu, but after updating the NVIDIA driver from 535 to 550. Is your GPU driver the same?

guusw commented 5 months ago

516.94, will try with the most recent one.

Wumpf commented 5 months ago

Might also be worth trying against the changes on #5681 We've been meaning to put this in a patch release because there's all sort of problems prior to that PR

vntec commented 5 months ago

I've had similar performance problems with Vulkan and v0.20.0.

After upgrading, I noticed mouse input being very unresponsive in my engine. The NVIDIA overlay reported a render latency of ~120ms at 144FPS (Fifo). The frametimes were similar to v0.19.4 (latency was ~11ms). A similarly high latency occurred at higher framerates with VSync off (Immediate). Initially, I suspected my own engine code, but the same problem occurred when I tested the official wgpu examples. The ash examples did not display this behavior. I'm not entirely sure how NVIDIA exactly measures "render latency", although it seems to be something like "time from render queue to actual presentation".

Upgrading my driver to 555.85 did not help.

I suspect this has something to do with the recent synchronization changes in #5681.

Wumpf commented 5 months ago

I suspect this has something to do with the recent synchronization changes in https://github.com/gfx-rs/wgpu/pull/5681.

any chance you could check against latest trunk? We want to release a patch with that PR soon

LukasKalbertodt commented 5 months ago

I also ran into this problem after updating to v0.20. It doesn't seem like the framerate has decreased, but the latency increased a lot in Fifo mode. I also use wgpuBufferMapAsync but never block on it.

I use an Nvidia RTX 470 with driver version 550.78, on Linux, winit 0.30, whole desktop running at 60fps.

I did update to the latest trunk as suggested, but unfortunately that did not fix the issue.

Noticing latency manually is a bit error prone so I did the following: just print to the terminal whenever an button press is detected. The button press leads to camera motion and thus to a notable change in the rendered image. I then screen captured terminal + wgpu window and finally counted the frames between "terminal prints event" and "effect visible in rendered image". I know this is far from a perfect measurement of latency, but the trend is clear: before the wgpu 0.20 update, there were 2 frames delay, with wgpu 0.20 it's 6 frames, with c7458638d14921c7562e4197ddeefa17be413587 it's also 6.

In Immediate present mode, there doesn't seem to be a problem, I think.

vntec commented 5 months ago

I suspect this has something to do with the recent synchronization changes in #5681.

any chance you could check against latest trunk? We want to release a patch with that PR soon

I just checked against the latest trunk and it does seem a bit better. With VSync, the latency is reduced from ~120ms to ~30ms and it's noticeably more responsive. However it's still not as good as v0.19.4 (11-15 ms). I think the problem still applies to Immediate present mode, but the impact is generally less noticeable since the frametimes and latencies are lower (for simple scenes).

jimblandy commented 4 months ago

We would definitely like to address this, but so far in this issue it's hard to discern how we can actually debug this. A reduced test case that we can run locally to see the problem ourselves would make progress a lot easier.

hecrj commented 3 months ago

Just chiming in to say that I have noticed this regression when working on the upgrade in iced, even after #5681—and it is one of the main reasons the upgrade has not been merged yet (https://github.com/iced-rs/iced/pull/2417).

However, it seems I can only reproduce the regression in debug mode. release builds seem to both have similar performance. Talking about render time, specifically. Not latency.

Wumpf commented 3 months ago

@guusw @vntec did you check on release mode? For Debug we'd only care about regressions if they make things completely unusable for a wide range of trivial usecases - one can always configure cargo.toml such that wgpu is compiled with optimizations even when the rest of the application is not. Compiling wgpu without optimizations is only truly relevant when trying to find issues in wgpu(-core/-hal) itself

hecrj commented 3 months ago

@Wumpf Which packages should I exactly choose to optimize?

I am adding opt-level = 3 to the wgpu crates in 0.20 but still struggling to obtain the same render time as with no optimizations whatsoever in 0.19. That seems a bit odd to me.

Wumpf commented 3 months ago

you might also need to add opt-level = 3 for wgpu-core and wgpu-hal, don't think that setting is transitive

hecrj commented 3 months ago

@Wumpf Ah! Disabling debug-assertions did the trick!

[profile.dev.package.wgpu]
debug-assertions = false

[profile.dev.package.wgpu-hal]
debug-assertions = false

[profile.dev.package.wgpu-core]
debug-assertions = false

[profile.dev.package.wgpu-types]
debug-assertions = false

opt-level does not seem to have much of an impact. Maybe 0.20 introduced a lot more debug_assert! in some hotpath?

hecrj commented 3 months ago

I narrowed it to wgpu-types:

[profile.dev.package.wgpu-types]
debug-assertions = false

This has the biggest impact on performance in debug builds.

EDIT: More experimenting indicates that both 0.19 and 0.20 seem to have same debug render time with debug-assertions disabled.

Disabling debug-assertions for 0.19 has little impact, while 0.20 seems to have more debug checks.

Wumpf commented 3 months ago

nice! thank you for digging into this!!

I'm making the bold move of closing this as won't fix then, if it's just extra debug assertions we're hitting here I'm not concerned about perf :)

@ everyone else: Please re-open if there's reason to believe there's regressions beyond extra debug assertions or it can be shown that certain debug assertions are too excessive.

vntec commented 3 months ago

Thanks for the investigation, everyone! @Wumpf Yes, I was running in release mode, but I'll check if disabling debug assertions does the trick.

EDIT: My render latency problem is gone now. It was probably the synchronization changes interacting with a specific driver version.

janhohenheim commented 3 months ago

@Wumpf could we reopen this? Bevy users are running into it while running their game in debug mode per the linked issue. Usually I would agree that performance is a minor concern for debug assertions, but the change in FPS reported seems staggering: 60 FPS vs 460 FPS for one user and 20 FPS vs 60 FPS for another

hecrj commented 1 month ago

I investigated a bit more and traced the issue down all the way to InstanceFlags::from_build_config:

https://github.com/gfx-rs/wgpu/blob/dfc384a7fd4ab7250a75d59c6f831d9ffb220f7e/wgpu-types/src/lib.rs#L1052-L1058

Specifically, it seems the InstanceFlags::VALIDATION flag enabled in InstanceFlags::debugging is the culprit:

https://github.com/gfx-rs/wgpu/blob/dfc384a7fd4ab7250a75d59c6f831d9ffb220f7e/wgpu-types/src/lib.rs#L1038-L1040

Thus, the issue can be circumvented downstream by manually passing InstanceFlags::empty() to wgpu::Instance::new—at the expense of validation in debug builds.

Any ideas why validation is way more expensive since 0.20?

cwfitzgerald commented 1 month ago

I can't check on my phone, but I think that's about when we enabled synchronization validation by default.

hecrj commented 1 month ago

Can confirm I get ~300 FPS more in the cube example if I comment these lines (enabling PresentMode::Immediate):

https://github.com/gfx-rs/wgpu/blob/dfc384a7fd4ab7250a75d59c6f831d9ffb220f7e/wgpu-hal/src/vulkan/instance.rs#L798-L800

In any case, I have circumvented the problem in iced by introducing a strict-assertions feature flag that can be enabled for internal development; since I see no reason for users of iced to run validation layers unless they are trying to debug a core issue.

This results in an actual speedup on debug builds, after all! Therefore, this issue can be closed on my end.

Wumpf commented 1 month ago

Hmm if it can have such a big impact maybe we shouldn't enable this on all debug builds after all We could have two levels of validation InstanceFlags::VALIDATION and InstanceFlags::VALIDATION_SLOW and only enable the former in debug builds if at all 🤔

cwfitzgerald commented 1 month ago

We already do lol, we only turn on gpu based validation if you ask for "advanced_debugging".

Syncval is a lot more useful to us that GBV, which is why we enabled it.

Wumpf commented 1 month ago

well but synchronization validation doesn't fall under gpu based validation, right? So if we were to not enable that by default we'd have to split it out of regular validation. And judging from this ticket we really should strongly consider not to enable this by default.

gfx-rs / wgpu

Debug Performance Regressions on 0.20 Vulkan #5756