nical opened this issue 4 months ago
I ran the renderpass benchmarks through superluminal (with `[profile.bench] debug = 1` in `Cargo.toml`), and all logging seems to be completely disabled during them, so I'm not sure how useful they are for diagnosing wgpu's logging overhead.
> and all logging seems to be completely disabled during them
They are still compiled in, though? I assumed so far that the entire logging performance discussion is on the premise that most logs are not rendered (either displayed on a console or written to disk). Log rendering performance depends on which logger is used (e.g. `env_logger`) and what the target is (some terminals are more horribly slow than others). I.e. we want to measure the overhead of the calls to/interaction with the `log` crate, and that should be included here, yes?
I checked that they are indeed compiled in. They should be useful if the workload is a good enough proxy for a game (so it would still be useful to check using bevy itself).
> I assumed so far that the entire logging performance discussion is on the premise that most logs are not rendered
That is my understanding as well.
> I assumed so far that the entire logging performance discussion is on the premise that most logs are not rendered
All the bevy performance results have been with only `info`-level logs enabled, and with wgpu's logs filtered out, so there's almost no logging output (basically just device info at the start, and maybe outputting an fps counter once a second).
> All the bevy performance results have been with only `info`-level logs enabled
Assuming you mean `info` level enabled at runtime (`log::max_level() >= Info`) and none of the log levels filtered out at compile time (`log::STATIC_MAX_LEVEL == Trace`). Sounds like we are on the same page. That matches wgpu's benchmark configuration as far as trace-level logs are concerned, at least.
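For reference, here is a rough sketch of the two checks involved (a simplification, not the exact macro expansion): `STATIC_MAX_LEVEL` is fixed at compile time by the `log` crate's `max_level_*` / `release_max_level_*` cargo features, while `max_level()` reads the global atomic set by whichever logger is installed.

```rust
use log::{Level, STATIC_MAX_LEVEL};

// Simplified illustration: a trace! call is only emitted if it passes both
// the compile-time cap and the runtime cap.
fn trace_would_be_emitted() -> bool {
    // Compile-time: a constant comparison the optimizer can remove entirely.
    Level::Trace <= STATIC_MAX_LEVEL
        // Runtime: the global atomic read whose overhead is being discussed here.
        && Level::Trace <= log::max_level()
}
```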
A lot of logging was removed in #6065, so it'll be important to pick a recent revision of wgpu for new measurements. It turns out d3d12 had a fair amount more logging than the other backends, which might explain discrepancies between results measured by different people (I've been benchmarking the vulkan backend on linux so far). Backends should now produce a more similar volume of logs.
I have previously observed that logs which are ultimately filtered out, but which aren't filtered out by the global `log::max_level()`, can be much more expensive. The significant overhead of filtered-out logs seen with bevy led me to suspect this was happening.
The benchmarks in wgpu don't include a logger, so the `max_level` will default to `OFF` and should filter out all logs fairly efficiently. Since I suspected the significant overhead might be due to missing this `max_level` filtering, I sought to confirm this by reproducing the logging configuration from the bevy report. However, it wasn't clear to me what the logging configuration was there, so my best guess was to use bevy's current default logging setup and see what that looks like.
I took https://github.com/bevyengine/bevy/blob/0070bdccd873e186a9ccec3ecb6f1d3d83b8710b/crates/bevy_log/src/lib.rs, removed the irrelevant portions, and simplified it to:
```rust
use tracing_subscriber::{prelude::*, registry::Registry, EnvFilter};

// Default to info level, with wgpu and naga logs mostly filtered out.
let filter = "wgpu=error,naga=warn".to_string();
let level = tracing::Level::INFO;
let default_filter = format!("{},{}", level, filter);

// Use RUST_LOG if set, otherwise fall back to the default filter above.
let filter_layer = EnvFilter::try_from_default_env()
    .or_else(|_| EnvFilter::try_new(&default_filter))
    .unwrap();
let fmt_layer = tracing_subscriber::fmt::Layer::default().with_writer(std::io::stderr);
let subscriber = Registry::default().with(filter_layer).with(fmt_layer);

// Bridge `log` records into `tracing`, then install the subscriber globally.
tracing_log::LogTracer::init().unwrap();
tracing::subscriber::set_global_default(subscriber).unwrap();
```
Notably, `LogTracer::init()` is the interface between `log` and `tracing`, and it isn't configured with a max level (this could be derived from the filter). Ignoring cases where the user's filter selectively enables trace for something unrelated (which would set the global max level to that), I suspect fixing this particular aspect would significantly alleviate the issue.
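As a minimal sketch of what that fix could look like in the setup above (the `Info` level here simply mirrors the default filter; a real implementation would presumably derive it from the `EnvFilter`), this uses the `init_with_filter` variant mentioned further down in the thread:

```rust
use tracing_log::LogTracer;

// Instead of LogTracer::init(), install the log -> tracing bridge with an
// explicit max level, so that log::max_level() filters out trace/debug
// records from `log`-based crates (like wgpu) before they ever reach tracing.
LogTracer::init_with_filter(tracing_log::log::LevelFilter::Info).unwrap();
```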
To test this out, I did some benchmarks, starting with the commit before the log cleanup in #6065, using:
```
cargo bench -p wgpu-benchmark -- renderpass
```
(0) 7ff80d65fcf099afa4b87a953487805bfdd058e5
(1) 7ff80d65fcf099afa4b87a953487805bfdd058e5 with logging setup to match bevy defaults (compared against 0):
(2) 1 with max level passed to LogTracer
(compared against 1):
NOTE: The benchmarks varied a bit for me and I didn't put any effort into creating a stable benchmarking environment. E.g., I've seen differences as significant as -10% in other runs comparing 1 vs 2.
(3) 9c6ae1beae2490ce44d99034f7f1faada936f3d8 with changes from 1 (compared against 1)
Here we look at the impact of the log cleanup when logs aren't being filtered out immediately using the max level.
(4) 1 with trace level logs statically disabled (compared against 1)
I think it would be interesting to see the impact of the change from 2 in bevy's `many_cubes` example and how it compares to statically disabling trace level logging.
Going back to the topic of selectively enabling trace-level logs: in my previous investigations, `tracing`'s event generation handles this a lot better than `log` because the filtering result is cached in a per-call-site static. I think it would be worth considering the use of `tracing` in wgpu to mitigate the performance impact of enabling trace-level logs for unrelated code. I noticed that `tracing_log` has an interest cache that users can enable to do some caching of the filtering. I haven't experimented with this, but since it has to occur behind a function call and is otherwise limited by not being part of the logging macro, I suspect there is still some noticeable overhead compared to using `tracing` directly.
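To illustrate the idea, here is a deliberately simplified, hypothetical sketch of per-call-site caching (not `tracing`'s actual implementation): the result of the filter check is stored in a static at the call site, so repeated hits read a call-site-local atomic instead of re-running the filtering each time.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const UNKNOWN: u8 = 0;
const DISABLED: u8 = 1;
const ENABLED: u8 = 2;

// Hypothetical macro: each expansion gets its own INTEREST static, so the
// (potentially expensive) filter decision is made once per call site and
// then cached. Real `tracing` also invalidates these caches when filters
// change, which this sketch omits.
macro_rules! cached_trace {
    ($($arg:tt)*) => {{
        static INTEREST: AtomicU8 = AtomicU8::new(UNKNOWN);
        let enabled = match INTEREST.load(Ordering::Relaxed) {
            ENABLED => true,
            DISABLED => false,
            _ => {
                let enabled = log::log_enabled!(log::Level::Trace);
                INTEREST.store(if enabled { ENABLED } else { DISABLED }, Ordering::Relaxed);
                enabled
            }
        };
        if enabled {
            log::trace!($($arg)*);
        }
    }};
}

fn main() {
    // Behaves like log::trace!, but the "is this call site enabled?" decision
    // is only computed on the first hit and cached afterwards.
    cached_trace!("encoding render pass command");
}
```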
> Debug builds being orders of magnitude slower than release builds, I believe it is safe to assume trace-level logging overhead to be negligible (unless someone has benchmarks to show otherwise).
In my experience, the typical slowness of the default debug build configuration leads to enabling optimizations for the debug/development profile to make things actually usable. So I don't think the slowness of default debug builds can be relied on to make the logging overhead negligible.
Thanks @Imberflur! Sounds like there is some digging to do on the bevy side to configure the tracing/logging in the most efficient way.
I'm operating under the assumption that the overhead associated with logs that are filtered out by `log::max_level()` is wgpu's responsibility, but if `log::max_level()` is not configured to filter out trace-level logs, then the associated overhead should be fixable by the downstream crate/app.
I'll rebase #6010 and see if it makes a difference on top of your `init_logging` function (with `LogTracer::init_with_filter(tracing_log::log::LevelFilter::Info)`).
> Going back to the topic of selectively enabling trace level logs. In my previous investigations, tracing's event generation handles this a lot better than log because the filtering result is cached in a per call site static. I think it would be worth considering the use of tracing in wgpu to mitigate the performance impact of enabling trace level logs for unrelated code.
Interesting. @ErichDonGubler recently suggested switching to `tracing` because it has more bells and whistles. The other maintainers felt that wgpu didn't need the extra capability, but if we get free performance out of switching, then it sounds like it would make sense to do so.
Context
Bevy folks measured that wgpu's trace-level logging, when enabled at build time but disabled at run time, can have a significant overhead (15%). This overhead comes from checking a global atomic variable that defines the log level.
In #6017, @Elabajaba proposes gating api logs and resource logs (trace level by default) behind a build flag.
I'm not fond of adding another feature flag for this, so I would like to explore other options first.
Disable trace level logs at build time
The most straightforward option is to note that trace level logging is the most verbose of the 5 logging levels. If the most verbose level is too verbose for a bevy application, then bevy could disable trace-level logging in optimized builds. This removes the performance concern for bevy while letting its dependencies, and bevy itself, keep a log level for potentially verbose logs that are useful for development or for debugging particularly tricky issues. Firefox, for example, disables both the trace and debug log levels in optimized builds by default for the same reason.
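As a sketch of what this could look like for a downstream application (an assumption about how one might wire it up, not something bevy currently does), the `log` crate exposes `release_max_level_*` cargo features that cap `STATIC_MAX_LEVEL` in release builds only:

```toml
# Hypothetical downstream Cargo.toml: trace! and debug! calls in wgpu (and
# any other `log`-based dependency) compile to nothing in release builds,
# while remaining available in debug builds.
[dependencies]
log = { version = "0.4", features = ["release_max_level_info"] }
```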
Debug builds being orders of magnitude slower than release builds, I believe it is safe to assume trace-level logging overhead to be negligible (unless someone has benchmarks to show otherwise).
I'm tempted to leave it at that. That said, 15% overhead is a lot and there should be ways to mitigate some of it without requiring more build configurations on the wgpu side.
Reduce the overhead of some of the logs
For context, wgpu's repository has a few benchmarks. Removing all trace-level logs yields improvements ranging from 0 to 5% on these benchmarks. This doesn't mean that reducing the logging overhead will improve bevy by at most 5%; it only means that the highest improvement we can observe through these benchmarks is in that range. We'll have to scale these numbers a bit to transpose them to other projects, and getting some performance measurements from bevy itself would be valuable.
Resource logs
Resource logs only happen when creating and destroying API objects. For a typical application or game that's fairly rare (on the order of a few times per frame), so it's not worth optimizing. Removing them entirely does not affect wgpu's benchmarks.
API logs
API logs can be very verbose. Most of the API log traffic in a typical frame is in render and compute passes, with one log per command (and games tend to submit a lot of commands per render pass). In #6010 I proposed to check the log level once at the beginning of the render pass instead of once per command. The idea was that checking a local variable instead of a global atomic at a high-ish frequency should be faster. I ran wgpu's render pass benchmarks on two computers and the numbers (2) are puzzling: a mixed bag of improvements and regressions in the [-1.4% .. +1.7%] range.
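A minimal sketch of the approach (hypothetical types and names; the actual change is in #6010): hoist the level check out of the per-command loop and branch on a local instead.

```rust
use log::{log_enabled, trace, Level};

// Hypothetical stand-in for wgpu's render pass commands.
#[derive(Debug)]
enum Command {
    SetPipeline(u32),
    Draw { vertices: u32 },
}

// Read the global log level once per pass and reuse the result for every
// command, instead of letting each trace! re-check log::max_level().
fn encode_render_pass(commands: &[Command]) {
    let api_logging = log_enabled!(Level::Trace);
    for command in commands {
        if api_logging {
            trace!("RenderPass command: {:?}", command);
        }
        // ... actual command encoding would happen here ...
    }
}

fn main() {
    encode_render_pass(&[Command::SetPipeline(0), Command::Draw { vertices: 3 }]);
}
```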
Removing the api logs entirely was just as puzzling. The results (3) show either no changes or regressions.
Why? I don't know, I'm curious. It's easy to speculate but digging into it would take time. I ran the measurements multiple times on two computers and the results are consistent.
Other trace-level logs
There's a fair amount of trace-level logging outside of api and resource logs. Some of them might not be useful anymore and could be removed.
I'll open a PR to remove some of it and maintainers can chime in to "save" any log they care about. We already discussed keeping api, resource and initialization related logs.
Conclusions
It's hard to draw conclusions at this point, other than that optimized builds are faster without trace-level logs, and that focusing on resource and api logs yields inconclusive results.
Measurements
Focusing on the render pass benchmarks here since other benchmarks were mostly not affected by the changes.
(1) Disable all trace-level logs
(2) Reduce the overhead of api logs in render passes
By checking the global log level once per render pass instead of once per command (#6010).
(3) Remove api logs at build time