bkaradzic / bgfx

Cross-platform, graphics API agnostic, "Bring Your Own Engine/Framework" style rendering library.
https://bkaradzic.github.io/bgfx/overview.html
BSD 2-Clause "Simplified" License
14.59k stars 1.92k forks source link

Tracy Integration. #3308

Closed mcourteaux closed 3 weeks ago

mcourteaux commented 3 weeks ago

Vulkan screenshot: image

OpenGL screenshot: image

Tracy associates GPU zones to CPU zones:

image

In the above screenshot, I'm hovering on the rendererSubmit GPU zone with the cursor, and it highlights the time-zone from where this came in the CPU timeline.

bkaradzic commented 3 weeks ago

How do you build & run client?

mcourteaux commented 3 weeks ago

That's your task, separately. When using Tracy, you typically clone it and compile the UI once, and install it. Compiling the UI is quite a long process, and makes a big build folder. I use Tracy in several projects, and I have one compiled Tracy UI installed on my system. An explanation in the bgfx documentation for that is probably welcome.

But, to be complete: this is documented in the Tracy Manual, in section 2.3 "Building the server". Note that the Tracy UI (what you called the client) is in the Tracy project referred to as "the server", which is confusing, because it doesn't run a server from a networking perspective.

bkaradzic commented 3 weeks ago

What's preferred integration is that bgfx takes all responsiblity of collecting data, and then have way to pass all that data to Tracy or other profiler, without any modifications to bgfx.

Most of stuff that this function implements is already available in bx+bgfx and no need to use Tracy's implementation of it.

tracy_force_inline void __bgfx_tracy_vulkan_begin_literal( tracy::VkCtx* m_ctx, const tracy::SourceLocationData* srcloc, VkCommandBuffer cmdbuf)
{
    using namespace tracy;
    const auto queryId = m_ctx->NextQueryId();
    CONTEXT_VK_FUNCTION_WRAPPER( vkCmdWriteTimestamp( cmdbuf, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, m_ctx->GetQueryPool(), queryId ) );

    auto item = Profiler::QueueSerial();
    MemWrite( &item->hdr.type, QueueType::GpuZoneBeginSerial );
    MemWrite( &item->gpuZoneBegin.cpuTime, Profiler::GetTime() );
    MemWrite( &item->gpuZoneBegin.srcloc, (uint64_t)srcloc );
    MemWrite( &item->gpuZoneBegin.thread, GetThreadHandle() );
    MemWrite( &item->gpuZoneBegin.queryId, uint16_t( queryId ) );
    MemWrite( &item->gpuZoneBegin.context, m_ctx->GetId() );
    Profiler::QueueSerialFinish();
}

Instead integrating Tracy or other profilers directly into bgfx.

bkaradzic commented 3 weeks ago

I already had similar style integration with Remotery (see: https://github.com/bkaradzic/bgfx/commit/f0971eda83bfbf0f14606c2e10bd2be6920102f3), and I found that it's mistake to do deep integrations like that. It becomes too dependent on specific profiler, and it's inflexible. That's why I think it's better to collect all data by bgfx, and then pass it to whatever profiler user chooses to use.

mcourteaux commented 3 weeks ago

What's preferred integration is that bgfx takes all responsiblity of collecting data, and then have way to pass all that data to Tracy or other profiler, without any modifications to bgfx.

I'm a bit puzzled by your concern. The goal was to not change bgfx, which I didn't for a very few very specific places. Of course if you want to optionally enable something and let the user choose, you will need a few macros. The place where some more specific glue-code is needed, is nicely tucked away far outside of the main bgfx codebase. Do you want to delete Tracy? Just delete the folder and the GENIE option and the 4 macros in the code that conditionally enable it. That's how lightly coupled I tried to make it.

That's why I think it's better to collect all data by bgfx, and then pass it to whatever profiler user chooses to use.

I would like to bring to your attention that it's absolutely non-trivial to have "non-deep integration" with any GPU-aware profiler, and still have the profilers correctly link the CPU and GPU zones. Tracy registers IDs for the GPU timer queries, and associates those with the timed scope of the CPU work. Imagine you want to forbid these "deep" integrations: how would you be able to correctly associate CPU and GPU time for any profiler framework? One potential "generic interface" might work for one framework, but might require a whole lot of workaround-bookkeeping to manage the information correctly before it can be sent to the profiler framework, simply because it didn't fit the style the data is recorded and passed in the "generic interface".

In general, I think the 3 macros: PROFILER_(BEGIN; BEGIN_LITERAL; END) are a decent abstraction, and you can tell by the fact that this works, because your profiler interface, Tracy, and Remotery all fitted it. That's why I think a profiler-specific integration is useful: it can be mapped to the macros, but the macros can hide a lot more interesting behavior that would be otherwise hard (cpu-gpu association) or impossible (not copying over names, and line numbers, and filenames, etc in case of begin_literal) to achieve. For example, Tracy uses a highly optimized strategy of communicating which scope is referred to. That is possible due to macros that cleverly insert static constexpr that contains source code location information, which Tracy sends over the network connection once, and afterwards refers to it using the address of the static variable. Tricks like this require deep integration.

Additionally, I invite you to take a look at the work that Tracy does regarding calibrating the GPU timestamps in the Vulkan implementation (public/tracy/TracyVulkan.hpp). That's the sort of work I wouldn't want to have to rethink and reimplement.

Regarding the code fragment you highlighted: this is just a copy-paste from a part of the Tracy code, but adapted painfully, because bgfx doesn't nicely use scoped blocks to be profiled. Making everything profiled with scoped profiling blocks would remove pretty much all of the glue code I added now (which you highlighted). The reason I did that is because the goal was to not change bgfx.

It's very easy to make a few changes to bgfx and have the profiling follow the scoping approach; in which case stuff would become much simpler.

Most of stuff that this function implements is already available in bx+bgfx and no need to use Tracy's implementation of it.

Well, someone would have to do it. Tracy is an excellent profiler in my opinion, and I think that using it makes sense. Reinventing the wheel here is a lot more work than this 2 days I worked on this.

bkaradzic commented 3 weeks ago

I would like to bring to your attention that it's absolutely non-trivial to have "non-deep integration" with any GPU-aware profiler, and still have the profilers correctly link the CPU and GPU zones.

CPU/GPU "linked" profiling already exist here: https://github.com/bkaradzic/bgfx/blob/b66f60cba0ec49873cfb1cbedffac0eca38250d8/src/renderer.h#L497-L532

That is possible due to macros that cleverly insert static constexpr that contains source code location information, which Tracy sends over the network connection once, and afterwards refers to it using the address of the static variable. Tricks like this require deep integration.

That doesn't require deep integration. _filePath and _line are always literals. And bgfx also has concept where name is literal too.

https://github.com/bkaradzic/bgfx/blob/b66f60cba0ec49873cfb1cbedffac0eca38250d8/include/bgfx/bgfx.h#L490-L526

Well, someone would have to do it. Tracy is an excellent profiler in my opinion, and I think that using it makes sense. Reinventing the wheel here is a lot more work than this 2 days I worked on this.

This is not argument about Tracy. It's about integration with bgfx...

mcourteaux commented 3 weeks ago

That doesn't require deep integration. _filePath and _line are always literals.

That still doesn't allow you to do optimized Tracy profiling. Tracy's ZoneScopedNC does this:

#define ZoneNamedNC( varname, name, color, active )                                          \
   static constexpr tracy::SourceLocationData TracyConcat(__tracy_source_location,TracyLine) \
      { name, TracyFunction,  TracyFile, (uint32_t)TracyLine, color };                       \
   tracy::ScopedZone varname( &TracyConcat(__tracy_source_location,TracyLine), active )

You see how the information about what is profiled, is kept separately in a static constexpr SourceLocationData, which is later only referenced by address.

This sort of trick, you can't do without macros expanding into something more interesting than a virtual function call.

CPU/GPU "linked" profiling already exist here:

I don't know what that does, as it looks conceptually broken... How can a GPU timer be ended, and the result immediately stored in a ViewStats struct? The Vulkan backend has 3 frames in flight, and so when a timer is ended on CPU, it'll end only 50ms later on GPU. This can't work...


What do you suggest? Define an interface that will work with any profiler you'd like to hook up, that doesn't require profiler-specific macros, and I'll make it.