gfx-rs / wgpu

A cross-platform, safe, pure-Rust graphics API.
https://wgpu.rs
Apache License 2.0

multi_draw_indirect_* on Metal #2148

Open FredrikNoren opened 2 years ago

FredrikNoren commented 2 years ago

Is your feature request related to a problem? Please describe. I'd like to implement GPU-side culling, and from what I understand multi_draw_indirect_* is a key component of this. It would be amazing if this could be supported on OSX as well. (The _count versions would be even better, but as far as I understand they're not available in Metal? I could be wrong.)

Describe the solution you'd like multi_draw_indirect_* available on OSX/Metal

Describe alternatives you've considered I guess for now I can do a manual for-loop and issue one draw_*_indirect call per command (which I would still build on the GPU, for the gpu side culling), to simulate multi_draw*.

Additional context @cwfitzgerald, @jasperrlz and I talked about this yesterday in the matrix chat, figured I'd also post an issue to track this. Also, from the discussion in #742 it sounded like there could be a way to do it?

FredrikNoren commented 2 years ago

Hm, so I tried just adding the feature flag to the accepted list of feature flags for Metal, and it seemed to just work (I'm running that on a 2020 MacBook Air, Big Sur): https://github.com/FredrikNoren/wgpu/commit/7da68cb6c9477bebd1a92c647b9fb00822b82e16

kvark commented 2 years ago

This one does indeed work, but the _count versions are not yet implemented in Metal:

    unsafe fn draw_indirect_count(
        &mut self,
        _buffer: &super::Buffer,
        _offset: wgt::BufferAddress,
        _count_buffer: &super::Buffer,
        _count_offset: wgt::BufferAddress,
        _max_count: u32,
    ) {
        //TODO
    }
    unsafe fn draw_indexed_indirect_count(
        &mut self,
        _buffer: &super::Buffer,
        _offset: wgt::BufferAddress,
        _count_buffer: &super::Buffer,
        _count_offset: wgt::BufferAddress,
        _max_count: u32,
    ) {
        //TODO
    }

We could try to implement it. Or we could try to split the feature in two (count and non-count).
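For context, the `_count` variants (mirroring Vulkan's `vkCmdDrawIndexedIndirectCount`) read the actual draw count from a GPU buffer at `count_offset` and clamp it by the CPU-supplied `max_count`. A minimal CPU-side sketch of that clamping; the struct and function names here are made up for illustration and are not wgpu API:

```rust
// Hypothetical model of the indexed indirect arguments (not wgpu types).
#[derive(Clone, Copy, Debug, PartialEq)]
struct DrawIndexedIndirectArgs {
    index_count: u32,
    instance_count: u32,
    first_index: u32,
    base_vertex: i32,
    first_instance: u32,
}

/// Returns the slice of commands that `draw_indexed_indirect_count` would
/// actually execute: the GPU-written count, clamped by the CPU-supplied
/// `max_count` and by the buffer size.
fn resolve_count_draws(
    commands: &[DrawIndexedIndirectArgs],
    gpu_count: u32,
    max_count: u32,
) -> &[DrawIndexedIndirectArgs] {
    let n = (gpu_count.min(max_count) as usize).min(commands.len());
    &commands[..n]
}

fn main() {
    let cmds = vec![
        DrawIndexedIndirectArgs {
            index_count: 6,
            instance_count: 1,
            first_index: 0,
            base_vertex: 0,
            first_instance: 0,
        };
        8
    ];
    // The GPU wrote a count of 5, but the CPU promised at most 4 draws.
    assert_eq!(resolve_count_draws(&cmds, 5, 4).len(), 4);
    // A GPU-written count below `max_count` wins.
    assert_eq!(resolve_count_draws(&cmds, 2, 4).len(), 2);
}
```

The hard part on Metal is that this clamp has to happen GPU-side, since the count lives in a buffer the CPU never reads back.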

cwfitzgerald commented 2 years ago

Ah, I didn't realize we emulated multi-draw indirect on Metal with a for loop of single indirect draws. Using this idea, the way we could implement MDIC without a CPU readback is by dispatching a compute shader which copies the commands and zeroes out the draws above the count. This is getting dangerously close to "emulation", and I would rather just split the feature.
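The copy-and-zero pass described above can be modeled on the CPU like this. This is only a sketch under the assumption that a zero `instance_count` makes a draw a no-op; in wgpu this would be a compute shader in the Metal backend, and all names here are hypothetical:

```rust
// Hypothetical model of the non-indexed indirect arguments (not wgpu types).
#[derive(Clone, Copy, Debug, PartialEq)]
struct DrawIndirectArgs {
    vertex_count: u32,
    instance_count: u32,
    first_vertex: u32,
    first_instance: u32,
}

/// CPU model of the proposed compute pass: keep the first `gpu_count`
/// commands as-is and zero `instance_count` on the rest, so the excess
/// draws are still issued but do nothing.
fn copy_and_zero(src: &[DrawIndirectArgs], gpu_count: u32) -> Vec<DrawIndirectArgs> {
    src.iter()
        .enumerate()
        .map(|(i, &cmd)| {
            if (i as u32) < gpu_count {
                cmd
            } else {
                DrawIndirectArgs { instance_count: 0, ..cmd }
            }
        })
        .collect()
}

fn main() {
    let src = vec![
        DrawIndirectArgs { vertex_count: 3, instance_count: 1, first_vertex: 0, first_instance: 0 };
        4
    ];
    let out = copy_and_zero(&src, 2);
    assert_eq!(out[1].instance_count, 1); // below the count: untouched
    assert_eq!(out[2].instance_count, 0); // above the count: neutered
    assert_eq!(out[3].instance_count, 0);
}
```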

Edit: It's already split, there are two features: MDI and MDIC

FredrikNoren commented 2 years ago

To add some more to this issue; this is now my #1 CPU blocker on Metal:

[Screenshot of profiler capture, 2022-03-17 14:13]

On Windows, the CPU is basically idle for the same scene (~15ms, "drop render pass" doesn't even show up). This is my rendering code:

#[cfg(not(target_os = "macos"))]
{
    render_pass.multi_draw_indexed_indirect(
        &cull_state.commands.buffer(),
        offset * std::mem::size_of::<DrawIndexedIndirect>() as u64,
        mat.entities.len() as u32,
    );
}
#[cfg(target_os = "macos")]
{
    for i in 0..mat.entities.len() {
        render_pass.draw_indexed_indirect(
            &cull_state.commands.buffer(),
            (offset + i as u64) * std::mem::size_of::<DrawIndexedIndirect>() as u64,
        );
    }
}
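The offset arithmetic above relies on the indirect commands being tightly packed at a 20-byte stride. The struct below is only a local mirror of what the five-field indexed layout (as in Vulkan's `VkDrawIndexedIndirectCommand`) is expected to look like; check the actual `DrawIndexedIndirect` definition in your wgpu version:

```rust
use std::mem::size_of;

/// Local mirror of the indexed indirect argument layout: five 32-bit
/// fields. Field names here are assumptions, not the wgpu definition.
#[repr(C)]
struct DrawIndexedIndirect {
    vertex_count: u32, // number of indices to draw
    instance_count: u32,
    base_index: u32,
    vertex_offset: i32,
    base_instance: u32,
}

fn main() {
    // Each command occupies 20 bytes, so command `i` starts at byte 20 * i.
    assert_eq!(size_of::<DrawIndexedIndirect>(), 20);
    let byte_offset = |i: u64| i * size_of::<DrawIndexedIndirect>() as u64;
    assert_eq!(byte_offset(0), 0);
    assert_eq!(byte_offset(3), 60);
}
```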

I was looking into whether I could add Indirect Command Buffers to the Metal HAL myself, but I'm honestly quite lost. Any pointers or suggestions would be appreciated! Or if there's some other way to reduce the CPU overhead (I was looking at the instance_count field, but I'm really not sure how I could use it).
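On the instance_count question: one common pattern for GPU culling that avoids both MDI-count and per-draw CPU work is to keep the command list at a fixed size and have the culling pass write `instance_count = 0` for rejected entities, so every draw is still issued but culled ones do no work. Sketched on the CPU with hypothetical types; a real version would do this in the culling compute shader:

```rust
#[derive(Clone, Copy)]
struct Sphere {
    center: [f32; 3],
    radius: f32,
}

#[derive(Clone, Copy, Debug, PartialEq)]
struct Draw {
    instance_count: u32, // other indirect fields elided
}

/// Signed distance from a point to a plane (normal, d); the normal is
/// assumed unit length and pointing toward the inside of the frustum.
fn plane_distance(plane: ([f32; 3], f32), p: [f32; 3]) -> f32 {
    plane.0[0] * p[0] + plane.0[1] * p[1] + plane.0[2] * p[2] + plane.1
}

/// Zero `instance_count` for every entity whose bounding sphere lies fully
/// outside any frustum plane; the draw stays in the buffer as a no-op.
fn cull(draws: &mut [Draw], bounds: &[Sphere], planes: &[([f32; 3], f32)]) {
    for (draw, s) in draws.iter_mut().zip(bounds) {
        let visible = planes
            .iter()
            .all(|&pl| plane_distance(pl, s.center) >= -s.radius);
        if !visible {
            draw.instance_count = 0;
        }
    }
}

fn main() {
    // One plane: keep everything with x >= 0 (normal +x, d = 0).
    let planes = [([1.0, 0.0, 0.0], 0.0)];
    let bounds = [
        Sphere { center: [2.0, 0.0, 0.0], radius: 1.0 },  // inside
        Sphere { center: [-5.0, 0.0, 0.0], radius: 1.0 }, // outside
    ];
    let mut draws = [Draw { instance_count: 1 }, Draw { instance_count: 1 }];
    cull(&mut draws, &bounds, &planes);
    assert_eq!(draws[0].instance_count, 1);
    assert_eq!(draws[1].instance_count, 0);
}
```

The trade-off is that the per-draw CPU encoding cost in the macOS fallback loop above stays the same; this only removes the GPU cost of the culled draws.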

kvark commented 2 years ago

@FredrikNoren thank you for sharing!

Indirect Command Buffers (ICBs): we wanted to use them as the implementation for GPU-side RenderBundle execution. So the road to this would be: first, add a feature flag (used internally by wgpu-hal) for native render bundle support; add the relevant API in wgpu-hal; implement it on Metal; and then hook the real WebGPU RenderBundle logic up to this wgpu-hal feature (when supported). It's a relatively big chunk of work.

Perhaps you could run this workload through Metal's System Trace to see what the sampling profiler reports. Maybe indirect calls on Metal are just that slow? Or maybe we are doing something silly in wgpu-core.

FredrikNoren commented 2 years ago

@kvark That sounds great! Would love to try to see if switching to those RenderBundles would help in the future then. Is there any timeline when that might be available?

I ran it through the Metal System Trace, but to be honest I'm not 100% sure how to read the results. Here's the trace: https://drive.google.com/file/d/1pqWsBappybinzbDF7PTPD51VddDi7tit/view?usp=sharing

kvark commented 2 years ago

It's hard to put any timeline on this, since there is nobody ready to take on this complex task, and I'm no longer actively implementing new features. I'll try to keep it on my radar for the moment of a sudden inflow of free time :) I opened your trace, and unfortunately I can't see much there since there are no symbols.

[Screenshot of the trace, 2022-03-19 21:29]

I do see something a little strange in this picture, though: the encoders. There appear to be many encoders open simultaneously, closed only at the end of the frame. Perhaps you could try keeping fewer of them open?

cwfitzgerald commented 2 years ago

So, a bit more context here: there are two different ways that we will interact with Metal indirect command buffers in wgpu.

The first is through render bundles. These allow the user to record a series of commands on the CPU, then replay them multiple times over the course of a program. This maps fairly directly to indirect command buffers and just needs the work done; the plan of attack has been set. If you don't need to change the number of calls or their arguments, this can be a good solution.

The second is through multi-draw indirect, which allows a shader to write out a series of draw calls that will be executed as a single "blast" of draw calls. This is most useful for GPU culling and other places where the GPU wants to generate an arbitrary number of draw calls. Right now, the design of wgpu's multi-draw indirect requires that a compute shader be added in the Metal backend which translates the Vulkan-style multi-draw-indirect buffer into a Metal indirect command buffer. This is far from the most efficient implementation, and I would love to see a world where we implement something that lets shaders write directly to ICBs.

kvark commented 2 years ago

@cwfitzgerald side note: so that's where your proposal comes in? With an indirect command encoder being an API primitive, we'd be able to implement it purely on the host side, without writing from compute shaders. That seems like a solid argument, although I'm not sure it's enough to warrant the API complication.

FredrikNoren commented 2 years ago

Writing to ICBs from shaders sounds great!

In the meantime; is there any way to circumvent webgpu to get access to metal directly, or would a fork be my best option?

FredrikNoren commented 2 years ago

@kvark Re. the many encoders: I think it's actually just two encoders (I only open two in the code), and what's showing is each render pass. Also not sure why symbols don't show up; it's built with debug = true (but in release mode).

cwfitzgerald commented 2 years ago

@kvark we already have ICBs on the host side: render bundles (we haven't implemented them that way yet, but we can).

My proposal was to add a multi-draw-indirect-like ICB as a WGSL buffer type so that a user's compute shader can freely write to it. This would let Vulkan, DX12, and Metal resolve to different ICB code (Vulkan a "normal" MDI buffer, DX12 an MDI buffer with 3 push constants, and Metal an actual ICB). This also lets us do the bounds checking during the write, without needing a separate compute shader to validate.
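The "bounds checking during the write" point can be illustrated with a CPU model. Everything below is hypothetical; the real check would live in the generated shader code, not in host-side Rust:

```rust
#[derive(Clone, Copy, Default, Debug, PartialEq)]
struct DrawArgs {
    vertex_count: u32,
    instance_count: u32,
    first_vertex: u32,
    first_instance: u32,
}

/// CPU model of a bounds-checked command-buffer write: out-of-range
/// writes are dropped rather than corrupting adjacent memory.
struct CommandBuffer {
    slots: Vec<DrawArgs>,
}

impl CommandBuffer {
    fn write(&mut self, slot: usize, args: DrawArgs) -> bool {
        match self.slots.get_mut(slot) {
            Some(s) => {
                *s = args;
                true
            }
            None => false,
        }
    }
}

fn main() {
    let mut cb = CommandBuffer { slots: vec![DrawArgs::default(); 4] };
    let args = DrawArgs { vertex_count: 3, instance_count: 1, ..DrawArgs::default() };
    assert!(cb.write(2, args));   // in range: the write lands
    assert!(!cb.write(10, args)); // out of range: silently dropped
    assert_eq!(cb.slots[2], args);
}
```

Doing this at write time means no second validation pass over the buffer is needed before execution.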