This is indeed a feature that makes it more difficult to use. What we gain from it is very lightweight render pass recording. Explanation follows.
wgpu is designed to fully support multi-threading. We do this by having whole storages of different objects locked upon access, so generally, touching anything has a CPU cost. If we had to access each of the refcounts of the resources used by render commands, we'd be paying that cost during render pass recording, which is hot code. With the recent changes, recording a pass is very lightweight, with no locks involved.
Most of the dependent resources are meant to outlive the pass anyway. Only a few, like the index buffers you create dynamically, become problematic. Generally, unless you are creating many resources during recording, it's easy to work around. If you are doing that, you aren't on a good path performance-wise, and should consider creating one big buffer per frame instead.
Another way to possibly address this is to have wgpu-rs ensure the lifetimes of the objects by other means, like keeping a refcount. (Note: I'm talking about wgpu-rs here, not wgpu, which already keeps a refcount, but we'd need to lock an object storage to access it.) This is far-fetched and not something I'm excited about :)
Heads-up to @mitchmindtree, who is on 0.4 and will face this issue soon. It would be good to know how much this would affect their case.
Most of the dependent resources are meant to outlive the pass anyway. Only a few, like the index buffers you create dynamically, become problematic.
I think the issue might be a bit more severe. I put the buffers into a Vec that would persist between calls, with the intent of clearing it right before the next recording. As soon as you reference a buffer in the Vec (or wherever) from set_index_buffer, the Vec's lifetime becomes "linked" to the RenderPass<'render> lifetime in a mutable borrow. This prevents accessing it again.
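To make the borrow conflict concrete, here is a self-contained model of that pattern (plain Rust, no real wgpu types; Pass<'a> stands in for RenderPass<'a>). It deliberately fails to compile:

struct Buffer;

// `Pass<'a>` models RenderPass<'a>: it holds borrows of every resource it uses.
struct Pass<'a> {
    used: Vec<&'a Buffer>,
}

impl<'a> Pass<'a> {
    fn set_index_buffer(&mut self, buffer: &'a Buffer) {
        self.used.push(buffer);
    }
}

// The real RenderPass records the pass when dropped, so the borrows it
// holds must stay valid until the end of scope.
impl<'a> Drop for Pass<'a> {
    fn drop(&mut self) {}
}

fn record(buffers: &mut Vec<Buffer>) {
    let mut pass = Pass { used: Vec::new() };
    buffers.push(Buffer);
    // This shared borrow of `*buffers` must live as long as `pass`...
    pass.set_index_buffer(buffers.last().unwrap());
    // ...so this second, mutable borrow is rejected:
    buffers.push(Buffer); // error[E0502]: cannot borrow `*buffers` as mutable
}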
Judging by gfx-rs/wgpu-rs#155 and gfx-rs/wgpu-rs#168, I don't imagine it should affect us a great deal - most of the abstractions in nannou take a &mut CommandEncoder and encode the whole render pass within the scope of a single function, e.g.:

- UI render pass: https://github.com/nannou-org/nannou/blob/master/src/ui.rs#L451
- Draw API render pass: https://github.com/nannou-org/nannou/blob/master/src/draw/backend/wgpu/mod.rs#L179
- TextureReshaper render pass: https://github.com/nannou-org/nannou/blob/master/src/wgpu/texture/reshaper/mod.rs#L117

These are just a few examples - generally, all of these are submitted on a single CommandBuffer once per frame. Most nannou users might not use all these abstractions in a single sketch/app, though.
Anyway, I hope I'm not speaking too soon as I haven't tried updating yet. There are some other things I'd like to address in nannou first, but I'll report back once I get around to it.
Most of the dependent resources are meant to outlive the pass anyway. Only a few, like the index buffers you create dynamically, become problematic.
I think the issue might be a bit more severe. I put the buffers into a Vec that would persist between calls, with the intent of clearing it right before the next recording. As soon as you reference a buffer in the Vec (or wherever) from set_index_buffer, the Vec's lifetime becomes "linked" to the RenderPass<'render> lifetime in a mutable borrow. This prevents accessing it again.
Yes. So the good news is - this is not the best pattern to follow as a use case: creating buffers as you are recording a pass. Would it be possible for you to refactor the code in a way that first figures out how much space is needed for, say, all indices in a pass, creates a single buffer, and then uses it throughout the pass?
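A minimal sketch of that refactor, assuming the wgpu-rs ~0.4-era upload API (create_buffer_mapped(...).fill_from_slice(...); adapt to whatever your version provides): gather all index data first, upload once, and let the pass borrow the single long-lived buffer.

// Sketch only: gather indices for the whole pass up front, upload once.
fn build_index_buffer(device: &wgpu::Device, meshes: &[Vec<u16>]) -> (wgpu::Buffer, Vec<u64>) {
    let mut indices: Vec<u16> = Vec::new();
    let mut offsets = Vec::new(); // byte offset of each mesh's index range
    for mesh in meshes {
        offsets.push(indices.len() as u64 * 2); // 2 bytes per u16 index
        indices.extend_from_slice(mesh);
    }
    let buffer = device
        .create_buffer_mapped(indices.len(), wgpu::BufferUsage::INDEX)
        .fill_from_slice(&indices);
    (buffer, offsets)
}

// During recording, the pass then only ever borrows this one buffer:
//     rpass.set_index_buffer(&index_buffer, offsets[i]);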
I also hit this issue in the game engine I've been building. It has a relatively complex "Render Graph"-style API, and I spent a solid ~15 hours over the last week refactoring everything to account for the new lifetime requirements. In general, I solved the problem the same way @kvark outlined, although I really wish I had found this thread before coming to the same conclusion via trial and error :smile:
My final solution was:
My takeaways from this (honestly very painful) experience:
That being said, I understand why these lifetime changes were made and I think the wgpu-rs team made the right call here. I certainly don't want this to read as a complaint. I deeply appreciate the work @kvark and the wgpu team have done here. I just want to add my experience as a data point.
Thank you for the feedback, @cart!
Just wanted to add that this is all being evaluated. We aren't completely sure these lifetimes are a good idea. It's certainly the easiest for wgpu to work with, but I totally agree that it can cause headaches for the users... and it does.
The good thing here is that wgpu-rs is just an idiomatic Rust wrapper around wgpu, which is a C API and doesn't have explicit lifetimes (although the same lifetimes are required implicitly). So what we could do is have other pass variants, e.g. ArcRenderPass and ArcComputePass, which would work similarly but receive Arc<> in their parameters and store the references inside, e.g.:
struct ArcRenderPass<'a> {
    id: wgc::id::RenderPassId,
    _parent: &'a mut CommandEncoder,
    // Keeps every buffer used by the pass alive until recording ends.
    used_buffers: Vec<Arc<Buffer>>,
}

impl ArcRenderPass<'_> {
    fn set_vertex_buffer(&mut self, slot: u32, buffer: &Arc<Buffer>, offset: BufferOffset) {
        // Clone the Arc so the caller's handle may be dropped mid-recording.
        self.used_buffers.push(Arc::clone(buffer));
        unsafe {
            wgn::wgpu_render_pass_set_vertex_buffer(
                self.id.as_mut().unwrap(),
                slot,
                buffer.id,
                offset,
            )
        };
    }
}
These passes could be used interchangeably with the current ones, and would trade the lifetime restrictions for a bit of run-time overhead from the Arc. We could go further and try to encapsulate the thing that keeps track of the resources, which you can only append to. There are a lot of ways to be fancy and lazy here :)
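Usage of such a hypothetical pass variant might look like this (names taken from the sketch above, not from any released API; create_transient_buffer is illustrative):

// Hypothetical: the pass clones the Arc, so the local handle can be dropped mid-recording.
let buffer = Arc::new(create_transient_buffer(&device));
arc_pass.set_vertex_buffer(0, &buffer, 0);
drop(buffer); // fine: `used_buffers` inside the pass keeps a clone until recording ends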
Ooh I think I like the "multiple pass variants" idea because it gives people the choice of "cognitive load vs runtime cost". The downsides I can see are:
On the other hand, the "zero cost abstraction" we have currently feels more in line with the Rust mindset, and I'm sure many people would prefer it. I'm also in the weird position where I'm over the "migration hump" and now I really want a zero-cost abstraction. It's hard for me to be objective here :smile:
I think this could be solved with either:
If I had to pick one today, I think I would go for (2). Rather than complicating the API surface / being forced to support that forever and document it clearly, just see if additional docs and examples for the "zero cost lifetimes" solve the problem well enough for users. If this continues to be a problem, you can always add the variant(s). Removing APIs is harder on users than adding features, so I think it makes sense to bias toward a smaller API.
The other interesting aspect is that in this use case, we don't care about the Buffer's lifetime in terms of its memory location. We only care that the Drop impl does not run. The difference is subtle, but it opens up some other possibilities. For example, we could relax the lifetime constraints and then alter the Drop impl to send the ID to a deferred deletion list rather than deleting immediately.
While on the topic of lifetimes and safety, what happens if a Device is dropped before a Queue, Buffer, BindGroup, etc.? Are there guards in WGPU core/native that protect against this? If so, does it make sense to have guards in that case, but not in the case of a buffer being dropped before the pass has finished recording?
For example, we could relax the lifetime constraints and then alter the Drop impl to send the ID to a deferred deletion list rather than delete immediately.
Yep, we could do something like that as well. It would also involve a different signature for render pass functions though (since you'd be lifting the lifetime restriction we have today).
While on the topic of lifetimes and safety, what happens if a Device is dropped before a Queue, Buffer, BindGroup, etc?
Generally, we have all the objects refcounted, and you don't lose the device just because you drop it. The only real exception is render/compute pass recording, where we only want to work with IDs and not go into the objects themselves (until recording is finished) to bump the refcounts.
Yep, we could do something like that as well. It would also involve a different signature for render pass functions though (since you'd be lifting the lifetime restriction we have today).
This would appear the simplest option to me. It can probably even be done without breaking changes:
pub enum BufferOwnedOrRef<'a> {
    // The pass takes ownership and keeps the buffer alive until recording ends.
    Owned(Buffer),
    // The caller keeps ownership; the usual lifetime rules apply.
    Ref(&'a Buffer),
}

impl<'a> From<Buffer> for BufferOwnedOrRef<'a> {
    fn from(b: Buffer) -> Self {
        BufferOwnedOrRef::Owned(b)
    }
}

impl<'a> From<&'a Buffer> for BufferOwnedOrRef<'a> {
    fn from(b: &'a Buffer) -> Self {
        BufferOwnedOrRef::Ref(b)
    }
}

// Proposed method signature (body omitted): existing `&Buffer` call sites
// keep compiling via the `From` impls above.
pub fn set_vertex_buffer<'a, B: Into<BufferOwnedOrRef<'a>>>(
    &mut self,
    slot: u32,
    buffer: B,
    offset: BufferAddress,
    size: BufferAddress,
)
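Call sites could then choose, per argument, whether the pass takes ownership (a sketch against the enum above; long_lived and make_transient_buffer are illustrative):

// Borrowed: zero-cost, same lifetime rules as today.
rpass.set_vertex_buffer(0, &long_lived, 0, 0);
// Owned: the pass stores the buffer in the `Owned` variant and keeps it alive.
rpass.set_vertex_buffer(1, make_transient_buffer(&device), 0, 0);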
@dhardy yes, we could. I hesitate, however, because I see value in not promoting the code path where the user creates resources in the middle of a render pass. It's an anti-pattern. The only reason this path could be appealing today is that updating GPU data is hard.
Here is what needs to happen (ideally) when you are creating a new vertex buffer with data:
Now, imagine you already have a buffer that is big enough(!). That would spare you (3), but otherwise the same steps apply. Therefore, there is no reason for us to make it easy to create new buffers, even if you are replacing all the contents of something. It's always more efficient to use an existing one.
The only caveat is - what if you need a bigger buffer? Let's see if this becomes a blocker.
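One way to avoid that blocker without allocating mid-pass (a sketch; the descriptor fields are approximate for the wgpu version discussed here) is to check the required size before recording starts and grow the buffer with amortized doubling, so the pass itself only ever borrows a long-lived buffer:

// Sketch: grow before recording, never during a pass.
struct GrowableBuffer {
    buffer: wgpu::Buffer,
    capacity: u64, // in bytes
}

impl GrowableBuffer {
    fn ensure_capacity(&mut self, device: &wgpu::Device, needed: u64) {
        if needed > self.capacity {
            let new_capacity = needed.next_power_of_two(); // amortize reallocations
            self.buffer = device.create_buffer(&wgpu::BufferDescriptor {
                size: new_capacity,
                usage: wgpu::BufferUsage::INDEX | wgpu::BufferUsage::COPY_DST,
            });
            self.capacity = new_capacity;
        }
    }
}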
For the data uploads, the group is still talking about the ways to do it. Hopefully, soon...
FYI, you can emulate the ArcRenderPass API using arenas in user space, and it should basically be just as efficient as the equivalent WGPU API (unless core implementation details change a lot to increment internal reference counts before the RenderPass is dropped).
struct ArcRenderPass<'a> {
    // `TypedArena` is e.g. the typed-arena crate's arena: allocations live as
    // long as the arena itself, so references handed to the pass stay valid.
    arena: &'a TypedArena<Arc<Buffer>>,
    render_pass: RenderPass<'a>,
}

impl<'a> ArcRenderPass<'a> {
    fn set_vertex_buffer(&mut self, slot: u32, buffer: Arc<Buffer>, offset: BufferOffset) {
        // Move the Arc into the arena; the arena outlives the pass, so this
        // borrow satisfies the pass's lifetime bound.
        let buffer = self.arena.alloc(buffer);
        self.render_pass.set_vertex_buffer(slot, buffer, offset);
    }
}

fn blah<'a>(encoder: &'a mut CommandEncoder) {
    let arena = TypedArena::new();
    let mut arc_render_pass = ArcRenderPass {
        arena: &arena,
        render_pass: encoder.begin_render_pass(..),
    };
    // ... Do stuff; you can pass around &mut ArcRenderPass and call set_vertex_buffer on owned `Arc`s.
}
@pythonesque it would be wonderful if we had that used by one of the examples. Would you mind doing a PR for this? We'd then be able to point users to working code instead of this snippet.
Just to provide another data point, I hit this issue as well.
Consider that I want my users to be able to simply call an API to render high-level objects without worrying about details of which buffers to use. There are two options:

1) Define a buffer size upfront. If it is exceeded at any point while issuing a high-level render call, internally end the current render pass, then set up the same render pass again so we can reuse the same buffer.
2) Instead of ending the render pass, create buffers dynamically within the same render pass to avoid setting up identical state.

I'm not sure which is more performant. With option (1), it seems good, but we are blocked until the GPU has finished its job, effectively losing parallelism (unless I force the user to go fully async and/or use double-buffering). With option (2), we're infinitely buffered but pay the cost of allocations.

Ultimately, with the current lifetime constraints, option (2) is not possible, so we're forced to go with option (1).
As a side point, it is a little clunky using the current buffer mapping API to go with option (1). I referred to gfx-rs/wgpu-rs#9 and saw this advice from @kvark:
That's why the current rough and effective way to update data is to go through create_buffer_mapped.
which seemed to contradict that approach at its core.
@DefinitelyNotRobot I don't think I understand your thoughts clearly. For example, this part seems to be unrelated to the issue at hand:
it seems good but we are blocked until the GPU has finished its job, effectively losing parallelism
Also, this part:
As a side point, it is a little clunky using the current buffer mapping API to go with option (1).
This issue gfx-rs/wgpu-rs#9 is actually no longer a problem. The upstream WebGPU API went in this direction, and it's a part of wgpu-0.6.
Did you consider using the TypedArena<Arc<wgpu::Buffer>> approach that https://github.com/gfx-rs/wgpu-rs/issues/188#issuecomment-631143941 suggests?
I don't think I understand your thoughts clearly. For example, this part seems to be unrelated to the issue at hand:
Sorry about that. I was trying to illustrate the two designs I could go with for my API, plus their pros/cons, and meant to say that option (2) was not even considerable because of RenderPass's lifetime constraints.
Did you consider using the TypedArena<Arc<wgpu::Buffer>> approach that gfx-rs/wgpu-rs#188 (comment) suggests?
I was hesitant because that would mean an allocation for a TypedArena on every render call, but on second thought, it seems I could hoist it out and store it in my renderer object instead. I'll give that a try.
What's the status of this? I think it would be very useful. I'm writing a renderer that would allow users to create custom pipelines, and they'd be provided with the state of the renderer through a trait. The issue is that I'm getting massive lifetime errors caused by the fact that I'm using a lot of RwLocks in my renderer state, meaning that within that trait, I cannot give the RenderPass data from within a RwLock guard, as that data gets dropped at the end of the trait method. If I had a way to instead pass in Arcs of buffers, this would clear up all of my issues.
At this point, any improvement here should be blocked on @pythonesque's work on Arc-anizing the internals.
This is now blocked on #2710.
This is still pretty painful. The interactions with this lifetime mean that any object which embeds wgpu resources becomes untouchable the instant you record one of those resources into a render pass. And since we can't clone wgpu resource IDs in Rust today, there's no way to work around this with temporary handles on the application side.
For a non-trivial application, I'm ending up having to wrap all the wgpu resource types in wrapper objects, define new handles for them, and pass those around in my code...
This is still pretty painful. The interactions with this lifetime mean that any object which embeds wgpu resources becomes untouchable the instant you record one of those resources into a render pass. And since we can't clone wgpu resource IDs in Rust today, there's no way to work around this with temporary handles on the application side.
For a non-trivial application, I'm ending up having to wrap all the wgpu resource types in wrapper objects, define new handles for them, and pass those around in my code...
Adding onto this, I had to write a custom un-typed arena that would allocate Arc
This is the reason I gave up on wgpu.
@griffi-gh I was working on a wgpu encapsulation layer to work around these issues, which may be useful as a reference. Although this lifetime requirement, imposed for performance and safety reasons, places heavy constraints on render pass encoding, I still think wgpu is a super good graphics abstraction layer implementation.
https://github.com/mikialex/rendiation/tree/master/platform/graphics/webgpu
Y'all will be happy to know that this requirement can be removed once #3626 lands.
Now that https://github.com/gfx-rs/wgpu/pull/3626 has been merged, what's the next step for this issue? I might be interested in working on this.
The main task is to refactor RenderPass to try and get rid of that 'a lifetime bound on the parameters to set_pipeline, set_bind_group, set_vertex_buffer, and set_index_buffer - presumably by copying the (now reference-counted) handles that are passed to the methods. That'll make interacting with the API significantly simpler.
Good news! We've now landed almost everything needed to remove those lifetimes on ComputePass. The "only thing" left is to apply everything learned to RenderPass, which has a much bigger API surface. Also, so far I haven't been able to find any perf regressions around this either!
Related PRs:
https://github.com/gfx-rs/wgpu-rs/pull/155
https://github.com/gfx-rs/wgpu-rs/pull/168
For context, I was looking at updating imgui-wgpu to work with the current master.

Now that set_index_buffer (and similar methods) take a &'a Buffer instead of a &Buffer, it has a ripple effect of "infecting" the surrounding scope with this lifetime. While attempting to iterate and add draw calls, you end up with either "multiple mutable borrow" errors on wherever you're storing the Buffer, or similar lifetime errors like "data from self flows into rpass here".

I would think that calling set_index_buffer (and similar methods) would increment the ref-count on the buffer so that this lifetime restriction isn't needed.