This is indeed a feature that makes it more difficult to use. What we gain from it is very lightweight render pass recording. Explanation follows.
wgpu is designed to fully support multi-threading. We do this by having whole storages of different objects locked upon access, so generally, touching anything has a CPU cost. If we had to access each of the refcounts of the resources used by render commands, we'd be paying that cost during render pass recording, which is hot code. With the recent changes, recording a pass is very lightweight, with no locks involved.
Most of the dependent resources are meant to outlive the pass anyway. Only a few, like the index buffers you create dynamically, become problematic. Generally, unless you are creating many resources during recording, it's easy to work around. If you are doing that, you aren't on a good path performance-wise, and should consider creating one big buffer per frame instead.
Another way to possibly address this is to have wgpu-rs ensure the lifetimes of the objects by other means, like keeping a refcount. (Note: I'm talking about wgpu-rs here, not wgpu, which already keeps a refcount, but we'd need to lock an object storage to access it.) This is far-fetched and not something I'm excited about :)
Heads-up to @mitchmindtree, who is on 0.4 and will face this issue soon. It would be good to know how much this would affect their case.
Most of the dependent resources are meant to outlive the pass anyway. Only a few, like the index buffers you create dynamically, become problematic.
I think the issue might be a bit more severe. I put the buffers into a Vec that would persist between calls, with the intent of clearing it right before the next recording. As soon as you reference a buffer in the Vec (or wherever) from set_index_buffer, the Vec's lifetime becomes "linked" to the RenderPass<'render> lifetime in a mutable borrow. This prevents accessing it again.
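To make the borrow conflict concrete, here is a self-contained model of that pattern (plain Rust, no real wgpu types; Pass<'a> stands in for RenderPass<'a>). It deliberately fails to compile:

struct Buffer;

// `Pass<'a>` models RenderPass<'a>: it holds borrows of every resource it uses.
struct Pass<'a> {
    used: Vec<&'a Buffer>,
}

impl<'a> Pass<'a> {
    fn set_index_buffer(&mut self, buffer: &'a Buffer) {
        self.used.push(buffer);
    }
}

// The real RenderPass records the pass when dropped, so the borrows it
// holds must stay valid until the end of scope.
impl<'a> Drop for Pass<'a> {
    fn drop(&mut self) {}
}

fn record(buffers: &mut Vec<Buffer>) {
    let mut pass = Pass { used: Vec::new() };
    buffers.push(Buffer);
    // This shared borrow of `*buffers` must live as long as `pass`...
    pass.set_index_buffer(buffers.last().unwrap());
    // ...so this second, mutable borrow is rejected:
    buffers.push(Buffer); // error[E0502]: cannot borrow `*buffers` as mutable
}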
Judging by gfx-rs/wgpu-rs#155 and gfx-rs/wgpu-rs#168, I don't imagine it should affect us a great deal - most of the abstractions in nannou take a &mut CommandEncoder and encode the whole render pass within the scope of a single function, e.g.:

- UI render pass: https://github.com/nannou-org/nannou/blob/master/src/ui.rs#L451
- Draw API render pass: https://github.com/nannou-org/nannou/blob/master/src/draw/backend/wgpu/mod.rs#L179
- TextureReshaper render pass: https://github.com/nannou-org/nannou/blob/master/src/wgpu/texture/reshaper/mod.rs#L117

These are just a few examples - generally, all of these are submitted on a single CommandBuffer once per frame. Most nannou users might not use all these abstractions in a single sketch/app, though.
Anyway, I hope I'm not speaking too soon as I haven't tried updating yet. There are some other things I'd like to address in nannou first, but I'll report back once I get around to it.
Most of the dependent resources are meant to outlive the pass anyway. Only a few, like the index buffers you create dynamically, become problematic.
I think the issue might be a bit more severe. I put the buffers into a Vec that would persist between calls, with the intent of clearing it right before the next recording. As soon as you reference a buffer in the Vec (or wherever) from set_index_buffer, the Vec's lifetime becomes "linked" to the RenderPass<'render> lifetime in a mutable borrow. This prevents accessing it again.
Yes. So the good news is - this is not the best pattern to follow as a use case: creating buffers as you are recording a pass. Would it be possible for you to refactor the code in a way that first figures out how much space is needed for, say, all indices in a pass, creates a single buffer, and then uses it throughout the pass?
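A minimal sketch of that refactor, assuming the wgpu-rs ~0.4-era upload API (create_buffer_mapped(...).fill_from_slice(...); adapt to whatever your version provides): gather all index data first, upload once, and let the pass borrow the single long-lived buffer.

// Sketch only: gather indices for the whole pass up front, upload once.
fn build_index_buffer(device: &wgpu::Device, meshes: &[Vec<u16>]) -> (wgpu::Buffer, Vec<u64>) {
    let mut indices: Vec<u16> = Vec::new();
    let mut offsets = Vec::new(); // byte offset of each mesh's index range
    for mesh in meshes {
        offsets.push(indices.len() as u64 * 2); // 2 bytes per u16 index
        indices.extend_from_slice(mesh);
    }
    let buffer = device
        .create_buffer_mapped(indices.len(), wgpu::BufferUsage::INDEX)
        .fill_from_slice(&indices);
    (buffer, offsets)
}

// During recording, the pass then only ever borrows this one buffer:
//     rpass.set_index_buffer(&index_buffer, offsets[i]);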
I also hit this issue in the game engine I've been building. It has a relatively complex "Render Graph"-style API, and I spent a solid ~15 hours over the last week refactoring everything to account for the new lifetime requirements. In general, I solved the problem the same way @kvark outlined, although I really wish I had found this thread before coming to the same conclusion via trial and error :smile:
My final solution was:
My takeaways from this (honestly very painful) experience:
That being said, I understand why these lifetime changes were made and I think the wgpu-rs team made the right call here. I certainly don't want this to read as a complaint. I deeply appreciate the work @kvark and the wgpu team have done here. I just want to add my experience as a data point.
Thank you for the feedback, @cart!
Just wanted to add that this is all being evaluated. We aren't completely sure these lifetimes are a good idea. It's certainly the easiest for wgpu to work with, but I totally agree that it can cause headaches for the users... and it does.
The good thing here is that wgpu-rs is just an idiomatic Rust wrapper around wgpu, which is a C API and doesn't have explicit lifetimes (although the same lifetimes are required implicitly). So what we could do is have other pass variants, e.g. ArcRenderPass and ArcComputePass, which would work similarly but receive Arc<> in their parameters and store the references inside, e.g.:
struct ArcRenderPass<'a> {
    id: wgc::id::RenderPassId,
    _parent: &'a mut CommandEncoder,
    // Keeps every buffer used by the pass alive until recording ends.
    used_buffers: Vec<Arc<Buffer>>,
}

impl ArcRenderPass<'_> {
    fn set_vertex_buffer(&mut self, slot: u32, buffer: &Arc<Buffer>, offset: BufferOffset) {
        // Clone the Arc so the caller's handle may be dropped mid-recording.
        self.used_buffers.push(Arc::clone(buffer));
        unsafe {
            wgn::wgpu_render_pass_set_vertex_buffer(
                self.id.as_mut().unwrap(),
                slot,
                buffer.id,
                offset,
            )
        };
    }
}
These passes could be used interchangeably with the current ones, and would trade the lifetime restrictions for a bit of run-time overhead from the Arc. We could go further and try to encapsulate the thing that keeps track of the resources, which you can only append to. There are a lot of ways to be fancy and lazy here :)
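Usage of such a hypothetical pass variant might look like this (names taken from the sketch above, not from any released API; create_transient_buffer is illustrative):

// Hypothetical: the pass clones the Arc, so the local handle can be dropped mid-recording.
let buffer = Arc::new(create_transient_buffer(&device));
arc_pass.set_vertex_buffer(0, &buffer, 0);
drop(buffer); // fine: `used_buffers` inside the pass keeps a clone until recording ends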
Ooh I think I like the "multiple pass variants" idea because it gives people the choice of "cognitive load vs runtime cost". The downsides I can see are:
On the other hand, the "zero cost abstraction" we have currently feels more in line with the Rust mindset, and I'm sure many people would prefer it. I'm also in the weird position where I'm over the "migration hump" and now I really want a zero-cost abstraction. It's hard for me to be objective here :smile:
I think this could be solved with either:
If I had to pick one today, I think I would go for (2). Rather than complicating the API surface / being forced to support that forever and document it clearly, just see if additional docs and examples for the "zero cost lifetimes" solve the problem well enough for users. If this continues to be a problem, you can always add the variant(s). Removing APIs is harder on users than adding features, so I think it makes sense to bias toward a smaller API.
The other interesting aspect is that in this use case, we don't care about the Buffer's lifetime in terms of its memory location. We only care that the Drop impl does not run. The difference is subtle, but it opens up some other possibilities. For example, we could relax the lifetime constraints and then alter the Drop impl to send the ID to a deferred deletion list rather than deleting immediately.
While on the topic of lifetimes and safety, what happens if a Device is dropped before a Queue, Buffer, BindGroup, etc.? Are there guards in WGPU core/native that protect against this? If so, does it make sense to have guards in that case, but not in the case of a buffer being dropped before the pass has finished recording?
For example, we could relax the lifetime constraints and then alter the Drop impl to send the ID to a deferred deletion list rather than delete immediately.
Yep, we could do something like that as well. It would also involve a different signature for render pass functions though (since you'd be lifting the lifetime restriction we have today).
While on the topic of lifetimes and safety, what happens if a Device is dropped before a Queue, Buffer, BindGroup, etc?
Generally, we have all the objects refcounted, and you don't lose the device just because you drop it. The only real exception is render/compute pass recording, where we only want to work with IDs and not go into the objects themselves (until recording is finished) to bump the refcounts.
Yep, we could do something like that as well. It would also involve a different signature for render pass functions though (since you'd be lifting the lifetime restriction we have today).
This would appear the simplest option to me. It can probably even be done without breaking changes:
pub enum BufferOwnedOrRef<'a> {
    // The pass takes ownership and keeps the buffer alive until recording ends.
    Owned(Buffer),
    // The caller keeps ownership; the usual lifetime rules apply.
    Ref(&'a Buffer),
}

impl<'a> From<Buffer> for BufferOwnedOrRef<'a> {
    fn from(b: Buffer) -> Self {
        BufferOwnedOrRef::Owned(b)
    }
}

impl<'a> From<&'a Buffer> for BufferOwnedOrRef<'a> {
    fn from(b: &'a Buffer) -> Self {
        BufferOwnedOrRef::Ref(b)
    }
}

// Proposed method signature (body omitted): existing `&Buffer` call sites
// keep compiling via the `From` impls above.
pub fn set_vertex_buffer<'a, B: Into<BufferOwnedOrRef<'a>>>(
    &mut self,
    slot: u32,
    buffer: B,
    offset: BufferAddress,
    size: BufferAddress,
)
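Call sites could then choose, per argument, whether the pass takes ownership (a sketch against the enum above; long_lived and make_transient_buffer are illustrative):

// Borrowed: zero-cost, same lifetime rules as today.
rpass.set_vertex_buffer(0, &long_lived, 0, 0);
// Owned: the pass stores the buffer in the `Owned` variant and keeps it alive.
rpass.set_vertex_buffer(1, make_transient_buffer(&device), 0, 0);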
@dhardy yes, we could. I hesitate, however, because I see value in not promoting the code path where the user creates resources in the middle of a render pass. It's an anti-pattern. The only reason this path could be appealing today is that updating GPU data is hard.
Here is what needs to happen (ideally) when you are creating a new vertex buffer with data:
Now, imagine you already have a buffer that is big enough(!). That would spare you (3), but otherwise the same steps apply. Therefore, there is no reason for us to make it easy to create new buffers, even if you are replacing all the contents of something. It's always more efficient to use an existing one.
The only caveat is - what if you need a bigger buffer? Let's see if this becomes a blocker.
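One way to avoid that blocker without allocating mid-pass (a sketch; the descriptor fields are approximate for the wgpu version discussed here) is to check the required size before recording starts and grow the buffer with amortized doubling, so the pass itself only ever borrows a long-lived buffer:

// Sketch: grow before recording, never during a pass.
struct GrowableBuffer {
    buffer: wgpu::Buffer,
    capacity: u64, // in bytes
}

impl GrowableBuffer {
    fn ensure_capacity(&mut self, device: &wgpu::Device, needed: u64) {
        if needed > self.capacity {
            let new_capacity = needed.next_power_of_two(); // amortize reallocations
            self.buffer = device.create_buffer(&wgpu::BufferDescriptor {
                size: new_capacity,
                usage: wgpu::BufferUsage::INDEX | wgpu::BufferUsage::COPY_DST,
            });
            self.capacity = new_capacity;
        }
    }
}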
For the data uploads, the group is still talking about the ways to do it. Hopefully, soon...
FYI, you can emulate the ArcRenderPass API using arenas in user space, and it should basically be just as efficient as the equivalent WGPU API (unless core implementation details change a lot to increment internal reference counts before the RenderPass is dropped).
struct ArcRenderPass<'a> {
    // `TypedArena` is e.g. the typed-arena crate's arena: allocations live as
    // long as the arena itself, so references handed to the pass stay valid.
    arena: &'a TypedArena<Arc<Buffer>>,
    render_pass: RenderPass<'a>,
}

impl<'a> ArcRenderPass<'a> {
    fn set_vertex_buffer(&mut self, slot: u32, buffer: Arc<Buffer>, offset: BufferOffset) {
        // Move the Arc into the arena; the arena outlives the pass, so this
        // borrow satisfies the pass's lifetime bound.
        let buffer = self.arena.alloc(buffer);
        self.render_pass.set_vertex_buffer(slot, buffer, offset);
    }
}

fn blah<'a>(encoder: &'a mut CommandEncoder) {
    let arena = TypedArena::new();
    let mut arc_render_pass = ArcRenderPass {
        arena: &arena,
        render_pass: encoder.begin_render_pass(..),
    };
    // ... Do stuff; you can pass around &mut ArcRenderPass and call set_vertex_buffer on owned `Arc`s.
}
@pythonesque it would be wonderful if we had that used by one of the examples. Would you mind doing a PR for this? We'd then be able to point users to working code instead of this snippet.
Just to provide another data point, I hit this issue as well.
Consider that I want my users to be able to simply call an API to render high-level objects without worrying about details of which buffers to use. There are two options:

1) Define a buffer size upfront. If it is exceeded at any point while issuing a high-level render call, internally end the current render pass, then set up the same render pass again so we can reuse the same buffer.
2) Instead of ending the render pass, create buffers dynamically within the same render pass to avoid setting up identical state.

I'm not sure which is more performant. With option (1), it seems good, but we are blocked until the GPU has finished its job, effectively losing parallelism (unless I force the user to go fully async and/or use double-buffering). With option (2), we're infinitely buffered but pay the cost of allocations.

Ultimately, with the current lifetime constraints, option (2) is not possible, so we're forced to go with option (1).
As a side point, it is a little clunky using the current buffer mapping API to go with option (1). I referred to gfx-rs/wgpu-rs#9 and saw this advice from @kvark:
That's why the current rough and effective way to update data is to go through create_buffer_mapped.
which seemed to contradict that approach at its core.
@DefinitelyNotRobot I don't think I understand your thoughts clearly. For example, this part seems to be unrelated to the issue at hand:
it seems good but we are blocked until the GPU has finished its job, effectively losing parallelism
Also, this part:
As a side point, it is a little clunky using the current buffer mapping API to go with option (1).
This issue gfx-rs/wgpu-rs#9 is actually no longer a problem. The upstream WebGPU API went in this direction, and it's a part of wgpu-0.6.
Did you consider using the TypedArena<Arc<wgpu::Buffer>> approach that https://github.com/gfx-rs/wgpu-rs/issues/188#issuecomment-631143941 suggests?
I don't think I understand your thoughts clearly. For example, this part seems to be unrelated to the issue at hand:
Sorry about that. I was trying to illustrate the two designs I could go with for my API, plus their pros/cons, and meant to say that option (2) was not even considerable because of RenderPass's lifetime constraints.
Did you consider using the TypedArena<Arc<wgpu::Buffer>> approach that gfx-rs/wgpu-rs#188 (comment) suggests?
I was hesitant because that would mean an allocation for a TypedArena on every render call, but on second thought, it seems I could hoist it out and store it in my renderer object instead. I'll give that a try.
What's the status of this? I think it would be very useful. I'm writing a renderer that would allow users to create custom pipelines, and they'd be provided with the state of the renderer through a trait. The issue is that I'm getting massive lifetime errors caused by the fact that I'm using a lot of RwLocks in my renderer state, meaning that within that trait, I cannot give the RenderPass data from within a RwLock guard, as that data gets dropped at the end of the trait method. If I had a way to instead pass in Arcs of buffers, this would clear up all of my issues.
At this point, any improvement here should be blocked on @pythonesque's work on Arc-anizing the internals.
This is now blocked on #2710.
This is still pretty painful. The interactions with this lifetime mean that any object which embeds wgpu resources becomes untouchable the instant you record one of those resources into a render pass. And since we can't clone wgpu resource IDs in Rust today, there's no way to work around this with temporary handles on the application side.
For a non-trivial application, I'm ending up having to wrap all the wgpu resource types in wrapper objects, define new handles for them, and pass those around in my code...
This is still pretty painful. The interactions with this lifetime mean that any object which embeds wgpu resources becomes untouchable the instant you record one of those resources into a render pass. And since we can't clone wgpu resource IDs in Rust today, there's no way to work around this with temporary handles on the application side.
For a non-trivial application, I'm ending up having to wrap all the wgpu resource types in wrapper objects, define new handles for them, and pass those around in my code...
Adding onto this, I had to write a custom un-typed arena that would allocate Arc
This is the reason I gave up on wgpu.
@griffi-gh I was working on a wgpu encapsulation layer to work around these issues, which may be useful as a reference. Although this lifetime requirement, imposed for performance and safety reasons, places heavy constraints on render pass encoding, I still think wgpu is a super good graphics abstraction layer implementation.
https://github.com/mikialex/rendiation/tree/master/platform/graphics/webgpu
Y'all will be happy to know that this requirement can be removed once #3626 lands.
Now that https://github.com/gfx-rs/wgpu/pull/3626 has been merged, what's the next step for this issue? I might be interested in working on this.
The main task is to refactor RenderPass to try and get rid of that 'a lifetime bound on the parameters to set_pipeline, set_bind_group, set_vertex_buffer, and set_index_buffer - presumably by copying the (now reference-counted) handles that are passed to the methods. That'll make interacting with the API significantly simpler.
Good news! We've now landed almost everything needed to remove those lifetimes on ComputePass. The "only thing" left is to apply everything learned to RenderPass, which has a much bigger API surface. Also, so far I haven't been able to find any perf regressions around this either!
Related PRs:
https://github.com/gfx-rs/wgpu-rs/pull/155
https://github.com/gfx-rs/wgpu-rs/pull/168
For context, I was looking at updating imgui-wgpu to work with the current master.

Now that set_index_buffer (and similar methods) take a &'a Buffer instead of a &Buffer, it has a ripple effect of "infecting" the surrounding scope with this lifetime. While attempting to iterate and add draw calls, you end up with either "multiple mutable borrow" errors on wherever you're storing the Buffer, or similar lifetime errors like "data from self flows into rpass here".

I would think that calling set_index_buffer (and similar methods) would increment the ref-count on the buffer so that this lifetime restriction isn't needed.