nical opened 1 year ago
Regarding the min alignment value, I think 32 is appropriate for the same reason the `min{Uniform,Storage}BufferOffsetAlignment` limits must be at least 32 according to the WebGPU spec.
Is there a strong reason to avoid using the minimum alignment allowed by `mapAsync`? I think this is currently 8.
Right, `mapAsync`'s `offset` alignment is 8; that means that we can't guarantee an alignment of more than 8. It should be ok though, at least it's the size of a 64-bit type.
> Is there a strong reason to avoid using the minimum alignment allowed by `mapAsync`? I think this is currently 8.
The spec requires users to make `offset` a multiple of 8 but does not provide a guarantee about the alignment of the start of the mapped range. I'll ask the question in reverse: is there a strong reason the guarantees wgpu provides on mapped buffer alignment should be the same as the requirement on the `offset` field?
The motivations I mentioned are:
8 bytes is too small IMHO. Things start being pretty safe and useful around 16 and I don't think it's worth ensuring anything larger than 64 bytes.
If it turns out that it can be achieved conveniently on every backend, my suggestion is 64 bytes of alignment in `map_async` and at least 16 bytes in `map_buffer_with`.
> The spec requires users to make `offset` a multiple of 8 but does not provide a guarantee about the alignment of the start of the mapped range. I'll ask the question in reverse: is there a strong reason the guarantees wgpu provides on mapped buffer alignment should be the same as the requirement on the `offset` field?
I was thinking in terms of getting a sub-region of a slice (in Rust terms), where the alignment of the sub-region would have to be at most the offset alignment, but I guess that doesn't apply here since the offset is passed down to the backend to do the copy. Although I think the story changes if we throw unified memory into the mix. I think this needs some more research.
> Is there a strong reason the guarantees wgpu provides on mapped buffer alignment should be the same as the requirement on the `offset` field?
I'd be slightly hesitant to add alignment guarantees that don't currently seem to exist in WebGPU, although maybe there would be interest in adding these guarantees upstream. I completely agree that it would be great to avoid the panic when casting slices, but I'm not sure about how tricky it might be to guarantee extra alignment.
Thinking about it some more, I think the minimum guaranteed alignment on the size right now is 4, and the expectation is that the backing `ArrayBuffer` fails to be viewed as `TypedArray` views if it's not aligned (e.g., a `Float64Array` can't be created out of a mapped `ArrayBuffer` that's 7 bytes). Conceptually this feels really similar to `bytemuck` failing to cast a slice if it can't fit into the mapped memory. Maybe we could provide a convenience function that introduces an extra copy if necessary?
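For illustration, a minimal sketch of what such a helper could look like on the native side (the name `write_possibly_unaligned` is made up, not an existing wgpu API): cast in place when the mapped slice happens to be aligned for `T`, and fall back to copying through the byte representation otherwise.

```rust
// Hypothetical helper, not an existing wgpu API: write `data` into a mapped byte
// slice, tolerating a mapped pointer that isn't aligned for `T`.
fn write_possibly_unaligned<T: bytemuck::Pod>(mapped: &mut [u8], data: &[T]) {
    let bytes_needed = std::mem::size_of_val(data);
    assert!(mapped.len() >= bytes_needed, "mapped range too small");
    if mapped.as_ptr() as usize % std::mem::align_of::<T>() == 0 {
        // Aligned: view the front of the mapped slice as `[T]` and copy elements directly.
        let typed: &mut [T] = bytemuck::cast_slice_mut(&mut mapped[..bytes_needed]);
        typed.copy_from_slice(data);
    } else {
        // Not aligned: copy via the byte representation of `data` instead.
        mapped[..bytes_needed].copy_from_slice(bytemuck::cast_slice(data));
    }
}
```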
> On the web I don't expect this to matter because webgpu buffers are copied to and from typed arrays. The runtime only needs to memcpy bytes around.
The backing `ArrayBuffer` is provided to us and we can't control its size without possibly needing to request an oversized buffer during buffer creation, so I think we run into the same problem here unless I'm misunderstanding.
There could also be some subtle interactions with mapped subregions. I can't remember the resolution on overlapping mapped subregions in WebGPU but there could be a subtle interaction there if we try to guarantee larger alignments vs. the lengths of the subregions (currently 4?).
We might also want to consider how this works with buffers with unaligned lengths (e.g., too small or just not aligned). In places where we don't require temporary allocations for memory mapping, this might mean we need oversized buffers to guarantee that a later aligned map won't fail.
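As a rough sketch of the oversized-buffer idea (`MAP_ALIGNMENT` is a hypothetical target, not an existing wgpu constant; `wgpu::COPY_BUFFER_ALIGNMENT` is the 4-byte size multiple wgpu already enforces for copies):

```rust
// Round a requested buffer size up so that mapping the whole buffer can later be
// cast to elements of the intended alignment without the length check failing.
const MAP_ALIGNMENT: wgpu::BufferAddress = 16; // hypothetical guarantee discussed here

fn padded_buffer_size(unpadded: wgpu::BufferAddress) -> wgpu::BufferAddress {
    let align = MAP_ALIGNMENT.max(wgpu::COPY_BUFFER_ALIGNMENT);
    // `align` is a power of two, so this rounds up to the next multiple of it.
    (unpadded + align - 1) & !(align - 1)
}
```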
> but I'm not sure about how tricky it might be to guarantee extra alignment.
Yes, that's what I would like to figure out. I suspect that we can guarantee comfortable alignment almost everywhere without much implementation effort. At least it would be useful to understand and document what alignment one can rely on (per backend if need be).
I'm interested in the details if you have any in the case of wasm. On the web (and I suspect everywhere else), the runtime creates a typed array provided to mapAsync that is separate from the wasm heap. The generated Rust bindings have to make copies between the wasm heap and the real WebGPU typed array. In this context, if an alignment is to be guaranteed it would come from where the bindings choose to allocate this copy; maybe there's something we can do there.
We can also survey the situation for sub-regions. I expect that there are two situations; documenting a guaranteed alignment of `offset % base_alignment` would cover both and be useful enough for users (for example the simd use case).
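One possible reading of that `offset % base_alignment` idea, just to make the reasoning concrete (assuming power-of-two alignments; this is not wording from any spec):

```rust
/// Largest power-of-two alignment you can rely on for `base + offset` when `base`
/// itself is aligned to `base_alignment` (a power of two).
fn sub_range_alignment(base_alignment: u64, offset: u64) -> u64 {
    if offset == 0 {
        base_alignment
    } else {
        // e.g. base_alignment = 64, offset = 24 -> the sub-range is only 8-byte aligned.
        base_alignment.min(1u64 << offset.trailing_zeros())
    }
}
```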
What would you think about starting with documenting the conservative minimum alignment of 4 bytes? This is the minimum required for the mapping size in WebGPU anyway, so any lower would be an implementation bug.
We could also provide better documentation that recommends creating oversized buffers if alignment with existing types is a concern, or maybe even some kind of helper for unaligned slices that slices the byte slice at the right length (e.g., working nicely with `bytemuck`/`crevice`/`encase`).
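Such a helper could be as small as the following sketch (hypothetical name, not an existing wgpu or bytemuck API): it skips any unaligned prefix and drops trailing bytes that don't form a whole element, so the cast can never panic.

```rust
/// Hypothetical convenience: view as many whole, properly aligned `T`s as fit in a
/// mapped byte slice, ignoring an unaligned prefix and a too-short tail.
fn as_typed_prefix<T: bytemuck::Pod>(bytes: &mut [u8]) -> &mut [T] {
    let size = std::mem::size_of::<T>();
    let align = std::mem::align_of::<T>();
    if size == 0 {
        return &mut []; // zero-sized types have no bytes to view
    }
    // Skip up to `align - 1` bytes so the start is aligned for `T`.
    let misalign = bytes.as_ptr() as usize % align;
    let start = if misalign == 0 { 0 } else { align - misalign };
    if start >= bytes.len() {
        return &mut [];
    }
    // Keep only whole elements.
    let usable = (bytes.len() - start) / size * size;
    bytemuck::cast_slice_mut(&mut bytes[start..start + usable])
}
```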
Later we could reconsider raising the minimum guaranteed alignment if it still turns out to be a significant pain point. At least anecdotally I haven't heard of many people running into this, so I wonder if it's worth extra implementation complexity in each backend (at least at this point).
> The generated Rust bindings have to make copies between the wasm heap and the real WebGPU typed array. In this context, if an alignment is to be guaranteed it would come from where the bindings choose to allocate this copy; maybe there's something we can do there.
We technically already have an extra temporary copy right now so we could pad that, but I think we'll be able to improve this eventually and write into the `Uint8Array` directly (i.e., not writing into the wasm heap).
> What would you think about starting with documenting the conservative minimum alignment of 4 bytes? This is the minimum required for the mapping size in WebGPU anyway, so any lower would be an implementation bug.
Well that's not what I am after but I'm all for someone adding this to the doc while the bigger picture is being figured out.
> Later we could reconsider raising the minimum guaranteed alignment if it still turns out to be a significant pain point.
There is no need to stall. For `map_async`, I want 1) to gather information about the guarantees we get for free with each backend, 2) to document what can be safely relied on and where, and 3) to see if there are improvements that can be made without cost or heroics.
There should be no controversial decision to make for 1 and 2, and we'll see about 3 when we have a better lay of the land.
For `write_buffer_with`, adding some alignment should be pretty easy regardless of the backend. Since there is some debate, maybe we can focus on the `map_async` clarification here.
> At least anecdotally I haven't heard of many people running into this, so I wonder if it's worth extra implementation complexity in each backend (at least at this point).
While researching this, I've seen the simd use case pop up in various places; it was also what prompted this discussion. We can argue about implementation complexity when there is complexity to argue about; I expect that the guarantees are already pretty strong almost everywhere (except wasm maybe).
Most likely this isn't a very common pain point because people unknowingly rely on alignments they don't know they need but get in practice.
Alright that sounds good. I do think we should try to upstream any guarantees into WebGPU and webgpu-headers if we do end up raising the alignment guarantee, but I agree that it makes sense to investigate it here first.
> Most likely this isn't a very common pain point because people unknowingly rely on alignments they don't know they need but get in practice.
Definitely possible, it would probably be caused by SIMD types like you mentioned.
> when a user maps a buffer, it can be tempting to cast the byte slice into a slice of whatever it is they are filling the buffer with and then copy into that typed slice
Isn't this exactly the type of subtle UB that presser was made to protect people from?
> Isn't this exactly the type of subtle UB that presser was made to protect people from?
Presser protects you from more than that (padding within the structures, for example), but that's beside the point. I'm convinced that wgpu already mostly provides comfortable alignment guarantees for mapped buffers, and that bridging the gaps will require no heroics nor cost, so there is no reason for wgpu not to provide the comfort of, say, 16-byte alignment and remove a source of mistakes in the process.
I'm unfortunately unable to spend much time in front of a computer for a few weeks.
Where I'm at regarding this is that I'm 99.9% certain that the alignment we get in D3D12 and Metal is basically the alignment of the GPU buffer itself. I'd like to find spec wording that explicitly confirms that. At the very least, D3D11 at some point explicitly guaranteed 16-byte alignment, not for the needs of the driver but for user convenience, so it is pretty safe to assume they didn't take that away.
That would make it trivial to guarantee 16 bytes on all native backends (assuming ARB_map_buffer_alignment in GL). I haven't had time to look closely at the case of the web.
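On the GL side, ARB_map_buffer_alignment (core since OpenGL 4.2) exposes that guarantee as a queryable value, MIN_MAP_BUFFER_ALIGNMENT, which the extension requires to be at least 64. A sketch of checking it through glow, assuming glow exposes the MIN_MAP_BUFFER_ALIGNMENT constant (not verified here):

```rust
use glow::HasContext;

// Sketch: ask the driver what alignment glMapBufferRange guarantees. Contexts
// without ARB_map_buffer_alignment may not support this query at all.
fn min_map_alignment(gl: &glow::Context) -> i32 {
    // SAFETY: requires a current GL context; fine for a sketch.
    let value = unsafe { gl.get_parameter_i32(glow::MIN_MAP_BUFFER_ALIGNMENT) };
    value.max(1)
}
```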
I had a workaround in place so far to check alignment after mapping, then add padding if necessary and use the buffer with an offset. Noticed today that this can cause spurious failures: I got a 2 (two!!) byte aligned buffer back on WebGL, applied a padding of 2, and then of course got an error when doing a `copy_buffer_to_texture` operation, which, depending on the texture format, requires the offset to be padded to somewhere between 4 and 16.
This is a very unfortunate tying of the "gpu offset" with the "cpu offset". I also think this illustrates that any alignment guarantee other than 16 is a bit nonsensical (a buffer offset that is required to be aligned to 16 makes no sense when the underlying data pointer isn't aligned to begin with).
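Roughly the workaround being described, with made-up names for illustration; the CPU-side padding becomes a GPU-side buffer offset, which is exactly where `copy_buffer_to_texture`'s own alignment requirement then bites:

```rust
// Compute how many bytes to skip so the mapped pointer is aligned for `T`.
fn cpu_padding_for<T>(mapped: &[u8]) -> u64 {
    let align = std::mem::align_of::<T>() as u64;
    let misalign = mapped.as_ptr() as u64 % align;
    if misalign == 0 { 0 } else { align - misalign }
}

// e.g. a 2-byte-aligned mapping of f32 data yields a padding of 2, and reusing that 2
// as the buffer offset of a copy_buffer_to_texture call is invalid for most formats.
```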
FYI I did a bit of investigation on this over in webgpu.h: https://github.com/webgpu-native/webgpu-headers/issues/180#issuecomment-1654333273 I don't think there's anything new that you hadn't already discovered, except we confirmed that D3D12 provides 16-byte alignment.
For wasm, we completely control the memory allocation - we can align it however we see fit. Both manual copy-to-pointer and copy-to-slice exist on Uint8Array, so we can do whatever we want. https://docs.rs/js-sys/latest/js_sys/struct.Uint8Array.html#method.copy_to
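For example (a sketch only, not how wgpu's wasm backend is actually written): back the wasm-side copy with 16-byte-aligned storage and let `Uint8Array::copy_to` do the memcpy out of the mapped JS array.

```rust
/// 16-byte-aligned chunk used as backing storage for the wasm-side copy.
#[repr(C, align(16))]
#[derive(Clone, Copy)]
struct Chunk([u8; 16]);

/// Holds mapped bytes in a 16-byte-aligned wasm-heap allocation.
struct AlignedBytes {
    storage: Vec<Chunk>,
    len: usize,
}

impl AlignedBytes {
    fn from_mapped(mapped: &js_sys::Uint8Array) -> Self {
        let len = mapped.length() as usize;
        let mut storage = vec![Chunk([0; 16]); (len + 15) / 16];
        // SAFETY: Chunk is just 16 plain bytes, so viewing the allocation as bytes is fine.
        let bytes = unsafe {
            std::slice::from_raw_parts_mut(storage.as_mut_ptr().cast::<u8>(), storage.len() * 16)
        };
        // Plain memcpy out of the JS typed array into the aligned wasm-heap allocation.
        mapped.copy_to(&mut bytes[..len]);
        AlignedBytes { storage, len }
    }

    fn bytes(&self) -> &[u8] {
        // SAFETY: same byte-level reinterpretation as above, read-only.
        unsafe { std::slice::from_raw_parts(self.storage.as_ptr().cast::<u8>(), self.len) }
    }
}
```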
Myles@apple says that:
> `MTLBuffer.contents` "is guaranteed to be at least aligned with the largest alignment that any MSL type requires." (Which ends up being 16B.)
TLDR
Can wgpu provide alignment guarantees for the memory mapped via mapAsync and `queue.write_buffer_with`, and if so, what should the minimum alignment be?
The long version
The question came up on matrix: when a user maps a buffer, it can be tempting to cast the byte slice into a slice of whatever it is they are filling the buffer with and then copy into that typed slice. Doing that requires that the mapped slice have the minimum alignment of the type in question. If wgpu were to guarantee a minimum alignment, it would make this pattern easier to get right.
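For concreteness, the pattern in question looks roughly like this (a sketch using real wgpu and bytemuck APIs, with error handling omitted; the buffer is assumed to already be mapped for writing and large enough for the data):

```rust
fn write_typed<T: bytemuck::Pod>(buffer: &wgpu::Buffer, data: &[T]) {
    let mut view = buffer.slice(..).get_mapped_range_mut();
    let bytes_needed = std::mem::size_of_val(data);
    // `cast_slice_mut` panics if the mapped pointer isn't aligned for `T`; that panic is
    // exactly what a documented minimum alignment would rule out.
    let typed: &mut [T] = bytemuck::cast_slice_mut(&mut view[..bytes_needed]);
    typed.copy_from_slice(data);
}
```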
On the web I don't expect this to matter because webgpu buffers are copied to and from typed arrays. The runtime only needs to memcpy bytes around.
I don't see much in the way of documentation or specification about the alignment guarantees of mapped buffers.
In backends that use a memory allocator, like Vulkan, we can pass the alignment as a parameter of the allocation request (example), so it would be very easy to set the minimum alignment to some value.
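A sketch of the idea for the Vulkan case (using ash's `vk::MemoryRequirements`, which wgpu's Vulkan backend already works with; `MIN_MAP_ALIGNMENT` is a hypothetical constant, not an existing wgpu name):

```rust
use ash::vk;

const MIN_MAP_ALIGNMENT: vk::DeviceSize = 64; // hypothetical target discussed below

// Bump the driver-reported alignment before handing the requirements to the allocator.
// Both values are powers of two, so `max` keeps the result a valid Vulkan alignment.
fn padded_requirements(mut reqs: vk::MemoryRequirements) -> vk::MemoryRequirements {
    reqs.alignment = reqs.alignment.max(MIN_MAP_ALIGNMENT);
    reqs
}
```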
In other backends it might be up to the driver. The various specs I looked at are better at documenting alignment requirements that users must abide by when writing data than alignment guarantees that the drivers must uphold when mapping.
Can wgpu provide alignment guarantees in mapAsync (per backend, if need be)?
If we can, what would a good minimum alignment be?
Buffers are not supposed to be small, so I would go as far as giving a whole 64 bytes of alignment. That's quite big, but not that much compared to a typical buffer size, and in some rare but real situations it's nice to know how data fits into L1 cache lines. At a minimum, guaranteeing 16 bytes would allow people to read or write common simd types with peace of mind.
The case of `queue.write_buffer_with`
Since this uses a simple bump allocation, it should be easy to provide alignment guarantees. Should we? How much?
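Something along these lines (a sketch with made-up names, not wgpu's actual staging code): rounding each staging offset up costs at most ALIGN - 1 bytes per allocation.

```rust
const ALIGN: u64 = 16; // hypothetical guarantee for write_buffer_with staging memory

struct StagingBump {
    offset: u64,
    capacity: u64,
}

impl StagingBump {
    /// Returns an aligned offset for `size` bytes, or None if the chunk is full.
    fn allocate(&mut self, size: u64) -> Option<u64> {
        let start = (self.offset + ALIGN - 1) & !(ALIGN - 1); // round up to ALIGN
        if start + size > self.capacity {
            return None;
        }
        self.offset = start + size;
        Some(start)
    }
}
```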