Make error plumbing less public in `wgpu-core`

udoprog commented 9 months ago

Is your feature request related to a problem? Please describe.

Thoughts while looking over issues like #5066

The internals of every error is currently part of the public API surface of this project. I'd propose as a policy making these details private unless an error variant or field propagates communicates something which is publicly actionable (See SurfaceError example below).

I believe this is better for at least two reasons:

The API documentation is more terse, allowing users to easily spot what is and isn't important.
Errors can be internally refactored to a greater extent without fear of breaking external use, which will be very useful in a post 1.0 world. While I believe all types are marked #[non_exhaustive], this still does not permit variants or fields to be removed.

Note that all errors should continue to support std::fmt::Display and std::error::Error::source since this allows for a user to print decently rich diagnostics and error chains:

An foo error happened.

Caused by: Bar tried to bar
Caused by: Baz tried to baz

An example of an error which is publicly actionable is SurfaceError::Outdated, which signals that a surface needs to be reconfigured.

Describe the solution you'd like

The pattern I usually deploy is something like this:

// relevant derives
pub struct SomeError {
     kind: SomeErrorKind,
}

// relevant derives
pub(crate) enum SomeErrorKind {
    /* .. internal details ... */
}

For e.g. SurfaceError, I'd probably convert the existing error variants to a simple:

impl SurfaceError {
    pub fn is_outdated(&self) -> bool {
        /* ... */
    }
}

Or, if there is public interest in more than one state:

impl SurfaceError {
    pub fn kind(&self) -> SurfaceErrorKind {
        /* ... */
    }
}

Like with io::Error::kind, the publicly exported SurfaceErrorKind doesn't have to match the internal representation if a representation is available which is more efficient than the publicly exported type.

Wumpf commented 9 months ago

I'd propose as a policy making these details private unless an error variant or field propagates communicates something which is publicly actionable

It's quite hard to tell though when something is actionable and when not. In the "worst case" somebody might want to count errors of a kind and then suddenly all internals are actionable since it's hard to keep statistics otherwise.

I agree that the large API surface can be problematic, but this is also part of the surface of wgpu_core not wgpu, so I'm a lot less concerned: most people are expected to interact with wgpu whereas wgpu_core is more of an implementation detail.

As for the other point of scattered documentation I think we should come up with a shared namespace for all exposed wgpu_core error types, that would improve things a lot without needing to shuffle around what's exposed.

udoprog commented 9 months ago

I might have overestimated how leaky error types from wgpu-core is. If there's no way to access them publicly it should be fine!

One could still lock down the API across crates, because dead code detection can help prune unused error variants. But that's less of a concern.

udoprog commented 9 months ago

I did a small experiment where I hid the error types in wgpu-core and republished them through public wrappers hiding them like proposed above. This allows dead code elimination to kick in, so if they're not produced internally in wgpu-core they're not used. This has resulted in about 15 error variants which are no longer used.

Note that I haven't hidden everything yet, but this might still be worthwhile doing to improve coherency between project components.

Dead code warnings in wgpu-core

``` warning: variant `InvalidBuffer` is never constructed --> wgpu-core\src\command\compute.rs:213:5 | 189 | pub(crate) enum ComputePassErrorInner { | --------------------- variant in this enum ... 213 | InvalidBuffer(id::BufferId), | ^^^^^^^^^^^^^ | = note: `ComputePassErrorInner` has derived impls for the traits `Debug` and `Clone`, but these are intentionally ignored during dead code analysis = note: `#[warn(dead_code)]` on by default warning: variants `SurfaceTextureDropped` and `OutOfMemory` are never constructed --> wgpu-core\src\command\render.rs:593:5 | 540 | pub(crate) enum RenderPassErrorInner { | -------------------- variants in this enum ... 593 | SurfaceTextureDropped, | ^^^^^^^^^^^^^^^^^^^^^ 594 | #[error("Not enough memory left for render pass")] 595 | OutOfMemory, | ^^^^^^^^^^^ | = note: `RenderPassErrorInner` has derived impls for the traits `Debug` and `Clone`, but these are intentionally ignored during dead code analysis warning: variants `MissingRenderAttachmentUsageFlag`, `InvalidDimensionExternal`, `InvalidCopySize`, and `ExternalCopyToForbiddenTextureFormat` are never constructed --> wgpu-core\src\command\transfer.rs:57:5 | 43 | pub(crate) enum TransferError { | ------------- variants in this enum ... 57 | MissingRenderAttachmentUsageFlag(TextureId), | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ... 81 | InvalidDimensionExternal(TextureId), | ^^^^^^^^^^^^^^^^^^^^^^^^ ... 103 | InvalidCopySize, | ^^^^^^^^^^^^^^^ ... 127 | ExternalCopyToForbiddenTextureFormat(wgt::TextureFormat), | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | = note: `TransferError` has derived impls for the traits `Debug` and `Clone`, but these are intentionally ignored during dead code analysis warning: variants `SurfaceOutputDropped` and `SurfaceUnconfigured` are never constructed --> wgpu-core\src\device\queue.rs:367:5 | 355 | pub(crate) enum QueueSubmitErrorKind { | -------------------- variants in this enum ... 367 | SurfaceOutputDropped, | ^^^^^^^^^^^^^^^^^^^^ 368 | #[error("Surface was unconfigured before the command buffer got submitted")] 369 | SurfaceUnconfigured, | ^^^^^^^^^^^^^^^^^^^ | = note: `QueueSubmitErrorKind` has derived impls for the traits `Debug` and `Clone`, but these are intentionally ignored during dead code analysis warning: variant `NoGraphicsQueue` is never constructed --> wgpu-core\src\instance.rs:437:5 | 427 | pub(crate) enum RequestDeviceErrorKind { | ---------------------- variant in this enum ... 437 | NoGraphicsQueue, | ^^^^^^^^^^^^^^^ | = note: `RequestDeviceErrorKind` has derived impls for the traits `Debug` and `Clone`, but these are intentionally ignored during dead code analysis warning: variant `InvalidMinMaxBlendFactors` is never constructed --> wgpu-core\src\pipeline.rs:391:5 | 376 | pub(crate) enum ColorStateError { | --------------- variant in this enum ... 391 | InvalidMinMaxBlendFactors(wgt::BlendComponent), | ^^^^^^^^^^^^^^^^^^^^^^^^^ | = note: `ColorStateError` has derived impls for the traits `Debug` and `Clone`, but these are intentionally ignored during dead code analysis warning: variant `Failed` is never constructed --> wgpu-core\src\resource.rs:325:5 | 321 | pub(crate) enum BufferAccessErrorKind { | --------------------- variant in this enum ... 325 | Failed, | ^^^^^^ | = note: `BufferAccessErrorKind` has derived impls for the traits `Debug` and `Clone`, but these are intentionally ignored during dead code analysis warning: variant `TooManyObjects` is never constructed --> wgpu-core\src\resource.rs:1410:5 | 1391 | pub(crate) enum CreateSamplerErrorKind { | ---------------------- variant in this enum ... 1410 | TooManyObjects, | ^^^^^^^^^^^^^^ | = note: `CreateSamplerErrorKind` has derived impls for the traits `Debug` and `Clone`, but these are intentionally ignored during dead code analysis warning: variant `UnsupportedTextureStorageAccess` is never constructed --> wgpu-core\src\validation.rs:218:5 | 184 | pub(crate) enum BindingErrorKind { | ---------------- variant in this enum ... 218 | UnsupportedTextureStorageAccess(naga::StorageAccess), | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | = note: `BindingErrorKind` has derived impls for the traits `Debug` and `Clone`, but these are intentionally ignored during dead code analysis ```

nical commented 9 months ago

It's nice to be able to get warnings about the dead code.

Off the top of my head, among the things that we need to be able to handle outside of wgpu-core:

differentiate between three broad categories of errors: validation, internal and out-of-memory,
special case device loss and the device loss reason.
special case invalid handles
ignore a few specific errors that aren't considered errors in the WebGPU spec, although arguably these probably should not be thrown in the first place. For example wgpu_core produces an error when a buffer is destroyed twice, but it's a valid thing to do per spec.

Wumpf commented 9 months ago

That's not barring the door on the general proposal (and I love how this found all that dead code now!) but over here hit a usecase for fairly deep inspection of the errors: https://github.com/rerun-io/rerun/blob/main/crates/re_renderer/src/error_handling/wgpu_core_error.rs#L131 this code tries to do a custom error de-duplication scheme in order to run with a managable amount of flood of errors whenever shader compilation failed (typically due to runtime supplied shaders). I suppose there's a point to be made that shader errors are special anyways and this doesn't affect most of it!

udoprog commented 9 months ago

So just pitching here one design I'd like to explore if there's interest, where error reporting and tracing is baked together into diagnostics. It's akin to something I call "abductive diagnostics" and I think wgpu can benefit from it as well.

The idea here would be to instead of propagating and wrapping hierarchies of Result's, we'd pass along an error context with which we can perform basic tracing:

impl Device<A> {
    pub(crate) fn create_bind_group(
        self: &Arc<Self>,
        cx: &mut Context<'_>, // <- this is a new argument every fallible function receives.
        layout: &Arc<BindGroupLayout<A>>,
        desc: &binding_model::BindGroupDescriptor,
        hub: &Hub<A>,
    ) -> Result<binding_model::BindGroup<A>, Error> { // <- Error is an empty marker type which is constructed through the context.
        /* .. */
    }
}

Note the ErrorMarker is a type which can only be constructed through context, to ensure that it is consulted before returning from a method.

With this we can trace structures:

for entry in desc.entries.iter() {
    let vertex_buffers_enter = cx.enter("vertex.buffers");

    /* .. */

    // do something fallible, every error which is actually captured by the context includes the semantic trace.
    // Context::result is a convenience function which captures any error and transform the `Result<T, E>` into `Result<T, Error>` which includes the marker type.
    let value = cx.result(do_something_fallible())?;

    // more than one error can be reported at a time
    if !is_valid1(entry) {
        cx.report(Error::IsNotValid1);
    }

    if !is_valid2(entry) {
        cx.report(Error::IsNotValid2);
    }

    cx.leave(vertex_buffers_enter);
}

Special conditions like OutOfMemory would then be handled directly by Context::report and tracked specifically so that it can be correctly propagated in handle_error. Or if a party is interested in some specific error condition they can iterate over error causes which have been collected in the Context.

The benefits would be the following:

We can provide structure (or tracing) to every error consistently. This seems to be a common requirement in errors handled by this library.
We can report and propagate more than one error at a time, which can be useful when validating to reduce turnaround during development.

Pretty printing errors could then be done like this consistently regardless of the kind of error raised, without having to implement it specifically as a trait:

Bind group entry with label "foo" in entries[1] is invalid:
    - Binding count declared with at most 4 items, but 2 items were provided
    - Buffer binding size 10 is less than minimum 4

udoprog commented 9 months ago

So I've built a working prototype of what I suggested above in my a branch here. For a brief overview, it's similar to how mature compilers report errors since they tend to be quite rich in context. I think it would mitigate quite a few of the issues which relate to unhelpful error messages.

If we adopt this pattern we can:

Report more than one error (or warning even) at a time.
Include a trace to the part of a descriptor which caused an error (like vertex.buffers[0].array_stride) for every error. This is tracked separately from the error hierarchy, so error variants don't have to include this.
Consistently record and propagate important error conditions (e.g. out of memory errors, shader compilation errors, ...).

The pattern is mostly backwards compatible and can be incrementally implemented - the prototype only changes three functions for now and doesn't modify existing error hierarchies.

@cwfitzgerald When you're available, can I get some quick feedback before I put any mileage into it, since it would be a pretty significant change to wgpu-core?

ErichDonGubler commented 8 months ago

I haven't been able to digest this the way I think is necessary to get traction on a discussion here, but I think I'm going to have bandwidth to discuss more details in the middle of next week. 🫡

gfx-rs / wgpu

Make error plumbing less public in `wgpu-core` #5071