gfx-rs / wgpu

A cross-platform, safe, pure-Rust graphics API.
https://wgpu.rs
Apache License 2.0

Error model #638

Closed · kvark closed this issue 4 years ago

kvark commented 4 years ago

Our current error model is a bunch of assert! calls in the function bodies. This can be made safe (in Rust's sense of safety), but it doesn't match the upstream spec, and it's not desirable as a model exposed to users.

The upstream has a model like "internal contagious nullability": things, at some point after they are used incorrectly, become internally invalid. They spread that invalid state onto anything that uses them, and so forth. The user then gets the exact problem asynchronously, through an error scope.

I wonder if we can start reshaping the current model to propagate errors up to the user instead of asserting. This means native users would get the errors right away, exactly the way they expect a Rust library to report them. It shouldn't block us from implementing the error scopes API for the Web specifically; that API is just going to be unused (or underused) when compiling for native.
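For illustration, here is a minimal sketch of that shift, with entirely hypothetical types and function names (not wgpu-core's actual API): the same validation either panics via assert!, or is surfaced to the native caller as a Result.

```rust
use std::ops::Range;

#[derive(Debug)]
struct Buffer {
    size: u64,
    mappable_for_read: bool,
}

#[derive(Debug)]
enum BufferAccessError {
    NotMappable,
    OutOfBounds { end: u64, size: u64 },
}

// Current style: validation panics inside the function body.
#[allow(dead_code)]
fn map_read_asserting(buffer: &Buffer, range: Range<u64>) {
    assert!(buffer.mappable_for_read, "buffer was not created as mappable for read");
    assert!(range.end <= buffer.size, "mapped range exceeds buffer size");
    // ... perform the map
}

// Proposed style for native users: the same validation propagates a Result.
fn map_read(buffer: &Buffer, range: Range<u64>) -> Result<(), BufferAccessError> {
    if !buffer.mappable_for_read {
        return Err(BufferAccessError::NotMappable);
    }
    if range.end > buffer.size {
        return Err(BufferAccessError::OutOfBounds { end: range.end, size: buffer.size });
    }
    // ... perform the map
    Ok(())
}

fn main() {
    let buffer = Buffer { size: 256, mappable_for_read: false };
    // The caller sees the problem immediately instead of hitting a panic.
    assert!(matches!(map_read(&buffer, 0..16), Err(BufferAccessError::NotMappable)));
}
```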

kvark commented 4 years ago

Thinking about this more: at the lowest level, the errors are synchronous (gfx-hal, validation on the server). At the highest level, errors are deferred (the CIN model of the WebGPU API). There is therefore a layer in the middle where we'd start deferring errors. Once we do that, all the higher levels are left with no choice.

One option is to handle all errors in wgpu-core and return Result values to Rust users. That would mean Gecko, Servo, and wgpu-native each need to implement the error model on their side. If there turns out to be too much code/logic duplication between them, it would make sense to move it into wgpu-core and have the errors deferred there.

kvark commented 4 years ago

Marking for the 0.6 release. We've already started converting the methods to return errors, e.g. #706. The idea is that this is the most sensible behavior for wgpu-core, and the lazy WebGPU semantics can and should be implemented on top. Moreover, the mechanism of error propagation in browsers is very implementation-specific, so having Servo and Gecko implement it separately is needed anyway.
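As a rough, purely illustrative sketch of that layering (all names here are made up, not wgpu's or any browser's API): the core call returns a Result synchronously, and a scope layer on top defers the error for later reporting, roughly in the spirit of the CIN model.

```rust
use std::collections::VecDeque;

#[derive(Debug)]
struct ValidationError(String);

// Synchronous core behavior: the error comes back immediately.
fn core_create_buffer(size: u64) -> Result<u64 /* hypothetical raw id */, ValidationError> {
    if size == 0 {
        return Err(ValidationError("buffer size must be non-zero".into()));
    }
    Ok(42)
}

// A browser-side layer can defer the same error into an error scope,
// implementing the lazy "internal contagious nullability" reporting on top.
#[derive(Default)]
struct ErrorScope {
    errors: VecDeque<ValidationError>,
}

impl ErrorScope {
    fn create_buffer(&mut self, size: u64) -> Option<u64> {
        match core_create_buffer(size) {
            Ok(id) => Some(id),
            Err(e) => {
                self.errors.push_back(e); // reported later, when the scope is popped
                None                      // the object stays internally invalid
            }
        }
    }

    fn pop_error(&mut self) -> Option<ValidationError> {
        self.errors.pop_front()
    }
}

fn main() {
    let mut scope = ErrorScope::default();
    assert!(scope.create_buffer(0).is_none());
    println!("deferred error: {:?}", scope.pop_error());
}
```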

kvark commented 4 years ago

One unknown here is how exactly the browsers will handle it, and how they will associate IDs with something that is in an "error" state and thus failed to even be created. @kunalmohan has hit this problem now in Servo, and I've been thinking about the ways forward. We figured out that in a multiprocess implementation, the client doesn't even need to know whether something is in an error state, unless the user asks for it specifically, so everything should be handled by the server. Here are some ideas on how to do it:

1. Invalid Id errors

If something failed to create, return an error into the error scope, and don't associate anything with the ID. Don't free the ID either, until the user drops all references to it on the client side. Whenever an ID is used, we generally access our storages by it, panicking if either the index or the epoch mismatches. The idea is that a mismatched ID can only happen if another process has been compromised, or something really bad happened, in which case crashing the GPU process is fine. What we could do instead is return something like an Error::InvalidBufferId(Id<Buffer>). This makes error IDs unusable, as desired, but the server would also need to destroy the things that are affected (i.e. if you try to use an error buffer ID in a command buffer, it's not enough to ignore the operation; we need to destroy the command buffer too and turn it into the error state).
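A minimal sketch of what idea (1) could look like, with simplified, hypothetical Id/Storage types rather than the real wgpu-core ones:

```rust
use std::marker::PhantomData;

#[derive(Debug)]
struct Id<T> {
    index: usize,
    epoch: u32,
    _marker: PhantomData<T>,
}

#[derive(Debug)]
struct Buffer;

#[derive(Debug)]
enum Error {
    InvalidBufferId(Id<Buffer>),
}

struct Storage<T> {
    // Each slot stores the epoch it was created with, alongside the value.
    slots: Vec<Option<(u32, T)>>,
}

impl<T> Storage<T> {
    // Checked lookup: an index/epoch mismatch becomes an error instead of a panic.
    fn get(&self, id: Id<T>) -> Result<&T, Id<T>> {
        match self.slots.get(id.index) {
            Some(Some((epoch, value))) if *epoch == id.epoch => Ok(value),
            _ => Err(id),
        }
    }
}

fn use_buffer_in_command_buffer(storage: &Storage<Buffer>, id: Id<Buffer>) -> Result<(), Error> {
    let _buffer = storage.get(id).map_err(Error::InvalidBufferId)?;
    // ... record the operation; if this failed, the enclosing command buffer
    // would itself have to be turned into an error state, not just skip the op.
    Ok(())
}

fn main() {
    let storage = Storage::<Buffer> { slots: vec![None] };
    let stale = Id { index: 0, epoch: 1, _marker: PhantomData };
    assert!(use_buffer_in_command_buffer(&storage, stale).is_err());
}
```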

2. Baddie list

The problem with (1) is that we would no longer know on the server whether something is really wrong (e.g. we are compromised) or just some errors are happening in user code. There is another way: keep a set of error IDs on the server side (in a layer between the IPC and wgpu-core). Whenever some commands are received, the server can scan them, check whether any dependencies are in that list, and react accordingly. This keeps the wgpu-core semantics cleaner, and I like this solution a tiny bit better.

There is a bit of a sub-problem with render/compute passes, where it would take an extra step to scan the commands for all the resources they use. We can either do it naively, or the client side can send a list of used resources to the server alongside a pass. The client side has to keep them alive anyway for the duration of the pass being recorded, so it already has the condensed list. If there is any de-synchronization between the list and what's actually used, we'll get a panic on accessing a non-existing ID, so it's fine.
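A hedged sketch of the "baddie list" idea, assuming hypothetical command and ID types (the real layer would sit between the IPC code and wgpu-core), including the case where the client sends the pass's used-resource list alongside the pass:

```rust
use std::collections::HashSet;

type RawId = u64;

#[derive(Debug)]
enum Command {
    CopyBufferToBuffer { src: RawId, dst: RawId },
    // A pass arrives together with the condensed list of resources that the
    // client already keeps alive while recording it.
    RunRenderPass { used_resources: Vec<RawId> },
}

#[derive(Default)]
struct ServerErrorState {
    // IDs that failed creation; nothing was ever associated with them in wgpu-core.
    baddies: HashSet<RawId>,
}

impl ServerErrorState {
    fn mark_error(&mut self, id: RawId) {
        self.baddies.insert(id);
    }

    // Returns the first invalid dependency, if any; the caller would then skip the
    // command (or propagate the error state) instead of calling into wgpu-core.
    fn check(&self, command: &Command) -> Result<(), RawId> {
        let deps: Vec<RawId> = match command {
            Command::CopyBufferToBuffer { src, dst } => vec![*src, *dst],
            Command::RunRenderPass { used_resources } => used_resources.clone(),
        };
        match deps.into_iter().find(|id| self.baddies.contains(id)) {
            Some(bad) => Err(bad),
            None => Ok(()),
        }
    }
}

fn main() {
    let mut state = ServerErrorState::default();
    state.mark_error(7); // e.g. a buffer that failed validation at creation
    let copy = Command::CopyBufferToBuffer { src: 7, dst: 3 };
    assert_eq!(state.check(&copy), Err(7));
    let pass = Command::RunRenderPass { used_resources: vec![1, 2, 3] };
    assert!(state.check(&pass).is_ok());
}
```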

kvark commented 4 years ago

Let's consider this blocked by #772

kvark commented 4 years ago

Here is an evolved proposal from https://github.com/gfx-rs/wgpu/issues/638#issuecomment-655495321, based on the new data from #776.

First, we proceed with #772 and #776, leaving the Index implementations as is, so that they panic on anything but Occupied. This indexing is what we are going to keep using internally in most cases.

The places where we do need checking are the top-level usages of IDs provided directly by the users. So, for example, the user provides a bind group ID. We need to check whether it's an Error, but we do not need to check any resources it points to, including the views, textures, and buffers, for being an Error: they can only become errors if they failed at creation, in which case the bind group would also have been an error to start with.

Next, we separate the internal IDs from external IDs, precisely so that we know at the type level which of them are expected to be valid (internal IDs) and which we need to check (UserId). That separation also means they live in different crates: user IDs live in wgpu-types, while internal IDs only need to live in wgpu-core. I think it's neat because we can call both of them Id.

Finally, the only way to resolve a user ID is via a new method that we are going to add to Registry (or Tracker). It will basically return a Result<&T, InvalidId>, allowing us to skip the invalid operations and return errors from them without changing any of the internal logic.
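A possible shape for this, purely as a sketch with hypothetical names (the eventual Registry API may well differ):

```rust
use std::collections::HashMap;

// External handle coming directly from the user; the proposal puts this in wgpu-types.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct UserId(u64);

#[derive(Debug)]
struct InvalidId;

struct Registry<T> {
    // `None` marks an ID whose resource failed to be created.
    storage: HashMap<UserId, Option<T>>,
}

impl<T> Registry<T> {
    // The single checked entry point for user-provided IDs; internal IDs would keep
    // using the panicking Index implementations, since they are expected to be valid.
    fn resolve_user_id(&self, id: UserId) -> Result<&T, InvalidId> {
        match self.storage.get(&id) {
            Some(Some(value)) => Ok(value),
            _ => Err(InvalidId),
        }
    }
}

fn set_bind_group(registry: &Registry<String>, id: UserId) -> Result<(), InvalidId> {
    // Only this top-level lookup is checked; the resources the bind group points to
    // must already be valid, or the bind group could not have been created.
    let _bind_group = registry.resolve_user_id(id)?;
    Ok(())
}

fn main() {
    let mut registry = Registry { storage: HashMap::new() };
    registry.storage.insert(UserId(1), Some("bind group".to_string()));
    registry.storage.insert(UserId(2), None); // creation failed, error slot
    assert!(set_bind_group(&registry, UserId(1)).is_ok());
    assert!(set_bind_group(&registry, UserId(2)).is_err());
}
```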

GabrielMajeri commented 4 years ago

I've analysed the code; these are the files that still need safe error handling:

kvark commented 4 years ago

Thank you, this is a great analysis @GabrielMajeri! Some of the hub-related panics can stay, since they can only occur when something really bad is going on, like we receive garbage IDs, the ID management is broken, or the client process is compromised. So we'll need to look at those last and sort out the other ones first.

kvark commented 4 years ago

The conversion of functions to return Result is now complete with #845, thanks to @GabrielMajeri's efforts! What's left is the separation between "internal" IDs and "external" IDs. The external ones will have to check for the corresponding Error variant in the hub and handle it gracefully. The internal ones would just have Error => unreachable!().
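As a rough sketch of that distinction, assuming a hub element type with an Error variant (the names here are illustrative only, not the actual wgpu-core types):

```rust
// A hub slot can be vacant, occupied, or an error left over from a failed creation.
#[allow(dead_code)]
enum Element<T> {
    Vacant,
    Occupied(T),
    Error,
}

impl<T> Element<T> {
    // External (user-provided) IDs: the Error slot is handled gracefully.
    fn get_external(&self) -> Result<&T, &'static str> {
        match self {
            Element::Occupied(value) => Ok(value),
            Element::Error => Err("resource is in an error state"),
            Element::Vacant => Err("unknown id"),
        }
    }

    // Internal IDs: an Error slot can only mean a bug in our own bookkeeping.
    #[allow(dead_code)]
    fn get_internal(&self) -> &T {
        match self {
            Element::Occupied(value) => value,
            Element::Error => unreachable!("internal ids never point at error slots"),
            Element::Vacant => panic!("dangling internal id"),
        }
    }
}

fn main() {
    let slot: Element<u32> = Element::Error;
    assert!(slot.get_external().is_err());
}
```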