filecoin-project / builtin-actors

The Filecoin built-in actors
Other
80 stars 75 forks source link

Use custom error types and remove error downcasting #52

Open Stebalien opened 2 years ago

Stebalien commented 2 years ago

Currently, we have a lot of "anyhow" errors and rely on error downcasting to figure out the right exit code. Unfortunately, this makes it very difficult to figure out what the exit code should be.

Instead, we'd ideally:

  1. Abort immediately once we know the right exit code (#282).
  2. Return situation-specific errors from state types like partitions & deadlines, so we can figure out the correct exit code.
jennijuju commented 2 years ago

from zen: make sense to do before the audit, but shouldn't be a blocker

anorth commented 2 years ago

197 , #214 does a heap of the work for this (not yet merged, needs stewardship)

anorth commented 1 year ago

I've spent some time investigating how to improve things here, but it turns out to be quite complex. In our current situation we have:

Some problems with this state:

Why do we have downcasting at all? There are some places where we can generate an ActorError with exit code but not propagate it directly, usually due to library code. E.g. the HAMT/AMT for_each() method with return an Error::Dynamic containing an anyhow::Error that's actually an ActorError raised from the callback.

In #1322 I tried replacing a bunch of map and downcast patterns with more concise trait methods. It's on hold at present because a few tests fail due to changes in exit codes, as a code generated inside for_each was trashed (e.g. this USR_ILLEGAL_ARGUMENT in remove partitions).

1338 is another attempt that aimed to avoid inadvertent trashing of exit codes, but it's stuck by the realisation that downcasting needs global knowledge of all the error types that could hide an ActorError, which it doesn't have.

So what shall we do? Some options:

  1. Try to get it "right", with the global knowledge of error types and robust downcasting. We can probably still find more concise patterns than map_err+downcast. Support generating ActorError deep in the code and propagating with ?. Something like leaning into #1338
  2. Give up on preserving exit codes generated low down in the code. Support ActorError deep in the code and ? propagation, but usually only ILLEGAL_STATE. Other exit codes can be generated in top-level actor code. Something like #1322.
  3. Implement specific error types for state objects that enumerate error conditions. Stop returning ActorError from state methods, instead only generate exit code in top-level actor business logic.
  4. Make the abort() syscall available in state methods (perhaps via a new very slim runtime trait passed to them) and call it directly in state code, with no opportunity for actor code to handle the error or adapt the exit code. Implement mock runtime and test VM etc to use panic or similar to support this.
Stebalien commented 1 year ago

E.g. the HAMT/AMT for_each() method with return an Error::Dynamic containing an anyhow::Error that's actually an ActorError raised from the callback.

We should implement iteration with external iterators instead to fix this issue. I have a WIP patch locally that I can try to cleanup and ship if there's interest.

That should remove a large part of this issue.

Give up on preserving exit codes generated low down in the code.

In many cases, I believe this is the right thing to do. Most of these errors should just be "illegal state".

Implement specific error types for state objects that enumerate error conditions. Stop returning ActorError from state methods, instead only generate exit code in top-level actor business logic.

IMO, this is the "correct" solution. It's just really annoying which is why most rust users just use crates like anyhow and yolo it.

But yeah, we can do this where it matters. I.e.:

  1. Implement custom error types.
  2. Implement From<CustomErrorType> for ActorError wherever we care about conversions.

Make the abort() syscall available in state methods (perhaps via a new very slim runtime trait passed to them) and call it directly in state code, with no opportunity for actor code to handle the error or adapt the exit code. Implement mock runtime and test VM etc to use panic or similar to support this.

I like the idea of aborting early but... I'd prefer not to do this in all cases:

  1. IMO, we should abort directly if some state invariant is violated. Specifically: state block doesn't exist, state fails to decode, etc.
  2. IMO, we shouldn't abort directly if, e.g., the user passes an illegal argument, something isn't found, etc. (the actor may want to react to this in some way?).

However, I'm concerned about using the runtime here as it may be hard to thread it through to the very bottom (and makes it harder to reason about what the code can do). Instead, I'd consider exposing it as a bare function that:

  1. Panics when compiled to a native target. We can smuggle the exit code through the panic and catch the panic at the top-level.
  2. Calls the exit syscall otherwise.
anorth commented 1 year ago

External iterators would be great. And then we could "give up" on doing better than "illegal state" for remaining hard cases.

Your proposal re aborting sounds reasonable, and could clean up some boilerplate. In mechanism, why not just call the runtime's 'abort' anyway and have that panic (and catch it out the other side). Conditional compilation and static cling already causes headaches in this repo, I would rather avoid adding any more that we can.