Stebalien commented 2 years ago

Currently, we have a lot of "anyhow" errors and rely on error downcasting to figure out the right exit code. Unfortunately, this makes it very difficult to figure out what the exit code should be.

Instead, we'd ideally:

Abort immediately once we know the right exit code (#282).
Return situation-specific errors from state types like partitions & deadlines, so we can figure out the correct exit code.

jennijuju commented 2 years ago

from zen: make sense to do before the audit, but shouldn't be a blocker

anorth commented 2 years ago

197 , #214 does a heap of the work for this (not yet merged, needs stewardship)

anorth commented 1 year ago

I've spent some time investigating how to improve things here, but it turns out to be quite complex. In our current situation we have:

A pattern of [Result].map_err(|e| e.downcast_default(...)) for adapting non-ActorErrors into ActorError. It's quite verbose.
A trait AsActorError implemented for Result<T, E: Display> with context_code() and similar methods to adapt into an Result<T, ActorError> with a specified exit code. These are concise to use but don't attempt any downcasting to discover wrapped exit codes.

Some problems with this state:

The map and downcast pattern isn't uniformly used. There are plenty of instances of mapping to a concrete ActorError without attempting to downcast. The pattern is not obvious enough to apply consistently.
The downcasting implementation relies on global knowledge of all the types that could be wrapping an ActorError. It's incomplete, e.g. it's missing fvm_ipld_kamt::Errorand the frc46 token StateError (which can wrap an fvm_ipld_hamt::Error). In Rust, we'd need to import all these error types into the module defining the trait in order to implement it, and could easily miss new ones. The approach doesn't scale to more different error types.
Using AsActorError::context_code() can fail to propagate an ActorError nested inside, e.g. hamt::Error::Dynamic.
Because ActorError implements Display, the context_code() method can be inadvertently used to overwrite an exit code even when it's not hidden inside another error type

Why do we have downcasting at all? There are some places where we can generate an ActorError with exit code but not propagate it directly, usually due to library code. E.g. the HAMT/AMT for_each() method with return an Error::Dynamic containing an anyhow::Error that's actually an ActorError raised from the callback.

In #1322 I tried replacing a bunch of map and downcast patterns with more concise trait methods. It's on hold at present because a few tests fail due to changes in exit codes, as a code generated inside for_each was trashed (e.g. this USR_ILLEGAL_ARGUMENT in remove partitions).

1338 is another attempt that aimed to avoid inadvertent trashing of exit codes, but it's stuck by the realisation that downcasting needs global knowledge of all the error types that could hide an ActorError, which it doesn't have.

So what shall we do? Some options:

Try to get it "right", with the global knowledge of error types and robust downcasting. We can probably still find more concise patterns than map_err+downcast. Support generating ActorError deep in the code and propagating with ?. Something like leaning into #1338
Give up on preserving exit codes generated low down in the code. Support ActorError deep in the code and ? propagation, but usually only ILLEGAL_STATE. Other exit codes can be generated in top-level actor code. Something like #1322.
Implement specific error types for state objects that enumerate error conditions. Stop returning ActorError from state methods, instead only generate exit code in top-level actor business logic.
Make the abort() syscall available in state methods (perhaps via a new very slim runtime trait passed to them) and call it directly in state code, with no opportunity for actor code to handle the error or adapt the exit code. Implement mock runtime and test VM etc to use panic or similar to support this.

Stebalien commented 1 year ago

E.g. the HAMT/AMT for_each() method with return an Error::Dynamic containing an anyhow::Error that's actually an ActorError raised from the callback.

We should implement iteration with external iterators instead to fix this issue. I have a WIP patch locally that I can try to cleanup and ship if there's interest.

That should remove a large part of this issue.

Give up on preserving exit codes generated low down in the code.

In many cases, I believe this is the right thing to do. Most of these errors should just be "illegal state".

Implement specific error types for state objects that enumerate error conditions. Stop returning ActorError from state methods, instead only generate exit code in top-level actor business logic.

IMO, this is the "correct" solution. It's just really annoying which is why most rust users just use crates like anyhow and yolo it.

But yeah, we can do this where it matters. I.e.:

Implement custom error types.
Implement From<CustomErrorType> for ActorError wherever we care about conversions.

Make the abort() syscall available in state methods (perhaps via a new very slim runtime trait passed to them) and call it directly in state code, with no opportunity for actor code to handle the error or adapt the exit code. Implement mock runtime and test VM etc to use panic or similar to support this.

I like the idea of aborting early but... I'd prefer not to do this in all cases:

IMO, we should abort directly if some state invariant is violated. Specifically: state block doesn't exist, state fails to decode, etc.
IMO, we shouldn't abort directly if, e.g., the user passes an illegal argument, something isn't found, etc. (the actor may want to react to this in some way?).

However, I'm concerned about using the runtime here as it may be hard to thread it through to the very bottom (and makes it harder to reason about what the code can do). Instead, I'd consider exposing it as a bare function that:

Panics when compiled to a native target. We can smuggle the exit code through the panic and catch the panic at the top-level.
Calls the exit syscall otherwise.

anorth commented 1 year ago

External iterators would be great. And then we could "give up" on doing better than "illegal state" for remaining hard cases.

Your proposal re aborting sounds reasonable, and could clean up some boilerplate. In mechanism, why not just call the runtime's 'abort' anyway and have that panic (and catch it out the other side). Conditional compilation and static cling already causes headaches in this repo, I would rather avoid adding any more that we can.

filecoin-project / builtin-actors

Use custom error types and remove error downcasting #52

197 , #214 does a heap of the work for this (not yet merged, needs stewardship)

1338 is another attempt that aimed to avoid inadvertent trashing of exit codes, but it's stuck by the realisation that downcasting needs global knowledge of all the error types that could hide an ActorError, which it doesn't have.