filecoin-project / specs-actors

DEPRECATED Specification of builtin actors, in the form of executable code.
Other
86 stars 102 forks source link

Defensive programming: harden power actor against seal verify failures #1502

Closed ZenGround0 closed 3 years ago

ZenGround0 commented 3 years ago

While aborts from processBatchProofVerifies are all believed to be the result of programmer error or individual validator system error aborting from this code presents unnecessary risk if this is wrong. This PR changes these aborts to errors which are then logged and ignored from top level power cron.

The impact is that now in case of programmer error a batch of seal proofs will fail to validate and propagate power to storage providers instead of aborting power cron and either halting the chain or taking out 1/60th of sps from deadline cron.

In the worst case where programmer error is global all prove commits (not prove commit aggregates) would be dropped until programmer error fixed. In a more plausible state where programmer error is related to contents of the validation batch prove commits for one epoch would be dropped and then processing would continue as normal.

Reiterating that programmer error is not expected to exist but now defense against it is better.

codecov-commenter commented 3 years ago

Codecov Report

Merging #1502 (6eca502) into master (a2369c5) will decrease coverage by 0.1%. The diff coverage is 29.1%.

@@           Coverage Diff            @@
##           master   #1502     +/-   ##
========================================
- Coverage    71.5%   71.4%   -0.2%     
========================================
  Files          72      72             
  Lines        8620    8634     +14     
========================================
- Hits         6167    6165      -2     
- Misses       1563    1572      +9     
- Partials      890     897      +7     
ZenGround0 commented 3 years ago

I'm a little unsure about this, only because I don't know how safe it is to absorb and ignore all of these cases.

The point of this PR is that it is safer than the alternative of aborting which will leave deadline cron in a bad state. This bad state is totally avoided at the expense of killing prove commit verifies using the approach in this PR. This is a good tradeoff. A few storage Providers burning gas in the event of an emergency is much better than a significant portion of system power being frozen out of cron.

ZenGround0 commented 3 years ago

Can we come up with a system to scream bloody murder if we do wind up erroring?

Yes all actors log rt.ERROR values should be escalated with high severity, they all represent dangerous programmer error.