eigerco / polka-storage

The Polka Storage Parachain Project
https://eigerco.github.io/polka-storage-book
Apache License 2.0

Error recovery in the node #493

Open th7nder opened 3 weeks ago

th7nder commented 3 weeks ago
Error reporting: we should log errors and eventually add them to RocksDB.

About the retries, it depends on where the error happened. If it was a disk failure, retrying might not be worth it.

_Originally posted by @jmg-duarte in https://github.com/eigerco/polka-storage/pull/483#discussion_r1824096430_

E.g. when a piece was being added and the operation was cancelled (node shutdown), on boot we need to find the pieces that were cancelled and retry them.

jmg-duarte commented 3 weeks ago

More details:

But how do we know we should retry it? When cancelling, we need to place some state in the database that indicates that AddPiece with certain parameters was cancelled. Or better yet, add it at the beginning and remove it when add_piece finishes.
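The "add it at the beginning, remove it when add_piece finishes" idea could be sketched roughly as below. This is only an illustration: `PendingPieces`, `PieceKey`, and the method names are hypothetical, and an in-memory `BTreeMap` stands in for a persistent store such as RocksDB.

```rust
use std::collections::BTreeMap;

/// Hypothetical key identifying one AddPiece request (e.g. a deal ID).
type PieceKey = u64;

/// Stand-in for a persistent store such as RocksDB (assumption: a real
/// node would write these markers to disk so they survive a shutdown).
#[derive(Default)]
struct PendingPieces {
    in_progress: BTreeMap<PieceKey, String>,
}

impl PendingPieces {
    /// Record the intent, with its parameters, before any work starts.
    fn begin(&mut self, key: PieceKey, params: &str) {
        self.in_progress.insert(key, params.to_owned());
    }

    /// Clear the marker once add_piece has finished.
    fn finish(&mut self, key: PieceKey) {
        self.in_progress.remove(&key);
    }

    /// Every marker still present belongs to an interrupted call
    /// and is therefore a retry candidate.
    fn unfinished(&self) -> Vec<(PieceKey, String)> {
        self.in_progress
            .iter()
            .map(|(key, params)| (*key, params.clone()))
            .collect()
    }
}
```

With this shape, a cancelled call needs no special handling at cancellation time: the marker simply stays in the store and shows up in `unfinished()` on the next boot.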

Also, is AddPiece idempotent? If retrying is fine, it indicates that it may be — https://github.com/eigerco/polka-storage/pull/483/files#r1824853789

You're right, AddPiece is not idempotent at this stage, as it'll create a new sector each time it is called. It is cancellation safe though, because when interrupted it does not leave the state corrupted. It won't add the deal to a sector, the SP will be slashed, and that's the end of the story.

About retries, right... We don't have a mechanism for that yet. We'd need to store the state of the piece somewhere (modify the deal storage, because AddPiece can be called again solely based on published deal data) and have some scanner, run periodically or on startup, that checks whether something was in progress but not finished, cleans it up, and calls AddPiece again.
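A minimal sketch of such a startup scan, under stated assumptions: `PieceState`, `recover_on_startup`, and the `u64` deal ID are hypothetical names, the map stands in for the deal storage, and the `add_piece` closure stands in for the real AddPiece call. Because AddPiece is not idempotent, a real version would also need the cleanup step before retrying, which is omitted here.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-deal state kept in persistent storage.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PieceState {
    InProgress,
    Added,
}

/// Scan the stored states and retry every deal left `InProgress`.
/// Returns the deal IDs whose retry failed again, so the caller can
/// log them or give up (e.g. after a disk failure).
fn recover_on_startup<F>(states: &mut BTreeMap<u64, PieceState>, mut add_piece: F) -> Vec<u64>
where
    F: FnMut(u64) -> Result<(), String>,
{
    // Collect first so the map is not borrowed while it is updated.
    let interrupted: Vec<u64> = states
        .iter()
        .filter(|(_, state)| **state == PieceState::InProgress)
        .map(|(deal_id, _)| *deal_id)
        .collect();

    let mut still_failing = Vec::new();
    for deal_id in interrupted {
        match add_piece(deal_id) {
            Ok(()) => {
                states.insert(deal_id, PieceState::Added);
            }
            // Leave the state as InProgress so a later scan sees it again.
            Err(_) => still_failing.push(deal_id),
        }
    }
    still_failing
}
```

Returning the failed IDs instead of retrying in a loop leaves the retry policy to the caller, which matters if some errors are not worth retrying.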

Maybe can be done as part of #493? — https://github.com/eigerco/polka-storage/pull/483/files#r1825948465

As a takeaway, we'll need to track the pieces in some persistent storage (like RocksDB) and perform checks and retries.