filecoin-project / lotus

Reference implementation of the Filecoin protocol, written in Go
https://lotus.filecoin.io/

Documenting Sequences that lead to zero bytes transfers #7783

Closed hannahhoward closed 2 years ago

hannahhoward commented 2 years ago

Several storage providers have identified that transfers can sometimes get stuck at "zero bytes" on their miners. Working with a test miner and Estuary, I have so far identified the following sequences that seem to produce these "zero byte transfers":

  1. Misapplied network error causes restarts out of sync
     a. A network error occurs on graphsync request A, triggering a restart from Estuary
     b. When a network error occurs, it often occurs multiple times -- and sometimes failed messages can take up to a minute to finish transmitting errors (see https://github.com/ipfs/go-graphsync/issues/314)
     c. The provider receives the restart and sends a new graphsync request B. Request B reaches Estuary and is queued
     d. Estuary receives the late network error from graphsync request A and mistakenly interprets it as an error from the second request -- this is a bug: see https://github.com/filecoin-project/go-data-transfer/issues/288
     e. Estuary triggers a second restart request
     f. The provider cancels the first request and triggers a new outgoing graphsync request C
     g. Estuary eventually begins processing the first graphsync request, sending data that most likely just gets dropped

Proposed solution: fix https://github.com/ipfs/go-graphsync/issues/314 and https://github.com/filecoin-project/go-data-transfer/issues/288
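For context, here is a minimal sketch (not the actual go-data-transfer code; `channelState`, `RequestID`, and `onNetworkError` are hypothetical names) of the kind of check https://github.com/filecoin-project/go-data-transfer/issues/288 calls for: a late error belonging to an already-superseded graphsync request should be dropped rather than applied to the request that replaced it.

```go
// Sketch only: discard late network errors from a superseded graphsync
// request instead of treating them as errors on the current request.
package main

import (
	"fmt"
	"sync"
)

type RequestID uint64

type channelState struct {
	mu               sync.Mutex
	currentRequestID RequestID // the graphsync request currently driving this transfer
}

// onNetworkError is called when graphsync reports a network error for a request.
// If the error belongs to an older, already-restarted request, it is ignored
// so it cannot trigger a spurious second restart.
func (c *channelState) onNetworkError(id RequestID, err error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if id != c.currentRequestID {
		fmt.Printf("ignoring late error from superseded request %d: %v\n", id, err)
		return
	}
	fmt.Printf("restarting transfer after error on request %d: %v\n", id, err)
	// ... trigger the restart, which supersedes the current request ...
	c.currentRequestID++
}

func main() {
	c := &channelState{currentRequestID: 2}            // request A (1) was already restarted as B (2)
	c.onNetworkError(1, fmt.Errorf("stream reset"))    // late error from A: dropped
	c.onNetworkError(2, fmt.Errorf("stream reset"))    // genuine error on B: restart
}
```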

  2. Never got to top of queue error
     a. The provider receives and accepts a data transfer request
     b. It queues the outgoing graphsync request
     c. Due to the simultaneous outgoing request limit (20 by default, I think), the graphsync request is not sent for several hours
     d. Before the request reaches the top of the queue, the markets node restarts
     e. Upon restart, the deal enters "storage provider await restart". Nothing else happens.
     f. Estuary will eventually hit the 24 hour accept timeout and cancel the deal -- the provider may or may not receive the timeout

  3. Failed while restarting
     a. A data transfer is in progress when the provider restarts, going offline
     b. The provider is offline long enough for Estuary to restart the transfer enough times that it decides to fail the transfer
     c. The provider comes back online and enters StorageProviderAwaitingRestart, but never hears anything else because the transfer has been cancelled on the Estuary side

Proposed solution: Currently the provider neither monitors for restarts nor attempts them; the thinking is that the client may not be reachable if it is not already connected. However, this has the side effect that the provider has no way to detect transfers that have essentially failed. A better approach would be to attempt restarts but not fail the transfer if they don't go through. Moreover, it makes sense for AwaitRestart to have a timeout.
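A rough sketch of that idea, assuming hypothetical names (`awaitRestart`, `tryRestart`, `failTransfer`, `resumed`, and the timeout value are not the real go-data-transfer API or defaults): the provider attempts a best-effort restart, does not fail if the attempt errors, and only fails the transfer if nothing is heard within a bounded window.

```go
// Sketch only: a bounded wait in an "awaiting restart" state. The provider
// attempts a restart but keeps waiting if the client is unreachable; the
// transfer fails only if no activity arrives before the timeout.
package main

import (
	"context"
	"fmt"
	"time"
)

const awaitRestartTimeout = 1 * time.Hour // assumption, not a real default

func awaitRestart(ctx context.Context, tryRestart func(context.Context) error,
	resumed <-chan struct{}, failTransfer func(reason string)) {

	// Best-effort restart: log and keep waiting if the client is offline.
	if err := tryRestart(ctx); err != nil {
		fmt.Printf("restart attempt failed (client may be offline): %v\n", err)
	}

	select {
	case <-resumed:
		fmt.Println("transfer resumed")
	case <-time.After(awaitRestartTimeout):
		failTransfer("no activity after restart within timeout")
	case <-ctx.Done():
	}
}

func main() {
	resumed := make(chan struct{})
	go func() { time.Sleep(100 * time.Millisecond); close(resumed) }()
	awaitRestart(context.Background(),
		func(context.Context) error { return fmt.Errorf("peer unreachable") },
		resumed,
		func(reason string) { fmt.Println("failing transfer:", reason) },
	)
}
```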

jennijuju commented 2 years ago

@hannahhoward Thanks for the updates!

Though this doesn't sound like a lotus issue yet, since most of the fixes will be in go-graphsync and go-data-transfer and then get integrated into lotus through dependency updates. Would you mind if I transfer this to our lotus discussions, so the community can track the investigation progress there?