filecoin-project / lotus

Reference implementation of the Filecoin protocol, written in Go
https://lotus.filecoin.io/

Documenting Sequences that lead to zero bytes transfers #7783

Closed hannahhoward closed 2 years ago

hannahhoward commented 2 years ago

Several storage providers have identified that transfers can sometimes get stuck at "zero bytes" on their miners. Working with a test miner and Estuary, I have so far identified the following sequences that seem to produce these "zero byte transfers":

  1. Misapplied network error causes restarts out of sync
     a. A network error occurs on graphsync request A, triggering a restart from Estuary
     b. When a network error occurs, it often occurs multiple times -- and sometimes failed messages can take up to a minute to finish transmitting errors (see https://github.com/ipfs/go-graphsync/issues/314)
     c. The provider receives the restart and sends a new graphsync request B. Request B reaches Estuary and is queued
     d. Estuary receives the late network error from graphsync request A and mistakenly interprets it as an error from the second request -- this is a bug: see https://github.com/filecoin-project/go-data-transfer/issues/288
     e. Estuary triggers a second restart request
     f. The provider cancels the first request and triggers a new outgoing graphsync request C
     g. Estuary eventually begins processing the first graphsync request, sending data that most likely just gets dropped

Proposed solution: fix https://github.com/ipfs/go-graphsync/issues/314 and https://github.com/filecoin-project/go-data-transfer/issues/288
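For context, here is a minimal sketch (not the actual go-data-transfer code; `channelState`, `RequestID`, and `onNetworkError` are hypothetical names) of the kind of check https://github.com/filecoin-project/go-data-transfer/issues/288 calls for: a late error belonging to an already-superseded graphsync request should be dropped rather than applied to the request that replaced it.

```go
// Sketch only: discard late network errors from a superseded graphsync
// request instead of treating them as errors on the current request.
package main

import (
	"fmt"
	"sync"
)

type RequestID uint64

type channelState struct {
	mu               sync.Mutex
	currentRequestID RequestID // the graphsync request currently driving this transfer
}

// onNetworkError is called when graphsync reports a network error for a request.
// If the error belongs to an older, already-restarted request, it is ignored
// so it cannot trigger a spurious second restart.
func (c *channelState) onNetworkError(id RequestID, err error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if id != c.currentRequestID {
		fmt.Printf("ignoring late error from superseded request %d: %v\n", id, err)
		return
	}
	fmt.Printf("restarting transfer after error on request %d: %v\n", id, err)
	// ... trigger the restart, which supersedes the current request ...
	c.currentRequestID++
}

func main() {
	c := &channelState{currentRequestID: 2}            // request A (1) was already restarted as B (2)
	c.onNetworkError(1, fmt.Errorf("stream reset"))    // late error from A: dropped
	c.onNetworkError(2, fmt.Errorf("stream reset"))    // genuine error on B: restart
}
```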

  2. Never got to top of queue error
     a. The provider receives and accepts a data transfer request
     b. It queues the outgoing graphsync request
     c. Due to the simultaneous outgoing request limit (20 by default, I think), the graphsync request is not sent for several hours
     d. Before the request reaches the top of the queue, the markets node restarts
     e. Upon restart, the deal enters "storage provider await restart". Nothing else happens.
     f. Estuary will eventually hit the 24 hour accept timeout and cancel the deal -- the provider may or may not receive the timeout

  3. Failed while restarting
     a. A data transfer is in progress when the provider restarts, going offline
     b. The provider is offline long enough for Estuary to restart the transfer enough times that it decides to fail the transfer
     c. The provider comes back online and enters StorageProviderAwaitingRestart, but never hears anything else because the transfer has been cancelled on the Estuary side

Proposed solution: Currently the provider neither monitors for restarts nor attempts them; the thinking is that the client may not be reachable if it is not already connected. However, this has the side effect that the provider has no way to detect transfers that have essentially failed. A better approach would be to attempt restarts but not fail the transfer if they don't go through. Moreover, it makes sense for AwaitRestart to have a timeout.
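A rough sketch of that idea, assuming hypothetical names (`awaitRestart`, `tryRestart`, `failTransfer`, `resumed`, and the timeout value are not the real go-data-transfer API or defaults): the provider attempts a best-effort restart, does not fail if the attempt errors, and only fails the transfer if nothing is heard within a bounded window.

```go
// Sketch only: a bounded wait in an "awaiting restart" state. The provider
// attempts a restart but keeps waiting if the client is unreachable; the
// transfer fails only if no activity arrives before the timeout.
package main

import (
	"context"
	"fmt"
	"time"
)

const awaitRestartTimeout = 1 * time.Hour // assumption, not a real default

func awaitRestart(ctx context.Context, tryRestart func(context.Context) error,
	resumed <-chan struct{}, failTransfer func(reason string)) {

	// Best-effort restart: log and keep waiting if the client is offline.
	if err := tryRestart(ctx); err != nil {
		fmt.Printf("restart attempt failed (client may be offline): %v\n", err)
	}

	select {
	case <-resumed:
		fmt.Println("transfer resumed")
	case <-time.After(awaitRestartTimeout):
		failTransfer("no activity after restart within timeout")
	case <-ctx.Done():
	}
}

func main() {
	resumed := make(chan struct{})
	go func() { time.Sleep(100 * time.Millisecond); close(resumed) }()
	awaitRestart(context.Background(),
		func(context.Context) error { return fmt.Errorf("peer unreachable") },
		resumed,
		func(reason string) { fmt.Println("failing transfer:", reason) },
	)
}
```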

jennijuju commented 2 years ago

@hannahhoward Thanks for the updates!

Though this doesn't sound like a lotus issue yet, since most of the fixes will be in go-graphsync and go-data-transfer and then get integrated into lotus through dependency updates. Would you mind if I transfer this to our lotus discussions, so the community can track the investigation progress there?