application-research / autoretrieve

A server to make GraphSync data accessible on IPFS
22 stars 7 forks source link

Goroutine pool for filclient calls #115

Open rvagg opened 2 years ago

rvagg commented 2 years ago

Summary of current state

We currently have two layers of goroutines hiding under filecoin.Retriever#Request(cid.Cid) that's called for each CID coming in from Bitswap that we don't have.

If we have candidates from the indexer for that CID, we:

  1. Spawn a goroutine for filecoin.Retriever#retrieveFromBestCandidate()
  2. Within that execution we call filecoin.Retriever#queryCandidates() which in turns spawns an additional goroutine for each candidate so we can call Filclient#RetrievalQueryToPeer() for it.
  3. After we've finished the query phase, those query goroutines are ended and if we have one or more viable query responses we continue in the parent goroutine, calling Filclient#RetrieveContentFromPeerWithProgressCallback() in sequence for each candidate.

Among other things, these calls come back to Retriever#OnRetrievalEvent(). If it blocks because the EventManager is blocked, then that can hold up further Filclient retrieval events, which will block the parent Filclient calls, which in turn block our goroutines.

Our new async EventRecorder has a queuedReports that it slurps in from Retriever and will buffer in a slice. So if that's working well then the OnRetrievalEvent calls shouldn't block. But that just means we don't have any practical form of backpressure that applies through the system now.

As the number of CIDs with a positive result from the indexer increases, an Autoretrieve instance is going to have an equivalent increase in retrieval attempts—even if it's just failed query attempts, that's still costly. We're already hitting that with our instances with the added event reporting load (Ref: #112) but it'll get worse.

Ideal

Bitswap requests for CIDs should be rejected if we are currently "at capacity". Capacity can be defined as either actual maximum capacity of the instance (dictated by network, memory, and other resources), or by the owner in configuration options.

  1. EventRecorder should block when it's at some pre-defined capacity which can be configured, the default could be ~100 or more (assuming we're able to batch-report, otherwise it should probably be lower).
  2. Implement a retrieval event execution goroutine pool: the executors can either execute queries or full retrievals. Queries or retrievals are "actions" that can be queued, the executors pick off queued actions in order and runs them.
  3. Action queue depth can be configured, the default could be in the order of ~100 or more.
  4. Number of executors can be configured, the default could be in the order of ~20.
  5. Calls to filecoin.Retriever#Request(cid.Cid) where the action queue is full should return with an appropriate error. Adding retrieval actions to the queue should not block or return an error - i.e. the queue could grow beyond full, but retrieval actions are a normal follow-up to query actions so they should be allowed through. The queue max enforcement should only be applied to entirely new retrievals.
  6. The Bitswap provider should not continue to wait for the block if the call to Request() errored on a full queue (this is the current actions - if error from Request() return a dont_have and maybe log).