application-research / autoretrieve

A server to make GraphSync data accessible on IPFS

Optimizing Filecoin Retrieval TTFB #102

Open hannahhoward opened 2 years ago

hannahhoward commented 2 years ago

Currently, we use the following steps for retrieving data from Filecoin when we lack a CID in the local cache:

  1. Query the indexer/Estuary for candidate providers.
  2. Query every provider in the results in parallel, but wait for all of the query responses to return before continuing.
  3. Attempt retrievals sequentially, in the order given by a sorting function.
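
For reference, the current flow described above could be sketched roughly as follows. This is a simplified illustration, not the actual autoretrieve code: `Candidate`, `QueryResponse`, `QueryFunc`, `RetrieveFunc`, and `RetrieveCurrent` are hypothetical names, the candidate list stands in for the output of step 1, and sorting by price is only an assumed example of a sorting function.

```go
package main

import (
	"errors"
	"sort"
	"sync"
)

// Candidate is a hypothetical stand-in for a provider returned by the indexer/Estuary (step 1).
type Candidate struct {
	Provider string
}

// QueryResponse is a hypothetical stand-in for a provider's answer to a retrieval query.
type QueryResponse struct {
	Candidate Candidate
	Price     uint64 // assumed sort key; the real sorting function may differ
}

// QueryFunc and RetrieveFunc are injected so the sketch does not depend on any real client.
type QueryFunc func(c Candidate) (QueryResponse, error)
type RetrieveFunc func(q QueryResponse) error

// ErrAllFailed is returned when no query or retrieval attempt succeeds.
var ErrAllFailed = errors.New("all retrieval attempts failed")

// RetrieveCurrent mirrors steps 2 and 3: query all providers in parallel,
// wait for every response, sort, then retrieve one at a time.
func RetrieveCurrent(cands []Candidate, query QueryFunc, retrieve RetrieveFunc) (Candidate, error) {
	// Step 2: one goroutine per provider; each writes only to its own slot.
	results := make([]*QueryResponse, len(cands))
	var wg sync.WaitGroup
	for i, c := range cands {
		wg.Add(1)
		go func(i int, c Candidate) {
			defer wg.Done()
			if resp, err := query(c); err == nil {
				results[i] = &resp
			}
		}(i, c)
	}
	// Nothing proceeds until the slowest query returns; this wait is what hurts TTFB.
	wg.Wait()

	// Keep only the providers whose query succeeded.
	var ok []QueryResponse
	for _, r := range results {
		if r != nil {
			ok = append(ok, *r)
		}
	}

	// Step 3: sort, then attempt retrievals sequentially until one succeeds.
	sort.SliceStable(ok, func(a, b int) bool { return ok[a].Price < ok[b].Price })
	for _, q := range ok {
		if err := retrieve(q); err == nil {
			return q.Candidate, nil
		}
	}
	return Candidate{}, ErrAllFailed
}
```

In this shape, the time to first byte is bounded below by the slowest query response plus any failed retrieval attempts that come earlier in the sort order.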

There are a couple ways we can optimize this:

One other thing to factor in is how we want to abstract the additional data returned by the indexer that doesn't come from Estuary (I think). Honestly, we should think about this problem in general, since, for example, Estuary can have a different "Root CID" while the index is always the same.

elijaharita commented 2 years ago

we could add a field to RetrievalCandidate, something like PossiblyFree bool. since RetrievalCandidate is returned by the indexer impl, it would be no issue to write endpoint-specific behavior. if estuary isn't able to provide the info, it would be as simple as having the estuary endpoint always set PossiblyFree to false. the indexer endpoint impl would be able to set it properly.
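
a rough sketch of what that could look like. everything here is simplified and hypothetical: the real RetrievalCandidate carries peer and CID types rather than strings, and the `Endpoint` interface, `IndexerRecord`, and its `Free` hint are assumptions made up for illustration, not existing APIs.

```go
package main

// RetrievalCandidate is a simplified, hypothetical version of the struct
// discussed above, with the proposed PossiblyFree field added.
type RetrievalCandidate struct {
	Provider     string
	RootCid      string
	PossiblyFree bool // proposed: true if the provider may serve the data for free
}

// Endpoint is an assumed interface for "something that returns candidates".
type Endpoint interface {
	FindCandidates(cid string) []RetrievalCandidate
}

// EstuaryEndpoint cannot tell whether a retrieval is free,
// so it always reports PossiblyFree as false.
type EstuaryEndpoint struct {
	Providers []string
}

func (e EstuaryEndpoint) FindCandidates(cid string) []RetrievalCandidate {
	out := make([]RetrievalCandidate, 0, len(e.Providers))
	for _, p := range e.Providers {
		out = append(out, RetrievalCandidate{Provider: p, RootCid: cid, PossiblyFree: false})
	}
	return out
}

// IndexerRecord is a hypothetical indexer result carrying a "free" hint.
type IndexerRecord struct {
	Provider string
	Free     bool
}

// IndexerEndpoint sets PossiblyFree from whatever the indexer reports.
type IndexerEndpoint struct {
	Records []IndexerRecord
}

func (e IndexerEndpoint) FindCandidates(cid string) []RetrievalCandidate {
	out := make([]RetrievalCandidate, 0, len(e.Records))
	for _, r := range e.Records {
		out = append(out, RetrievalCandidate{Provider: r.Provider, RootCid: cid, PossiblyFree: r.Free})
	}
	return out
}
```

the caller only ever sees RetrievalCandidate, so the endpoint-specific logic stays inside each endpoint impl.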

we could immediately just attempt retrievals on all of the PossiblyFree == true candidates with pre-assumed retrieval params, and only if all of those fail, fall back to query + retrieval like what's currently done.
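
a minimal sketch of that free-first strategy, under a few assumptions: `RetrieveFreeFirst`, `AttemptFunc`, and `FallbackFunc` are hypothetical names, the free attempts are run concurrently (they could also be run one at a time), and cancellation of the losing attempts is left out to keep it short.

```go
package main

import "errors"

// RetrievalCandidate is a simplified, hypothetical candidate with the proposed field.
type RetrievalCandidate struct {
	Provider     string
	PossiblyFree bool
}

// AttemptFunc tries a retrieval with pre-assumed (free) params, skipping the query step.
type AttemptFunc func(c RetrievalCandidate) error

// FallbackFunc is the existing query + sort + sequential retrieval path.
type FallbackFunc func(cands []RetrievalCandidate) (RetrievalCandidate, error)

// ErrNoCandidates is returned when there is nothing to try.
var ErrNoCandidates = errors.New("no candidates")

// RetrieveFreeFirst attempts all PossiblyFree candidates right away and only
// falls back to the query-based path if every one of them fails.
func RetrieveFreeFirst(cands []RetrievalCandidate, attemptFree AttemptFunc, fallback FallbackFunc) (RetrievalCandidate, error) {
	if len(cands) == 0 {
		return RetrievalCandidate{}, ErrNoCandidates
	}

	var free []RetrievalCandidate
	for _, c := range cands {
		if c.PossiblyFree {
			free = append(free, c)
		}
	}

	type result struct {
		cand RetrievalCandidate
		err  error
	}
	// Buffered so attempts that finish after a winner is found never block.
	results := make(chan result, len(free))
	for _, c := range free {
		go func(c RetrievalCandidate) {
			results <- result{c, attemptFree(c)}
		}(c)
	}

	// Return on the first success; keep draining only while attempts keep failing.
	for range free {
		if r := <-results; r.err == nil {
			return r.cand, nil
		}
	}

	// Every free attempt failed (or there were none): do what is currently done.
	return fallback(cands)
}
```

in the happy path this skips the query round trip entirely; the cost is some wasted attempts against providers that turn out not to be free.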