filecoin-project / lassie

A minimal universal retrieval client library for IPFS and Filecoin
Other
106 stars 18 forks source link

Include per-provider bitswap interactions in response timing headers #348

Open willscott opened 1 year ago

willscott commented 1 year ago

With https://github.com/filecoin-project/lassie/pull/332 we have response headers on a per-peer basis. We should make sure these also include attempts / observed bitswap peers as we get visibility into those connections.

In particular, if we get back bitswap peers from IPNI, but fail to connect to them, the response headers should indicate that / which peers couldn't be contacted / could-be-contacted-but-didn't-serve-data

rvagg commented 1 year ago

How useful is this kind of information if we only get back data on peers up until the point at which someone starts serving us data and therefore we write data out and spit out the headers in their current state? It's not going to be very representative, or are we just looking for scraps of information about peers over a large enough number of calls to make some kind of inference?

Event recording or other instrumentation would be a more reliable mechanism for this wouldn't it?

willscott commented 1 year ago

it will still make it visible when a peer breaks / becomes unavailable, because we'll start to see a bunch of connection attempts that don't resolve until the overall request either errors or completes with someone else.

hannahhoward commented 1 year ago

@willscott you want to note when the actual dial attempt for a peer in Bitswap doesn’t work? I’m not sure that’s currently accessible from within Bitswap, which handles the actual dialing. I guess we could attempt to dial ourselves also and record the result -- if Bitswap already dialed, the connect attempt would happen instantaneously, same the other way if our dial succeeded first.

hannahhoward commented 1 year ago

@rvagg assuming this lands on you, you could put dial attempts in https://github.com/filecoin-project/lassie/blob/main/pkg/retriever/bitswaphelpers/indexerrouting.go and then dispatch events accordingly.

hannahhoward commented 1 year ago

https://github.com/filecoin-project/lassie/pull/345 adds something like this, but I think is not what you all care about -- it enables tracking of data received by peer including through bitswap.

willscott commented 1 year ago

that seems plausible - if slightly inefficient.

Two goals:

  1. in a failure case where saturn needs to render an error page, we want to say:

    • was found on providers <a, b, c>
    • Providers <a, b> could not be connected to
    • Provider was connected to but didn't provide the data.
  2. as a trailer that is just consumed on the saturn l1, there's a desire to know how upstream data was obtained / from which peer IDs.

hannahhoward commented 1 year ago

when we say failure case, are we talking about only non-200 responses for Lassie or also premature stream close errors?

I assume for non-200 responses, we can say that those we could not get data from are simply the diff between providers found and providers we couldn't connect to

For premature stream close, we received some data from some peers presumably -- but eventually found a block where no data was returned from anyone. Should we simply say we failed with all peers? Or maybe do we need the CID we failed on?

hannahhoward commented 1 year ago

I believe #345 is ultimately a blocker here, which in turn is blocked on https://github.com/ipfs/boxo/pull/308

rvagg commented 11 months ago

@hannahhoward @willscott thoughts on the status of this? We could separate it out into two separate questions—the continued use of server-timings and whether it's really the best tool to achieve the goals here, and the actual collecting of dial attempts within lassie.

We could do the latter separately to the former, maybe even record all the failed dial attempts as failed retrieval attempts in our database; although we're yet to see what kind of volume this information requires (I'm concerned it might be way too much information even just storing the successful block retrievals we've enabled with v0.18.0).

It might also be the case that pre-dialing the peers before handing off to bitswap could get us past some of the problems we have with bitswap that are currently requiring us to pin to a particular boxo commit.