filecoin-saturn / js-client

A simple request–response client for Saturn, written in JavaScript

Request hedging #52

Open hannahhoward opened 10 months ago

hannahhoward commented 10 months ago

What

Currently our JS client offers two kinds of smart fetches:

  1. We fetch through DNS, and then if the request completely fails (after 5 seconds) we retry as a fallback using a node returned from the orchestrator.

  2. We immediately race both DNS AND multiple nodes returned from the orchestrator and take the request that returns a first byte (

We propose a third "request hedging" approach:

  1. Initiate a request with DNS
  2. If a time equal to Saturn's P90 TTFB passes without receiving a first byte, start a second request to an orchestrator node, and take whichever returns a first byte first (canceling the other)
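
A minimal sketch of the two steps above, not the client's actual implementation. `hedgedRequest`, `primary`, `secondary`, and `hedgeDelayMs` are hypothetical names: each request is modeled as a function that takes an `AbortSignal` and returns a promise that resolves once its first byte has arrived. The sketch also assumes that if the DNS request fails outright before the timer fires, the hedge should start immediately, which the proposal does not specify.

```javascript
// Hypothetical helper, not part of the Saturn client API.
// `primary` and `secondary` take an AbortSignal and return a promise that
// resolves when the first byte has been received.
function hedgedRequest(primary, secondary, hedgeDelayMs) {
  const primaryCtl = new AbortController();
  const secondaryCtl = new AbortController();

  return new Promise((resolve, reject) => {
    let settled = false;
    let hedgeStarted = false;
    let pending = 1; // only the primary is in flight at the start

    // The first request to produce a byte wins; the other one is canceled.
    const win = (source, loserCtl) => (value) => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      loserCtl.abort();
      resolve({ source, value });
    };

    const lose = (err) => {
      pending -= 1;
      if (settled) return;
      if (!hedgeStarted) {
        // Assumption: a primary failure before the timer starts the hedge now.
        clearTimeout(timer);
        startHedge();
      } else if (pending === 0) {
        // Both requests failed, so report the most recent error.
        settled = true;
        reject(err);
      }
    };

    const startHedge = () => {
      if (settled || hedgeStarted) return;
      hedgeStarted = true;
      pending += 1;
      secondary(secondaryCtl.signal).then(win('secondary', primaryCtl), lose);
    };

    // Step 2: start the second request only if no first byte has arrived
    // within the hedge delay (for example, the P90 TTFB).
    const timer = setTimeout(startHedge, hedgeDelayMs);

    // Step 1: initiate the request through DNS.
    primary(primaryCtl.signal).then(win('primary', secondaryCtl), lose);
  });
}
```

With `fetch`, the two functions could look like `(signal) => fetch(url, { signal })`. Note that a `fetch` promise resolves when the response headers arrive, so this only approximates "first byte" of the body.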

Why

Our first approach gives a poor experience -- while the fallback prevents a complete failure, it only kicks in after such a long time (5 seconds) that the user still has a terrible experience.

We have found that our second approach generates a high amount of duplicate traffic -- we've even overloaded the log ingestor a couple of times this way.

The hedging approach aims to improve the fallback experience of our first approach without incurring the duplicate-traffic problems of the second approach.
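
As a rough back-of-the-envelope illustration of the traffic difference, the sketch below compares the expected number of upstream requests per user fetch. It assumes the DNS request's TTFB follows the same distribution the P90 was measured on (so about 10% of fetches reach the hedge delay) and ignores outright failures. The function names and the node count of 3 are hypothetical.

```javascript
// Racing always sends the DNS request plus one request per orchestrator node.
function racedRequests(nodeCount) {
  return 1 + nodeCount;
}

// Hedging always sends the DNS request, plus a second request only for the
// fraction of fetches that are still waiting at the hedge percentile.
function hedgedRequests(hedgePercentile) {
  return 1 + (1 - hedgePercentile);
}

console.log(racedRequests(3));               // requests per fetch when racing 3 nodes
console.log(hedgedRequests(0.9).toFixed(2)); // requests per fetch when hedging at P90
```

Under these assumptions, hedging at P90 adds roughly one extra request per ten fetches, while racing adds one extra request per node on every fetch.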

Cost

To get this done we would need to:

reidlw commented 10 months ago

I'm going to push that, on something like this, we shouldn't start on it until we have defined and implemented the metric(s) which tell us if there is a problem, how bad it is, and whether what we do fixes it.

If there's a reasonable hypothesis that this is, in fact, a problem with the service we should prioritize work in, then maybe we start this work with a task to get the data to build the case.

At the end of the day this feature (likely) improves tail TTFBs. Do we think that's a high priority investment area right now?

prodalex commented 10 months ago

I understand this was suggested as a countermeasure from a post-mortem on node operator error rates. Can we therefore clarify how much this proposed item addresses reliability and production improvement vs tail performance? I see that the following production problem is being addressed: "duplicate traffic -- we've even overloaded the log ingestor a couple times this way." The best way to evaluate this would be to assume 30 customers using the service worker (for example if we open the portal in the near-term). Will this cause more production problems and overload the log ingestor even more?