Open hannahhoward opened 10 months ago
I'm going to push the point that, on something like this, we shouldn't start until we have defined and implemented the metric(s) that tell us whether there is a problem, how bad it is, and whether what we do fixes it.
If there's a reasonable hypothesis that this is, in fact, a problem with the service that we should prioritize, then maybe we start this work with a task to get the data to build the case.
At the end of the day this feature (likely) improves tail TTFBs. Do we think that's a high priority investment area right now?
I understand this was suggested as a countermeasure from a post-mortem on node operator error rates. Can we therefore clarify how much this proposed item addresses reliability and production improvement vs tail performance? I see that the following production problem is being addressed: "duplicate traffic -- we've even overloaded the log ingestor a couple times this way." The best way to evaluate this would be to assume 30 customers using the service worker (for example, if we open the portal in the near term). Will this cause more production problems and overload the log ingestor even more?
What
Currently our JS client offers two kinds of smart fetches:
We fetch through DNS, and then if the request completely fails (after 5 seconds) we retry as a fallback using a node returned from the orchestrator.
We immediately race both DNS AND multiple nodes returned from the orchestrator and take the request that returns a first byte (
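For reference, the two existing strategies can be sketched roughly as follows. This is a minimal illustration, not the client's actual code: `withTimeout`, `fetchWithFallback`, and `fetchByRacing` are hypothetical names, and each fetcher is assumed to be a zero-argument function returning a promise (for example, a wrapper around `fetch`). "First byte" is approximated here by the first promise to fulfil.

```javascript
// Hypothetical helper: rejects if `promise` has not settled within `ms`.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Strategy 1: try DNS first; fall back to an orchestrator node only after
// the DNS request fails or the timeout (5 seconds in the text) expires.
async function fetchWithFallback(fetchViaDns, fetchViaNode, timeoutMs = 5000) {
  try {
    return await withTimeout(fetchViaDns(), timeoutMs);
  } catch (err) {
    return fetchViaNode();
  }
}

// Strategy 2: start every candidate immediately and take the first success.
// Every candidate generates traffic, even when the first one would have won.
function fetchByRacing(fetchers) {
  return Promise.any(fetchers.map((startFetch) => startFetch()));
}
```

The sketch makes the trade-off visible: the first strategy sends one request in the common case but waits out the full timeout before recovering, while the second recovers immediately but always sends every request.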
We propose a third "request hedging" approach:
Why
Our first approach gives a poor user experience -- while the fallback prevents a complete failure, it only kicks in after such a long delay (i.e. 5 seconds) that the experience is still terrible for the user.
We have found that our second approach generates a high amount of duplicate traffic -- we've even overloaded the log ingestor a couple times this way.
This approach essentially aims to improve the fallback experience of our first approach without incurring the problems associated with the second approach.
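As an illustration of the general request-hedging pattern (not necessarily the exact design proposed here), the sketch below starts the primary request and only launches a backup if the primary has not succeeded within a short hedge delay, or fails sooner. The name `hedgedFetch`, the fetcher arguments, and the 500 ms default are all assumptions made for the example.

```javascript
// Sketch of request hedging, under the assumptions stated above.
// `primary` and `backup` are zero-argument functions returning promises.
function hedgedFetch(primary, backup, hedgeDelayMs = 500) {
  return new Promise((resolve, reject) => {
    let pending = 1; // attempts currently in flight
    let backupStarted = false;
    let settled = false;
    const errors = [];

    const startBackup = () => {
      if (backupStarted || settled) return;
      backupStarted = true;
      clearTimeout(timer);
      pending += 1;
      run(backup);
    };

    const run = (attempt) => {
      Promise.resolve().then(attempt).then(
        (value) => {
          if (settled) return; // a faster attempt already won
          settled = true;
          clearTimeout(timer);
          resolve(value);
        },
        (err) => {
          errors.push(err);
          pending -= 1;
          if (settled) return;
          if (!backupStarted) {
            startBackup(); // primary failed early: hedge right away
          } else if (pending === 0) {
            settled = true;
            reject(new AggregateError(errors, "all attempts failed"));
          }
        }
      );
    };

    // The backup is sent only if the primary is still outstanding at this point.
    const timer = setTimeout(startBackup, hedgeDelayMs);
    run(primary);
  });
}
```

With this shape, a healthy primary produces exactly one request, so duplicate traffic is limited to requests that are slower than the hedge delay. Choosing that delay (for example, from an observed latency percentile) would depend on the metrics discussed in the comments above.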
Cost
To get this done we would need to: