lidel closed this issue 1 year ago
For your breakdown:
The mitigations you suggest for Caboose behavior make sense as well. I'll break those into separate issues.
Filed the final suggestions as #43 and #44
I think #43 probably makes more sense once we start fetching CAR files rather than blocks.
About code 0, i.e. timeouts: how can one differentiate between a provider lookup timeout and a retrieval timeout? @lidel do you have any suggestions?
@masih from the perspective of the end user's HTTP client talking to ipfs.io, both are HTTP 504 (Gateway Timeout).
If we want to bubble up the reason to the end user, then L1 and Caboose should pass the reason in the error response body.
bifrost-gateway returns 504 with the wrapped error message in the text/plain response body:
Closing as we changed a lot and are now working on the CAR + Block Fetch backend, which needs fine-tuning first:
Will file new issues for this.
Problem
iiuc Rhea produces ~130% of expected HTTP errors when compared with the old setup (snapshot from bifrost-gw-staging-metrics, see Mean values):
Hypothesis
Look at Mean error rates from Caboose. A success rate of 25%-30% feels painfully inefficient, and may explain end user errors.
Questions / Ideas
cc @willscott @aarshkshah1992 for sanity check / ideas how to mitigate