filecoin-saturn / caboose

A blockstore for distributing load

~70% of block requests fail due to strn L1 routing error or timeout #42

Closed lidel closed 1 year ago

lidel commented 1 year ago

Problem

IIUC, Rhea produces ~130% of the expected HTTP errors when compared with the old setup (snapshot from bifrost-gw-staging-metrics, see the Mean values):

Screenshot 2023-02-23 at 19-56-15 bifrost-gw staging metrics - Bifrost - Dashboards - Grafana

Hypothesis

Look at the Mean error rates from Caboose. A success rate of 25%-30% feels painfully inefficient, and may explain end-user errors.

2023-02-23-194140_3440x1440_scrot

Questions / Ideas

cc @willscott @aarshkshah1992 for sanity check / ideas how to mitigate

willscott commented 1 year ago

For your breakdown:

The mitigations you suggest for Caboose behavior make sense as well. I'll break those into separate issues.

willscott commented 1 year ago

Filed the final suggestions as #43 and #44

I think #43 probably makes more sense once we start fetching CAR files rather than blocks.

masih commented 1 year ago

About code 0, i.e. timeouts: how can one differentiate between a provider lookup timeout and a retrieval timeout? @lidel, do you have any suggestions?

lidel commented 1 year ago

@masih from the perspective of an end user's HTTP client talking to ipfs.io, both are HTTP 504 (Gateway Timeout).

If we want to bubble up the reason to the end user, then L1 and Caboose should pass the reason in the error response body.
bifrost-gateway returns 504 with the wrapped error message in the text/plain response body:

https://github.com/ipfs/bifrost-gateway/blob/c305b3ba95dc13b06392975ab2bbb9b475e319a7/blockstore.go#L70-L87

lidel commented 1 year ago

Closing, as we changed a lot and are now working on the CAR + Block Fetch backend, which needs fine-tuning first:

Screenshot 2023-04-05 at 00-36-34 bifrost-gw staging metrics - Project Rhea - Dashboards - Grafana

Will file new issues for this.