celestiaorg / celestia-node

Celestia Data Availability Nodes
Apache License 2.0

feat(shrex): client side throttling #2186

Open walldiss opened 1 year ago

walldiss commented 1 year ago

Implementation ideas

Currently, for historical sampling, the shrex client uses peers found by the discovery module. By default, discovery finds 5 peers, picked at random from the nodes on the network that advertise themselves as full nodes. Historical sampling is controlled by the DASer, which tries to use the node's full bandwidth by running multiple parallel workers. Some of the discovered peers may be slow or lazy (they do not respond with data and simply hang until the timeout). The higher the proportion of bad peers, the more load is put on the good ones. This can lead to very high load from a single client on a good full node and can trigger rate limiting on the server side.

One thing that could help with shrex stability is client-side throttling, which would limit the number of parallel requests sent to the same peer. Such a safety feature could be implemented inside the peer-manager.
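To make the idea concrete, here is a minimal sketch of a per-peer in-flight limit. It is an illustration only and not existing celestia-node code: `peerThrottle`, `acquire`, `ErrPeerBusy`, and the `limit` value are all hypothetical names, and peers are identified by plain strings rather than libp2p peer IDs.

```go
package main

import (
	"errors"
	"sync"
)

// ErrPeerBusy is a hypothetical error returned when a peer already has the
// maximum number of in-flight requests from this client.
var ErrPeerBusy = errors.New("peer busy: too many in-flight requests")

// peerThrottle is a hypothetical helper (not part of celestia-node) that
// caps the number of parallel requests this client sends to any one peer.
type peerThrottle struct {
	mu       sync.Mutex
	limit    int            // max parallel requests per peer (assumed config value)
	inFlight map[string]int // peer ID -> requests currently in flight
}

func newPeerThrottle(limit int) *peerThrottle {
	return &peerThrottle{limit: limit, inFlight: make(map[string]int)}
}

// acquire reserves a request slot for the peer. If the peer is already at
// the limit it returns ErrPeerBusy, so the caller can pick another peer
// or back off instead of piling more requests onto the same one.
func (t *peerThrottle) acquire(peer string) (release func(), err error) {
	t.mu.Lock()
	defer t.mu.Unlock()

	if t.inFlight[peer] >= t.limit {
		return nil, ErrPeerBusy
	}
	t.inFlight[peer]++

	// release is safe to call more than once; only the first call counts.
	var once sync.Once
	return func() {
		once.Do(func() {
			t.mu.Lock()
			defer t.mu.Unlock()
			t.inFlight[peer]--
			if t.inFlight[peer] == 0 {
				delete(t.inFlight, peer)
			}
		})
	}, nil
}
```

A worker would call `acquire` before sending a request and `release` once the response arrives or the request times out. Returning an error rather than blocking is a design choice in this sketch: it lets the DASer workers move on to another peer instead of queueing behind a slow one.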

renaynay commented 1 year ago

Some of found peers could be slow or lazy (do not respond with data and just hang until timeout). The higher the proportion of bad peers, more load is put on good ones.

I think the proper solution here is to add a scoring mechanism similar to the one in peerTracker (hopefully these mechanisms can be DRY'd at some point), where we score peers from the full node pool, eventually GC the shitty ones, and trigger disc to find better peers.
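Roughly what I have in mind, as a sketch only. This is not how peerTracker actually scores peers; the success-ratio score, `scoredPool`, `record`, `gc`, and the `minScore` / `minSamples` thresholds are all made up for illustration.

```go
package main

import (
	"sort"
	"sync"
)

// peerStats counts request outcomes for one peer.
type peerStats struct {
	successes int
	failures  int
}

// scoredPool is a hypothetical pool of full-node peers that tracks a simple
// success ratio per peer. The scoring rule is an assumption for this sketch.
type scoredPool struct {
	mu         sync.Mutex
	stats      map[string]*peerStats
	minScore   float64 // peers scoring below this are garbage collected
	minSamples int     // do not judge a peer on fewer requests than this
}

func newScoredPool(minScore float64, minSamples int) *scoredPool {
	return &scoredPool{
		stats:      make(map[string]*peerStats),
		minScore:   minScore,
		minSamples: minSamples,
	}
}

// record notes the outcome of one request (ok=false covers timeouts).
func (p *scoredPool) record(peer string, ok bool) {
	p.mu.Lock()
	defer p.mu.Unlock()

	s, found := p.stats[peer]
	if !found {
		s = &peerStats{}
		p.stats[peer] = s
	}
	if ok {
		s.successes++
	} else {
		s.failures++
	}
}

// gc removes peers whose success ratio is below minScore and returns their
// IDs in sorted order. The caller would then ask discovery for that many
// replacement peers.
func (p *scoredPool) gc() []string {
	p.mu.Lock()
	defer p.mu.Unlock()

	var removed []string
	for peer, s := range p.stats {
		total := s.successes + s.failures
		if total < p.minSamples {
			continue // too little data to judge this peer yet
		}
		if float64(s.successes)/float64(total) < p.minScore {
			removed = append(removed, peer)
			delete(p.stats, peer)
		}
	}
	sort.Strings(removed)
	return removed
}
```

The `minSamples` guard is there so a peer is not dropped after a single unlucky timeout.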

Client-side throttling is basically a band-aid over the issue that there's not enough good hygiene around peers inside of peerman.

WDYT?

walldiss commented 1 year ago

You are right, peer scoring would help minimise the risk of the problem, but it could still happen, with a probability that depends on config values and on the luck of randomly finding a decent number of good peers right away.

I forgot to mention one more point:

Client-side throttling is not a band-aid; it is a safety feature that provides graceful feedback and reduces stress on good peers in non-ideal situations. It is the same reason we have a rate limit on the server side.