Thundering herd with new content entering the network...

pipermerriam commented 11 months ago

What is wrong

https://github.com/ethereum/trin/pull/1056

IIUC as new content enters the network, we get a bit of a "Thundering Herd Problem" where a node will receive many OFFER messages for a piece of content, and since they are processed concurrently, a node may end up accepting many of them at the same time.

This is wastful of both local machine resources as well as the remote machines resources.

How can it be fixed

How wrong is it?

I propose that someone get some metrics on the magnitude of this problem. One methodology to measure this would be simply quantifying how many redundant transfers happen to get numbers on total wasted bandwidth.

Do Nothing

If we cannot determine that this is causing quality of service problems in our network today, we probably shouldn't spend much time trying to fix it as there are higher priority issues.

Idea A: Naively Limit to only one inbound transfer. (BAD)

A naive approach to solving this would be too implement a sort of LOCK that only allowed for one inbound OFFER/ACCEPT for a given piece of content. Subsequent OFFER messages would be rejected by the client if an inbound message is underway.

This is exploitable by a malicious client sending OFFER messages, intentionally slowing down the transfer rate, and either terminating the transfer before finishing it, or sending invalid data. The receiving client would have already rejected the other legitimate offer messages and would lose out on the content.

Idea B: Concurrency Limits (MEDIOCRE)

Similar to Idea A, but instead of a LOCK we use a SEMAPHORE. Rather than limiting to just one (1) inbound transfer, we limit it to N inbound transfers. This somewhat suffers from the same problem as the LOCK based solution, though it might be more robust against attacks, but I suspect that it would be very minimally better. It would still be reasonably easy for an attacker to fill up this concurrency limit with malicious OFFERs.

Idea C: Queueing

In this model we implement either a LOCK or a SEMAPHORE approach to limit the concurrency of inbound OFFER transfers, however, as duplicates arrive, we keep track of who offered us content and REJECT the duplicate OFFERs. If the current transfer fails, we use FIND_CONTENT to request the content from one of the other nodes that OFFER'd us the content since they "should" have it.

Idea D: Intelligent Swarming

This idea is not fully formed

The idea would be to introduce some kind of intelligent swarming behavior. The current swarming behavior produces a thundering herd since there's not mechanism to try and prevent many concurrent offers from happening at the same time.

Maybe there is a solution where we have the clients introduce an artificial delay before sending another node an OFFER message. Suppose we used the distance(node_a, node_b) to generate a random but weighted distribution resulting in:

delay of 0-10 seconds
nodes very close together would tend towards 10 seconds
nodes very far apart would tend towards 0 seconds

The thinking here is:

If I am very close to another node, there is a good chance that they have already gotten the content from someone else in the neighborhood. I will still offer it to them, but I will do so more often at a delay.
If two nodes are very far apart we want to spread the content quickly so that they in-turn can get the content spread around their local part of the network.

pipermerriam commented 11 months ago

Idea D should really be split into a dumb and sophisticated version. The "dumb" version would simply be to introduce a simple and completely random delay in the gossip logic. The written one above is a more sophisticated approach that may-or-may-not actually be a good idea or even better than the dumb random approach.

No matter what approach we pick, we'll want to make sure we have the tooling/instrumentation in clients prior to implementation so that we can actually measure whether we fixed the problem.

morph-dev commented 11 months ago

In the idea C (Queueing), is it safe to assume that node that offered it should have it as well? What if some node decides to contribute to the network as "distributor", but doesn't actually stores much data long term? Why not just query the network (standard way) for the key (but sure, we can start from the one that offered it)?

How about Idea E: Reputation Based Having some reputation system, where we would still process OFFERS in parallel, but accept new OFFER only if it is coming from a more reputable node. My feeling is that some basic reputation system would be enough for this use case.

pipermerriam commented 11 months ago

Reputation in portal network is difficult because I believe there will be minimal opportunity to build it as I don't think there will be a lot of repeat interaction outside of your closest neighbors. So maybe we can build a little bit of a reputation system for nearby nodes, but it will only be a positive indicator and cannot likely be used for negative attribution or negative filtering purposes.

Worth noting that this is derived from my mental model of the network and not actual real-world network activity and should probably be validated at some point.

So maybe yes for our nearest peers... though I'm not sure this is the right solution since the real goal here is to try and find something as close to the optimal solution of only processing one OFFER/ACCEPT pair for any single piece of content so-as-to not waste bandwidth.

pipermerriam commented 11 months ago

As for your question about "distribution only" nodes... yes, this could be a small problem, but I don't think it is actually significant.

In my mental model, the "bridge" nodes are acting this way, and thus, unless there are a LOT of bridge nodes, I think it's unlikely that it is the bridges that are actually the thundering herd. I suspect it's actually the re-GOSSIP that produces the thundering herd since that is where you would get the peak traffic of nodes OFFER-ing content, once it has been injected, but before it has been effectively spread to all interested nodes.

If it does turn out that the bridges end up being part of the "thundering herd" problem then we can try to engineer a solution around that... I have some ideas... I think it's solveable... but I'd want to see this actually measured in the live network so that we know we're fixing the right problems.

morph-dev commented 11 months ago

Maybe a bit off-topic, but I have some idea for the state network, not sure if good or bad, would have to write details of it and get feedback from the team.

Idea is basically this: Let's say there was up an update to a leaf node in MPT. Bridge would gossip just that leaf to the network (together with the proof). Nodes that accept it would just store the node itself (not the proof), and would gossip the same leaf and the leaf's parent (and the ones that accept leaf's parent would gossip leaf's parent and grandparent, etc...).

If this would be the case, then it wouldn't be only the bridges that would be "distributor", but potentially every node.

As I said, this is something that crossed my mind, not sure if it is good idea or bad. Bridge itself can gossip all MPT nodes separately, so there is no need to this approach. However, I can see how similar idea might be used in some other context.

kdeme commented 11 months ago

I think it is probably not the bridge as first step that would hit this. Unless the transfers to n nodes are not occurring concurrently in Trin bridges. Basically, if a node can receive + validate + initiate resend faster than the bridge can initiate all its transfers.

Why not just query the network (standard way) for the key (but sure, we can start from the one that offered it)?

Yes, this is an option, and less complex as you don't need to keep track of the offering nodes. Comes with the downside that you will have to do some hops in the network.

I was planning on implementing a version of option C, perhaps the easier one that you suggest.

I propose that someone get some metrics on the magnitude of this problem.

I agree, I've mostly (only?) spotted it on local testnet testing, but it is more likely to occur there due to ~0 latency. Depends basically on the n nodes in NHGossip, latency and the size of the content. In fluffy we cannot actually log this currently in history network, but I've seen it on our beacon network as there the validation will fail.

KolbyML commented 11 months ago

In our trin call on monday we discussed that we could quickly implement a solution where We cap total in bound uTP connections to X, then do something similar to Idea B.

This can be implemented in minimal time to prevent short term issues and a better complex can be implemented once needed.

Also this prevents Thundering herd happening from good actors which is our current main priority. If a good actor spins up 100 trin nodes they will all experience and cause Thundering herd. Another solution could be initially starting with a lower radius then 100. As a radus of 100 makes this very prominent till the radius lowers to a reasonable level.

pipermerriam commented 7 months ago

We came up with a new solution to this yesterday.

First, some background.

Attack Surface with Queueing solutions and active retrieval.

The basic attack is that a malicious node qickly sends an OFFER for new content faster than other nodes. The client receives this offer first, and then receives some number of genuine OFFER messages later. They accept the first malicious one and start a UTP transfer.

The malicious party can control the rate of transfer to ensure that all queued genuine OFFER messages timeout.

The malicious party can fail to send the final bytes of the data stream which is indistinguishable from honest client behavior in unstable network conditions.

The UTP transfer eventually fails.

If the client logic implements a fallback to using FINDCONTENT to retrieve the content, we must be extra careful not to introduce an amplification attack. The FINDCONTENT requests should only be sent to nodes that sent an OFFER, otherwise, it would be possible for a malicious node to send bogus OFFERs which would then trigger FINDCONTENT requests to unsuspecting nodes which is an amplification attack.

State Network Specific Problems

The second order problem here comes into play with state network having a different payload for OFFER and FINDCONTENT. In the context of the state network, a client cannot simply send a FINDCONTENT request to one of the nodes that previously offered the content, because the response would not be canonically anchored.

The only viable way that I can think of to provide that anchoring would be for the requesting node to build the anchoring proof themselves... however this is an amplification attack since it would mean requesting data from other nodes that weren't part of the initial set of OFFER messages.

I don't currently know how to solve this problem

The New Solution

The solution tries to fix this on the sending side via introduction of artifical delays in sending an offer message.

We want content that is being offered from far away to arrive fast since we want to quickly get content to the part of the network where it belongs.
We are ok with content being offered from very nearby to be slower since that means the content is already near it's final destination.

Thus, we want a distribution that is something like this.

Screenshot from 2024-04-04 12-21-46

When a node is going to OFFER content to another node, they randomly generate a delay, probably between 0-5 seconds that is weighted based on this graph.

If that offer is to a node that is very far away, they will very likely roll very close to 0, and send the offer very soon.
If that offer is to a node that is very close to them, they will very likely roll higher number, at which point they sleep for that duration and then send the offer.

If my mental model for this is correct, this should have the desired effect of still having gossip happen very quickly, but would reduce the likelihood of having many concurrent offers for new content come in at the same time. By the time the many offers come in, the client should already have received the content and would simply reject the late offers.

This also has the nice benefit of having some late offers that will roll in, which could allow a node to accept them if one of the previous offers/accept/UTP stream failed for any reason.

ethereum / portal-network-specs