threshold handling during message processing?

mortelli commented 5 years ago

Hello all,

After discussion with @holisticode, @vojtechsimetka, @diegomasini and others I thought it would be a good idea to document the issue of SWAP thresholds (see #1440) and open up the discussion to everyone.

I'm attaching a flowchart which lays out the flow for message processing considering both the payment and disconnect thresholds, as we understand and are building them today.

Diagram

thresholds-SEND MESSAGE

Notes

The diagram is for sending messages, but the process for receiving is analogous.
- Instead of peer.Send, peer.handleIncoming is called.
- Instead of accounting.Send, accounting.Receive is called.
- Instead of the message being sent as the final step in the process, the message is handled.
Marked in yellow are the decisions specifically related to thresholds.
- In this context, "over" means "greater than or equal to".

Discussion points

Based on this flow, it follows that:

Messages can be accounted for, even if they are not successfully sent or received. This is because there are multiple points of failure (some outside of this diagram) that come into play after the accounting is done. 1.1. This can cause a discrepancy between the balances of each of the nodes in an exchange. In turn, it can lead to further problems, such as a node requesting a cheque from another, the latter possibly rejecting it since it would leave it with a negative balance.
The disconnect threshold is checked before the accounting is done. Otherwise, each message rejected due to a tilted balance would tilt it even further, even though the message is dropped right after. 2.1. As a result, the first message that sends the balance over the disconnect threshold will still be sent/processed. Only messages that follow that one (assuming no payment is done) will be dropped.
Peers are not only dropped when the disconnect threshold is reached, but also in the case of other errors occurring, such as the balance not being loaded or saved successfully.

diegomasini commented 5 years ago

Does it make sense to just drop a peer because the balance cannot be loaded / persisted? Such situation should put the entire failing node offline, as far as I understand it failing to load / persist balances may represent an I/O error and thus a situation that needs to be handled by the owner of the node.

holisticode commented 5 years ago

Please review @zelig @homotopycolimit @nagydani

holisticode commented 5 years ago

@diegomasini, @mortelli just was depicting the current state.

But as a matter of fact, if we have a corrupt database, or something, then how do we deal with it in a clean way? Probably we should not deal with the remote peer until the issue is fixed, or balances will diverge, resulting in some (possibly more painful) drop later. The simplest approach is to drop the peer right away.

Having said that, I am in favor of a better handling, but then we need to specify what it could be.

diegomasini commented 5 years ago

I agree on dropping the peer, but it should also trigger a mechanism to put the failing node offline. To avoid other inconsistency errors.

mortelli commented 5 years ago

[...] Probably we should not deal with the remote peer until the issue is fixed, or balances will diverge, resulting in some (possibly more painful) drop later. The simplest approach is to drop the peer right away.

Keep in mind that even though the remote peer is the one that is dropped in this case, the failure to load or save a state indicates a problem in the local node, not the remote one.

diegomasini commented 5 years ago

[...] Probably we should not deal with the remote peer until the issue is fixed, or balances will diverge, resulting in some (possibly more painful) drop later. The simplest approach is to drop the peer right away.

Keep in mind that even though the remote peer is the one that is dropped in this case, the failure to load or save a state indicates a problem in the local node, not the remote one.

Exactly, that's my point.

cobordism commented 5 years ago

discrepancies in perceived debt balance between two peers is not fatal, as long as the disagreement amount is small compared to the payment thresholds (or more specifically smaller than the difference between payment and disconnect thresholds).

but of course it is not a desirable state of affairs.

my gut tells me: don't be too eager to drop peers. avoid it whenever possible.

holisticode commented 5 years ago

The question if you don't drop peers - what do you do? If we would just ignore peers, isn't that equivalent to just drop peers, because we would essentially not be dealing with them? If not dropped, we waste resources in maintaining the connection, the peer remains in the kademlia which marks it as a candidate for routing/syncing, but we block communication with it - I can't see the value of not dropping the peer so far.

vojtechsimetka commented 5 years ago

discrepancies in perceived debt balance between two peers is not fatal, as long as the disagreement amount is small compared to the payment thresholds (or more specifically smaller than the difference between payment and disconnect thresholds).

but of course it is not a desirable state of affairs.

my gut tells me: don't be too eager to drop peers. avoid it whenever possible.

@homotopycolimit are you suggesting that when a threshold is reached, the settlement is not for the full amount hence allowing some discrepancies in perceived debt balance? Because if the settlement is for the full amount any discrepancy would likely cause disconnect.

cobordism commented 5 years ago

I think so, yes.

Let's say that 80 is the threshold where you require payment. Assume also that you think your peer owes you 80 and they think they owe you 76. Now assume you ask for payment and they send you a cheque over 76, (or even 26 for that matter); why would you disconnect?

vojtechsimetka commented 5 years ago

Ah so you are suggesting the nodes just ask "pay me what you think you owe me" rather than "pay me x amount which you owe me". Brilliant! @holisticode @mortelli @diegomasini.

holisticode commented 5 years ago

But this is at a different level.

If you reach the payment threshold, the debtor issues a cheque. No disconnection takes place. If the creditor receives the cheque, the balance is restored to 0 (zero). If the cheque would not the full payment threshold value, the balance is restored to the difference (say 4). That is why the cheque has an Amount field.

So when a node sends a ChequeRequestMsg, it claims that a debtor reached payment threshold. Thus the debtor needs to check that:

It actually has a negative balance with the peer (otherwise ignore)
It checks its own balance with the peer, and emits a cheque. If its own balance is not equal the threshold value (it could also be more as more services may have been consumed since the request), it emits the cheque for an arbitrary value - could be bigger than threshold or lower. No disconnection takes place.

Disconnection is something completely different and happens at the disconnect threshold, which is some allowance above the payment threshold, and happens if no cheque at all has been sent and thus the debtor continues consuming from the creditor until it reaches that threshold.

And most of the above discussion was about a corrupt database, which as of decision in meeting https://github.com/ethersphere/user-stories/blob/master/doc/incentives/sync-2019-06-25.md is a fatal error, and thus should result in the local node going offline (and a remote peer drop first).

cobordism commented 5 years ago

If we would just ignore peers, isn't that equivalent to just drop peers, because we would essentially not be dealing with them? If not dropped, we waste resources in maintaining the connection, the peer remains in the kademlia which marks it as a candidate for routing/syncing, but we block communication with it

I never said 'ignore them'. I suggest refusing to serve retrieval requests coming from the offending node, other traffic is still allowable. In particular if you want to get data from them and they serve it to you (thus reducing the debt) that's fine. You only disallow traffic that would further increase the debt.

Disconnection is fine in lower Kademlia bins but gets problematic in the most proximate bin. If we can maintain connectivity there at the cost of blocking certain types of request, then that's probably preferable to a disconnect. The problem is that too many disconnects and blacklistings in the most proximate nbhd breaks our routing assumptions.

holisticode commented 5 years ago

I never said 'ignore them'. I suggest refusing to serve retrieval requests coming from the offending node, other traffic is still allowable. In particular if you want to get data from them and they serve it to you (thus reducing the debt) that's fine. You only disallow traffic that would further increase the debt.

This suggestion makes much sense. You say though "I suggest" - can you formalize this into a requirement? I assume this implies getting consensus and approval from the research track, in possibly written form. How do you guys handle that?

@mortelli I suggest that in order to implement this, we would have to create a custom error, which is raised if the "disconnect threshold" would be crossed, then when both Send and Receive return (p2p/protocols/protocol.go), we check against that error and just return nil if that error occurred (the message would not be sent). We would have to log this with high priority (I suggest WARN, or does it need to be even ERROR?) so that it becomes visible that this is happening.

zelig commented 5 years ago

Ah so you are suggesting the nodes just ask "pay me what you think you owe me" rather than "pay me x amount which you owe me". Brilliant! @holisticode @mortelli @diegomasini.

we cannot be so liberal, but indeed we should find a way to correct minor discrepancies resulting from sent but not received messages

cobordism commented 5 years ago

we cannot be so liberal,

sure we can. after all, the request is really just "pay me some amount to get us back closer to zero balance (under the threshold)".

If nodes always pay the full amount they think they owe, and if the discrepancy came about due to random errors, then we can assume they will even out over time. It is only systematic errors that will accrue over time to the point where they become real obstacles.

mortelli commented 5 years ago

I never said 'ignore them'. I suggest refusing to serve retrieval requests coming from the offending node, other traffic is still allowable. In particular if you want to get data from them and they serve it to you (thus reducing the debt) that's fine. You only disallow traffic that would further increase the debt.

I think we're not properly dealing with this scenario based on the the current flow. We could have this situation:

While below the disconnect threshold, node A requests chunks from node B.
Node B serves chunks to node A until the disconnect threshold is reached (A owes B).
With no payments being made, suppose node B later requests a chunk from node A.
When node B attempts to send the chunk request message, it will see that its balance with node A is over the disconnect threshold (A still owes B) and will drop the message.

In summary, while A owes B, B will not be sending or receiving any messages unless a payment is made, even if this message would reduce the incurred debt.

What if we ignored the balance threshold when the message price is negative?

@mortelli I suggest that in order to implement this, we would have to create a custom error, which is raised if the "disconnect threshold" would be crossed, then when both Send and Receive return (p2p/protocols/protocol.go), we check against that error and just return nil if that error occurred (the message would not be sent). We would have to log this with high priority (I suggest WARN, or does it need to be even ERROR?) so that it becomes visible that this is happening.

I like this. We could use it to make a distinction between errors raised during the accounting process and other errors–maybe this could be a way of dropping chunk messages but allowing the rest.

We could drop the messages for these custom errors, but:

When a threshold error is raised, do we drop the peer as well?
What do we do about messages with other types of errors? And their peers as well?

mortelli commented 5 years ago

Here are some potentially problematic situations. Please review if possible.

Starting point: Node `B` owes node `A`

Suppose nodes A and B have just gone over the disconnect threshold, due to a chunk being sent from A to B, resulting in the following balances:

nodeA.balances[nodeB] > disconnectThreshold
nodeB.balances[nodeA] = nodeA.balances[nodeB] * -1

Assume no payments were or will be made.

Situation 1: node `B` wants to request another chunk from node `A`

B will send a Retrieve Request Message to A for the chunk.
This type of message is payed by the sender.
Since the balance from B to A is negative at this point, it cannot be higher than the disconnect threshold.
After accounting for the price of the message it's sending, B will increase its debt to A. The message will then go through B's Send method.
A will receive the Retrieve Request Message from B for the chunk.
Since the balance from A to B is positive and over the disconnect threshold, A will drop B as a peer and skip the message accounting and handling. It will not go through A's Receive method.
As a result, the chunk will not be successfully requested, let alone transferred. However: there is already a discrepancy in the balances, since B accounted for a message that A did not.

Situation 2: node `A` wants to request a chunk from node `B`

Suppose A has not dropped B yet, or has added it as a peer again after doing so.
A will send a Retrieve Request Message to B for the chunk.
This type of message is payed by the sender.
Since the balance from A to B is positive and over the disconnect threshold, A will drop B as a peer and skip the message accounting and sending. It will not go through A's Send method.
As a result, A will not be able to request a chunk from B, even though doing so would reduce B's debt to A.
Proposal: when accounting, ignore the disconnect threshold if the message price is negative. This would mean that when the balance is positive, we'd be letting through messages which would reduce the debt.

Thoughts?

mortelli commented 5 years ago

Following up on this one:

Here are some potentially problematic situations.

Starting point: Node B owes node A

Suppose nodes A and B have just gone over the disconnect threshold, due to a chunk being sent from A to B, resulting in the following balances:
nodeA.balances[nodeB] > disconnectThreshold
nodeB.balances[nodeA] = nodeA.balances[nodeB] * -1
Assume no payments were or will be made.

We now have a PR that will cover this scenario:

Situation 2: node A wants to request a chunk from node B

Suppose A has not dropped B yet, or has added it as a peer again after doing so.

A will send a Retrieve Request Message to B for the chunk.

This type of message is payed by the sender.

Since the balance from A to B is positive and over the disconnect threshold, A will drop B as a peer and skip the message accounting and sending. It will not go through A's Send method.

As a result, A will not be able to request a chunk from B, even though doing so would reduce B's debt to A.

Proposal: when accounting, ignore the disconnect threshold if the message price is negative. This would mean that when the balance is positive, we'd be letting through messages which would reduce the debt.

This one should be solved through #1922 (which closes #1921).

ethersphere / swarm