Refactor send-message business logic in Connections

freimair commented 4 years ago

_This is a Bisq Network project. Please familiarize yourself with the project management process._

Description

The business logic of sending messages needs some love. During https://github.com/bisq-network/bisq/pull/4047, it became apparent that the logic might fail to send some messages at all. Main takeaways from the project:

long term: ground work that improves reliability.
short term: maybe get rid of spurious message loss (during trade or mediation)

Rationale

Why is it important?

Messages are submitted to the connection asynchronously; thus, chances are that messages do not get sent because threads are abandoned, killed, or time out, or because a connection is closed before all messages in the queue are sent
definitely explains why we see nasty walls of exceptions on app shutdown quite frequently
concrete example: removeOfferMessages on shutdown may or may not be sent, depending on the timeouts and therefore on the performance of the host, network load, message load of the bisq app, seed node load, ...
it can happen for more crucial messages (messages are messages are messages, there are no priorities built into them)
might explain why we see messages getting lost

IMO:

I consider this a high priority task
given my >1,5 years experience with the p2p part of Bisq
the p2p message handling needs cleanup and refactoring: the technology is outdated, changing stuff is a minefield, there is synchronization everywhere, which immediately causes deadlocks on the slightest change, attack countermeasures are scattered throughout the code, making it almost impossible to understand how/if they work, let alone control and tweak them, and copy-and-pasted spaghetti code provides plenty of places for bugs to hide in
however, I cannot provide a concrete issue # that will be fixed by working on this

Why should it be done now? What will happen if we don't do it or delay doing it?

consider it as basic maintenance
thus, no, it does not have to be done now
delaying it will work as well

however,

we might just see a more robust network
fewer lost messages
eliminate unforeseen deadlock situations
confine timing issues to the Connection, where they can be (at least) handled somehow
track messages and see if they are actually sent

Criteria for delivery

[ ] have a test suite for message sending BL
[ ] more robust code

Measures of success

cleaner code
maybe catch a few bugs we are not aware of yet

Risks

as always, changing the P2P part of Bisq carries the highest risk
this one only touches the message sending business logic, so the risk is somewhat confined. Yet, if nobody can send messages, the network is going to die as well.

Tasks

[ ] create test suite for message sending business logic (i.e. on Connection level)
[ ] implement a proper message queue for messages to be sent
[ ] implement a proper connection shutdown process, move away from dropping anything instantly
[ ] gradually remove "external" message scheduling mechanics
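
To make the queue and shutdown tasks above a bit more tangible, here is a minimal sketch of what "a proper message queue" with a draining shutdown could look like. It is not Bisq's Connection code: `QueuedSender`, the `transport` callback and the plain-string messages are all hypothetical stand-ins, and it uses only `java.util.concurrent`. The idea is that every send returns a future that completes only once the message has been handed to the transport, and that shutdown first stops accepting new messages, then waits for the queue to drain, and only then drops what is left.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch, not a Bisq class: one FIFO send queue per connection.
class QueuedSender {
    // A single worker thread keeps messages in submission order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    // Futures of messages that were accepted but not yet sent (or failed).
    private final Set<CompletableFuture<Void>> pending = ConcurrentHashMap.newKeySet();
    // Assumption: whatever actually writes to the socket is passed in as a callback.
    private final Consumer<String> transport;

    QueuedSender(Consumer<String> transport) {
        this.transport = transport;
    }

    /** Enqueues a message. The future completes only after the transport took it. */
    CompletableFuture<Void> send(String message) {
        CompletableFuture<Void> result = new CompletableFuture<>();
        pending.add(result);
        result.whenComplete((ok, error) -> pending.remove(result));
        try {
            worker.execute(() -> {
                try {
                    transport.accept(message);
                    result.complete(null);
                } catch (RuntimeException e) {
                    result.completeExceptionally(e);
                }
            });
        } catch (RejectedExecutionException e) {
            // The queue is already closed: tell the caller instead of dropping silently.
            result.completeExceptionally(e);
        }
        return result;
    }

    /** Stops accepting messages and waits for the queue to drain. Returns true if it did. */
    boolean shutdown(long timeoutMillis) {
        worker.shutdown(); // already queued messages are still sent
        boolean drained = false;
        try {
            drained = worker.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        if (!drained) {
            worker.shutdownNow();
            // Mark everything still unconfirmed as failed so callers can resend later.
            for (CompletableFuture<Void> f : new ArrayList<>(pending)) {
                f.completeExceptionally(new IllegalStateException("not sent before shutdown"));
            }
        }
        return drained;
    }

    public static void main(String[] args) {
        List<String> wire = Collections.synchronizedList(new ArrayList<>());
        QueuedSender sender = new QueuedSender(wire::add);
        sender.send("RemoveOfferMsg");
        sender.send("CloseConnectionMsg");
        boolean drained = sender.shutdown(2000);
        CompletableFuture<Void> late = sender.send("TooLate");
        System.out.println("drained=" + drained + " wire=" + wire
                + " lateRejected=" + late.isCompletedExceptionally());
    }
}
```

With something along these lines, the business logic would no longer have to guess: it either sees the future complete, or it sees it fail and can decide to resend. Timeouts would live in one place (the `shutdown` call) instead of being spread over the callers.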

Estimates

hard to say, as the project will only show its true face once we are knee-deep into it.

Task	Amount [USD]
create test suite	1800,00
message queue	900,00
remove "external" scheduling	1200,00
testing	700,00
other	500,00
total	5100,00

Notes

supersedes https://github.com/bisq-network/bisq/issues/4105
followup to https://github.com/bisq-network/bisq/pull/4047

chimp1984 commented 4 years ago

short term: maybe get rid of spurious message loss (during trade or mediation)

I doubt that it has a big impact on that. At shutdown of headless nodes there is/was no graceful shutdown, but they don't send crucial messages (trade, dispute). GUI clients do a graceful shutdown and the only case where it might be critical is when a crucial message was sent and the user immediately shuts down the app (or kills it hard). Even with a graceful shutdown there should be enough time to deliver the messages. The messages queued up are usually few (batching did not work as expected, and most of the time there is no batching).

freimair commented 4 years ago

Even with a graceful shutdown there should be enough time to deliver the messages.

Actually, before https://github.com/bisq-network/bisq/pull/4047, even with a "graceful shutdown", Tor was terminated before all messages had been flushed out. No RemoveOfferMsg, no CloseConnectionMsg. Also, there is no central queue, which could have severe consequences. Here is the scenario:

A critical message might be "queued" in a UserThread.runAfter(>0, connection.sendMessage(.))
Thus, the connection does not know about the message yet
The business logic triggering the UserThread.runAfter assumes it has been sent (and, depending on the message, may also record that information)

Now, if the client gets shut down before the message is sent, the business logic has no way of knowing that the message hasn't been sent (so no resend). Thus, we have a "lost" message.
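
The scenario can be reproduced in a few lines. This is only an illustration under assumptions: a `ScheduledExecutorService` stands in for the UserThread timer, an `AtomicBoolean` stands in for connection.sendMessage(.), and `DelayedSendDemo` and its fields are made-up names, not Bisq code.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the scenario; none of these names exist in Bisq.
class DelayedSendDemo {
    // Flips to true once the message actually reaches the "connection".
    final AtomicBoolean handedToConnection = new AtomicBoolean(false);
    // The business logic's own bookkeeping ("I have sent this").
    boolean assumedSent = false;

    /** Schedules a delayed send, then shuts down first. Returns the number of dropped tasks. */
    int run() {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        // The send is parked outside the connection, which does not know about it yet.
        timer.schedule(() -> handedToConnection.set(true), 5, TimeUnit.SECONDS);
        // The caller records success right away.
        assumedSent = true;
        // Shutdown arrives before the delay elapses: the pending task is
        // handed back by shutdownNow() and never executed.
        List<Runnable> dropped = timer.shutdownNow();
        return dropped.size();
    }

    public static void main(String[] args) {
        DelayedSendDemo demo = new DelayedSendDemo();
        int dropped = demo.run();
        System.out.println("dropped=" + dropped
                + " assumedSent=" + demo.assumedSent
                + " handedToConnection=" + demo.handedToConnection.get());
    }
}
```

The business logic ends up believing the message was sent while the connection never saw it, and nothing in between notices the mismatch.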

The messages queued up are usually few

Give it enough time and trials and it will happen.

At shutdown of headless nodes there is/was no graceful shutdown

The issue we have been/are facing here is that the data store files got corrupted frequently. A graceful shutdown did help some. However, #25 and #29 will complement this very issue.

chimp1984 commented 3 years ago

@ripcurlx @cbeams Can we close that project?

cbeams commented 3 years ago

Closing as rejected.
