AppLayerLabs / bdk-cpp

MIT License
7 stars 12 forks source link

Fix multiple P2P bugs #121

Closed fcecin closed 2 months ago

fcecin commented 2 months ago

This PR in a nutshell:

Detailed:

Fix non-synchronized access of Session::socket (via single Session::strand)

The socket object from Boost ASIO is not thread-safe, so any use of any single socket object by more than one thread is required to be synchronized, otherwise we get UB (actually, it is worse than UB, which is "maybe UB", which depends on how much and how we are hitting the socket object and how much UB is avoided by implementation accidents).

One way to solve it is by only interacting with each Session::socket object through its singleton Session::strand inside the I/O handlers. This should be sufficient, given that (a) the I/O operations invoked in the handlers are already asynchronous (non-blocking), and (b) there should be no pressing need to achieve any further read-handler vs. write-handler parallelism while handling a single remote client (we have multiple peers to deal with at any given time anyways). If we still want to have intra-peer-connection parallelism in socket I/O handlers, then we can at some point explore using a mutex to protect the socket object (which is more error-prone and isn't clearly better).

P2P Double Connection problem:

When two nodes initiate OUTBOUND Sessions (i.e. connect()) to each other simultaneously, there is a chance they will register both Session objects of the same ConnectionType (OUTBOUND/INBOUND) on both ends. For example, they can register both OUTBOUND objects. What happens next is that both INBOUND sessions, at both ends, will be detected as redundant and closed, which results in both connections dying and e.g. the Discovery process restarting the process again. That was at the heart of both random disconnects after the sessions map was filled and random timeouts in tests while the Discovery process tries to get the ordering of events to go their way and avoid the P2P Double Connection problem by luck.

The solution involves creating a deterministic criteria on both ends to decide when to close an INBOUND connection or when to replace one's own previous OUTBOUND connection with the new INBOUND. This is decided on both ends based on the ordering of the NodeIDs.

P2P multiple outgoing connections problem:

Since ManagerBase::connectToServer() had no way of knowing whether a Session object was already created to carry out an async_connect() to a given remote NodeID, multiple calls to that method could create multiple such Sessions, meaning there could be multiple socket connection attempts to the same remote node in flight at the same time.

To avoid that, OUTBOUND Session objects are registered upon creation. That actually solves three problems: first, you won't create another connection if you already have one being attempted; second, it helps implement the solution to the Double Connection problem; and third, there's no reason we can't set the Session::nodeId_ in the constructor of OUTBOUND sessions, which helps with logging and diagnostics (the remote listen port number is just news after a handshake for INBOUND sessions).

Synchronization & state-transition problems:

Greatly improved synchronization code that should be all correct now, and implemented hopefully correct and complete state modeling and transitions for all P2P/networking objects.

Ensure no ~io_context vs. ~Session (~strand, ~socket) races can occur during ManagerBase::stop()

In tandem with revamped net state synchronization and state transition code, ManagerBase::stop() is now ensuring that io_context.stop() will not race with any ~Session destructor, which would cause a difficult-to-reproduce heap-use-after-free.

Ensure P2P application layer does not receive callbacks from stale/dead sessions

In tandem with revamped net state synchronization and state transition code, only (Session::registered_ == true) sessions will call back the ManagerBase to notify of events in the Session. Thus, non-registered sessions are prevented from communicating anything to the application and interfering with it.

Code refactoring & polishing

Many minor improvements around logging, setting socket options, handling errors, code organization, naming, etc.

Testing (so far):

About ~2000 [p2p] test runs, plus 40 full-suite runs (total, during development).

After the code freeze (just polishing the commit), ran another ~120 [p2p] runs and 10 full runs.

Tested with 1 net thread per node and with 20 net threads per node.

Passed an additional > 100 test runs of [p2p],[net],[rdpos],[state] with --netthreads 20

bdk-cpp-sonarqube-app[bot] commented 2 months ago

Quality Gate passed Quality Gate passed

Issues
0 New issues
14 Fixed issues
2 Accepted issues

Measures
0 Security Hotspots
55.3% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube