Prevent RPC calls for Unauthenticated `NodeConnection`s (Segregated Network Connections)

amydevs commented 1 month ago

Specification

In order to segregate nodes of different networks using the ClaimNetworkAccess tokens defined in https://github.com/MatrixAI/Polykey/issues/779, there needs to be some logic for nodes to prevent accepting RPC calls from nodes that are out-of-network. There will need to be some RPC calls that are whitelisted to enable some level of out of network negotiation such as authenticating into the network or requesting access.

Currently a NodeConnection has two stages: the creating stage, while you are awaiting the createNodeConnection factory, and the connected stage, when the NodeConnection has fully connected and the creation resolves. The problem with this is that we need to be able to authenticate without allowing most RPC calls and nominal traffic.

To this end we need a connected but unauthenticated state where the NodeConnection is fully operational but rejects RPC calls instead of entertaining them. NodeConnection.createNodeConnection should negotiate its access to the network before resolving with the completed connection. This means the connecting and authenticating states are hidden inside the creation of the NodeConnection, so there will be minimal changes to how the NodeConnectionManager handles these connections. However, the NodeConnection will need separate connected, authenticated and created events to reflect these stages.

Connection event should be recorded by the audit domain as a connection
Authenticated event should be recorded by the audit domain as well. Probably a new authenticated event. There will be a distinct forward and reverse authentication event.
Not actually sure about the creation event; it fills the same role as authenticated, so it is subject to discussion.
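
The stages and events described above could be modelled as a small state machine. This is a minimal sketch, not Polykey's implementation: the state names, the `ConnectionLifecycle` class and the `events` array (standing in for real event dispatch and the audit domain) are all hypothetical.

```typescript
// Hypothetical lifecycle states; names are illustrative only.
type ConnectionState =
  | 'connecting'
  | 'authenticating'
  | 'authenticated'
  | 'destroyed';

// Which states each state is allowed to move to.
const transitions: Record<ConnectionState, ConnectionState[]> = {
  connecting: ['authenticating', 'destroyed'],
  authenticating: ['authenticated', 'destroyed'],
  authenticated: ['destroyed'],
  destroyed: [],
};

class ConnectionLifecycle {
  public state: ConnectionState = 'connecting';
  // Events emitted so far, in order. Stands in for an event emitter or audit log.
  public readonly events: string[] = [];

  protected transition(next: ConnectionState, event: string): void {
    if (transitions[this.state].indexOf(next) === -1) {
      throw new Error(`Invalid transition: ${this.state} -> ${next}`);
    }
    this.state = next;
    this.events.push(event);
  }

  // Transport-level connection is up; RPC calls are still gated.
  public connected(): void {
    this.transition('authenticating', 'connected');
  }

  // Both forward and reverse authentication have succeeded.
  public authenticated(): void {
    this.transition('authenticated', 'authenticated');
  }

  public destroy(): void {
    this.transition('destroyed', 'destroyed');
  }
}
```

Skipping a stage, for example calling `authenticated()` before `connected()`, throws, so the authenticated event can never be recorded without a preceding connected event.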

While in the authenticating state, all non-whitelisted RPC calls need to be rejected outright. That said, the NodeConnection shouldn't be available to make these calls at this stage, but we still need to be secure about it. To avoid having to add logic to all of the RPC handlers, we should apply this connection rejection logic in the middleware. The middleware will need to refer back to the NodeConnection somehow to assess the authenticated state.
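
A rough sketch of that gate, written as a generic wrapper rather than against the real RPC middleware API. The `gateHandlers` function, the handler names and the `isAuthenticated` lookup are assumptions for illustration; throwing stands in for rejecting the call and killing the stream.

```typescript
type Handler = (params: unknown) => unknown;

// Wraps a method-name -> handler table so that every call is checked once,
// instead of adding the check to each handler.
function gateHandlers(
  handlers: Record<string, Handler>,
  whitelist: Set<string>,
  // Hypothetical lookup back to the connection's authentication state.
  isAuthenticated: (connectionId: string) => boolean,
): (connectionId: string, method: string, params: unknown) => unknown {
  return (connectionId, method, params) => {
    const handler = handlers[method];
    if (handler == null) {
      throw new Error(`Unknown method: ${method}`);
    }
    // Whitelisted methods always pass; everything else needs authentication.
    if (!whitelist.has(method) && !isAuthenticated(connectionId)) {
      throw new Error(`${method} rejected: connection is not authenticated`);
    }
    return handler(params);
  };
}
```

The same wrapper applies to calls arriving in either direction, since it only depends on the connection's state and the method name.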

To authenticate, the NodeConnection needs to make an authentication RPC call and provide a valid network token. It's up to the handler of this call to reject the authentication with an error and kill the NodeConnection if it fails. Note that this needs to be symmetric in the forward and reverse direction. BOTH sides of the connection need to fully authenticate. This opens us up to annoying race conditions, so we need to be extra careful here. The handshake has been described within #779.
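
One way to express the "both directions must succeed" rule is to wait on both outcomes together. This is only a sketch: `authenticateBothWays` and its three parameters are hypothetical, and the actual handshake is the one described in #779.

```typescript
// Resolves only when both directions succeed; rejects as soon as either fails.
async function authenticateBothWays(
  // Our authentication RPC call to the peer.
  forward: () => Promise<void>,
  // Settles once the peer's authentication call to us has been handled.
  reverse: Promise<void>,
  // Tears the connection down.
  destroy: () => Promise<void>,
): Promise<void> {
  try {
    // Promise.all observes both promises, so a late failure of the other
    // direction does not surface as an unhandled rejection.
    await Promise.all([forward(), reverse]);
  } catch (e) {
    await destroy();
    throw e;
  }
}
```

Because neither direction is treated as done until both have settled successfully, the order in which the two calls complete does not matter.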

It is extremely important that the following conditions are met.

A NodeConnection is only fully created if it fully authenticates.
The NodeConnection is only added to the NodeConnectionManager's connection map if it fully authenticates.
The connection details are only added to the NodeGraph if it fully authenticates.
No non-whitelisted RPC calls are allowed or made during the authenticating state.
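
The first three conditions amount to recording a connection only after authentication resolves. A minimal sketch, assuming a hypothetical `ConnectionRegistry` whose `connections` map and `nodeGraph` set stand in for the NodeConnectionManager's connection map and the NodeGraph:

```typescript
interface Connection {
  nodeId: string;
  destroy(): Promise<void>;
}

class ConnectionRegistry {
  // Stand-ins for the real connection map and NodeGraph.
  public readonly connections = new Map<string, Connection>();
  public readonly nodeGraph = new Set<string>();

  // Returns true if the connection authenticated and was recorded.
  public async admit(
    connection: Connection,
    authenticate: (connection: Connection) => Promise<void>,
  ): Promise<boolean> {
    try {
      await authenticate(connection);
    } catch (e) {
      // Nothing has been recorded yet, so only cleanup remains.
      await connection.destroy();
      return false;
    }
    // Only reached after full authentication.
    this.connections.set(connection.nodeId, connection);
    this.nodeGraph.add(connection.nodeId);
    return true;
  }
}
```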

Additional context

Related: #779 - Defines the network access tokens and how they are verified. Related: #770 - Parent issue

Tasks

Add an authentication state to the NodeConnection.
Add middleware logic to RPC to only allow non-whitelisted RPC calls to be handled if we are in the authenticated state. Otherwise kill the stream involved.
Add an RPC handler for authenticating the connection's access to the network. This will resolve with no message, or throw an error explaining why it failed to authenticate. This should be whitelisted.
Review and expand on the creation events to include separate connection, authentication and creation events.
NodeConnection must switch to the authenticated state and fully create only after both the authentication handler and the authentication call succeed and resolve.
We need to make sure timeouts work here. There are two levels of timeouts. It's up to the handler side to kill the connection if it fails, but the calling side can time out and kill the connection as a fail-safe. We also need an overall timeout for the whole process, so if it fails to authenticate fully before the timeout then we just give up and kill the connection.
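
A sketch of a timeout wrapper that could serve both levels. `withTimeout` is a hypothetical helper, not an existing Polykey utility; `onTimeout` is where the connection would be killed.

```typescript
// Rejects if `task` has not settled within `ms`, calling `onTimeout` first.
function withTimeout<T>(
  task: Promise<T>,
  ms: number,
  onTimeout: () => void,
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      onTimeout();
      reject(new Error(`Timed out after ${ms}ms`));
    }, ms);
    task.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}
```

The caller-side fail-safe would wrap the single authentication call, and the overall deadline would wrap the combined forward and reverse process with a longer `ms`. If the task settles after the timer has fired, the late result is ignored because the outer promise has already rejected.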

linear[bot] commented 1 month ago

ENG-373 Network Awareness of `NodeManager`

tegefaulkes commented 1 month ago

@amydevs This will be easier to start with if the token logic is bending your mind. The only point of contact with the token logic here is the authentication logic utility function and the token payload. You should stub both of these out in testing.

CMCDragonkai commented 1 week ago

What's the status of this? @amydevs @tegefaulkes

tegefaulkes commented 1 week ago

This still needs to be worked on. I'll be taking it over while @amydevs starts on the PKE work with the DB domain.

tegefaulkes commented 1 week ago

With #775 being merged, the underlying data structure for separating networks has been implemented. This issue will focus on managing the connections in a way that keeps the networks separate. It will also only allow certain RPC calls during the connection's unauthenticated state.

When this issue is completed then we should have everything implemented to separate public networks. Further expansion to the authentication token logic will need to be done to allow for private networks later down the line.

amydevs commented 1 week ago

After handing this off to Brian, we discussed several options for implementing this.

The most intuitive way would be to make sure that the static createNodeConnection function awaits the authentication process before returning with a NodeConnection.

However, this presents problems:

The NodeConnectionManager owns the RPCServer and also makes the calls to createNodeConnection. The resolution of the createNodeConnection promise depends on a call handled by the RPCServer. However, at that point, the NodeConnection has not been created yet, so there is no conceivable way to notify the createNodeConnection method call that authentication has finished.
In order to keep this data model, we would need to have the NodeConnection be available during the handling of the RPC call that authenticates the peer. This is not possible as the RPCServer instance is established per NodeConnectionManager rather than per NodeConnection.
The only other conceivable way to fix this is to pass an async callback/promise to the createNodeConnection that will resolve once the connection has been authenticated.
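
A sketch of what that last option might look like: the caller creates a promise it controls, hands it to the factory, and the shared RPC handler settles it by connection ID. All names here (`createAuthSignal`, `pending`, `createConnection`, `handleAuthenticate`) are hypothetical, and token verification is reduced to a boolean.

```typescript
interface AuthSignal {
  promise: Promise<void>;
  resolve: () => void;
  reject: (reason: Error) => void;
}

// A promise whose settlement is controlled from outside.
function createAuthSignal(): AuthSignal {
  let resolve: () => void = () => {};
  let reject: (reason: Error) => void = () => {};
  // The executor runs synchronously, so both functions are set before return.
  const promise = new Promise<void>((res, rej) => {
    resolve = () => res();
    reject = rej;
  });
  return { promise, resolve, reject };
}

// Signals for connections that are still authenticating, keyed by connection ID.
const pending = new Map<string, AuthSignal>();

// Stand-in for the factory: resolves only once the signal resolves.
async function createConnection(
  connectionId: string,
  authenticated: Promise<void>,
): Promise<string> {
  // Transport setup would happen here.
  await authenticated;
  return connectionId;
}

// Stand-in for the shared RPC handler, which has no direct connection reference.
function handleAuthenticate(connectionId: string, tokenValid: boolean): void {
  const signal = pending.get(connectionId);
  if (signal == null) {
    throw new Error(`No pending connection: ${connectionId}`);
  }
  pending.delete(connectionId);
  if (tokenValid) {
    signal.resolve();
  } else {
    signal.reject(new Error('Authentication failed'));
  }
}
```

The manager would register a signal in `pending` before calling the factory, so the handler can reach it without needing a per-connection RPCServer.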

CMCDragonkai commented 1 week ago

I was thinking that node connections are a lower level, and you should just do your gated calls at the RPC layer. Remember it's an application layer concern that they aren't on the same network. You have to check their sigchain claim. I don't think node connections, being a lower level concern, should even be aware of this problem. Factor out the abstraction to solve this.

CMCDragonkai commented 1 week ago

Importantly don't mix up abstraction layers. Otherwise the entanglement will cause modularity problems in the future.

tegefaulkes commented 1 week ago

I was hoping at the time to isolate the changes to the NodeConnection so we wouldn't need to modify the NodeConnectionManager at all. After discussing this with @amydevs and mulling it over, it seems that the best place to manage this logic will be in the NodeConnectionManager itself.

This means that the NCM will coordinate the NC authentication process and only make the connection available to be used after that has been completed. We should be able to get away with not allowing most RPC calls since we can prevent access to the NC in question until it has been verified to be part of the network.

CMCDragonkai commented 1 week ago

Can you disallow ALL RPC calls? Usually RPC authentication would be an RPC middleware. But if you put it into NCM, then you'd need to call the RPC in the NCM. That would also mean NCM ends up having knowledge about sigchain claims. I feel like that's too much knowledge built into NCM. Seems like an NM sort of thing to know. Remember NCM is just NCs, but NM can do higher level semantic operations like knowing about the sigchain and making interpretations on it. NM seems like a better place for all this logic.

tegefaulkes commented 1 week ago

We can, but there are two aspects to it.

1. If we can't access the node connection from the NCM then we can't make the call in the first place. That said, we shouldn't entertain receiving them either.
2. We need to allow for some RPC calls to allow negotiating the authenticating stage or requesting access to the network.

CMCDragonkai commented 1 week ago

I don't know what you mean by 1. But I thought we can make node connections established simply because of quic. Then sigchain has to be checked at a higher abstraction layer?

tegefaulkes commented 1 week ago

That's what we're doing, yes. There are two levels of authenticating the connection: the normal connection level done by QUIC, which will be unchanged, and the higher level where we negotiate authentication for the network. Point 1 is only saying that we can maybe get away without disabling the RPC calls, since we can't make them without access to the NodeConnection and we can't get the NodeConnection from the NCM without it being authenticated.

But since we can't control when the reverse RPC requests can be made, we'd probably have to enforce it for the handlers via the reverse middleware anyway. May as well handle it for both directions at that point.

CMCDragonkai commented 5 days ago

When this is merged, that means testnet and mainnet are separated.
