MatrixAI / Polykey

Polykey Core Library
https://polykey.com
GNU General Public License v3.0
31 stars 4 forks source link

Test communications when root key pair is renewed and NodeID is changed #317

Closed emmacasolin closed 2 years ago

emmacasolin commented 2 years ago

Specification

When the root keypair is renewed, we expect the old nodeId and client certificate to still be valid (in constrast to when it is reset).

From https://github.com/MatrixAI/js-polykey/issues/312#issuecomment-1016137891:

Changing the node id will impact communications between:

Under Agent to Agent, this is going through the forward and reverse proxies, and also the GRPCClientAgent. The GRPC client does not use TLS but the TLS is being done between forward and reverse proxies. This means that when we perform a keyManager.renewRootKeyPair, the old NodeId is still valid until the certificate expiration date. This means nodes contacting the changing agent with the old NodeId should still work, but receive an automatic update to the NodeGraph when the node id changes. I believe the logic to check the cert chain is in network/utils.ts verifyServerCertificateChain, but the logic to update the node id in the nodegraph is not verified yet.

Under CLI to Agent and GUI to Agent, they are using direct TLS on the GRPCClient. Existing GRPCClient connections should continue to work and not be interrupted by the TLS change. I believe I had tests for this under tests/grpc/GRPCServer.test.ts as changing the private key and certificate on the fly. However new connections will need to be made with a new target Node ID. This can occur by reading the status each time a new connection is made. However I think a similar TLS verification logic could take place where the old node ID can still be valid.

So we need tests to ensure this is the case. One for tests/grpc/GRPCClient.test.ts to ensure that the GRPCClient node id can still work. Another for tests/network as well.

Then we would extend these tests into tests/nodes for NodeConnection, and also in tests/bin to test CLI situation.

Additional context

Tasks

  1. Check logic for updating the certchain and node id
  2. TLS verification to allow old node ids to still be used
  3. Tests for GRPCClient, network domain, nodes, and CLI
CMCDragonkai commented 2 years ago

From https://github.com/MatrixAI/js-polykey/pull/326#issuecomment-1041133623

@joshuakarp when the await this.nodeManager.refreshBuckets(); is executed, it says it must be done upon key renewal.

However how does it get access to the new key? Is it calling the KeyManager internally? For something like this, it'd make sense to pass the the new key if it has to use it. Also is there any downsides to this? Seems like something to incorporate into #317.

This should be incorporated to the testing here.

CMCDragonkai commented 2 years ago

Changes from #326 will now bring in recovery code information when the keypair changes too.

But this revealed that pk keys commands should also report the new recovery code if the keypair changes. This is something that should be checked too.

CMCDragonkai commented 2 years ago

Any existing connections proxy connections or node connections should continue to work without problems, however asking for the node ID may give a different answer than the expected original node ID during connection establishment.

Furthermore, if a node is a aware that there is a newer node ID, it should proceed to update its own NodeGraph with this information, as well as any pre-existing relationships that the old node ID had. So for example if old node ID was part of a gestalt that had certain permissions, then the new node ID is now part of that gestalt as well.

The user should be notified somehow about the fact that they are using an old node ID. And that they should switch to the new node ID. However if they continue to try to contact the old node ID on the CLI or GUI, then it should still work as long as the old node ID is still part of the valid certificate chain. This should mean that we don't delete the old node ID.

Any garbage collection of the old node ID should only occur lazily. Not sure exactly when, but perhaps as part of a natural forgetting process of the PK network.

tegefaulkes commented 2 years ago

There are 3 kinds of key reset.

We expect there to be issues with chaning the root keyPair of the receiving side since that's the side that is authenticated. The conditions for a failure to authenticate would be the target having a new keyPair AND the old one not in the cer chain.By theses conditions only resetRootKeyPair will cause a node connecting with old details to fail.

I'd expect RenewRootKeyPair to still authenticate properly since it will keep the only cert on the cert chain. But currently it is failing so I need to look into it. I suggests that the authenticate process is not checking the old certs in the chain properly. Or the old cert is not properly maintained in the chain.

When trying to connect with bad authentication we get the following error out of the proxy in the logger ERROR:NodeConnection test:Failed CONNECT to 127.0.0.1:34670 - ErrorCertChainUnclaimed: Node ID is not claimed by any certificate. But this error doesn't propigate down to the NodeConnection so as far as the NodeConnection can tell it's just failing to connect and times out. Shouldn't we be throwing some kind of authentication error instead?

Also just a note. I can get the tests done today but that's just to check that things stil work as expected when reseting the root keypair. The aspect of this issue where a connecting node will notice that the nodeId/cerChain has been updated and reacts accordingly still needs to be speced out and done. I don't think that it is a strict requirenment for #378 so we can extract that part of the issue out and do it at a later date.

CMCDragonkai commented 2 years ago

@emmacasolin this NAT testing should eventually incorporate this as well, so that when a keypair changes, any connection should continue to work even through NAT busted connections, or eventually through NAT relayed connections.

tegefaulkes commented 2 years ago

There is a problem with verifying the certchain when establishing a connection. There seems to only be 1 cert in the cert chain. Even when there are multiple certs in the certChain in the KeyManager. I'm digging into it.

It seems like the receiving side only provides one certificate for the connection. We should be providing the whole chain.

tegefaulkes commented 2 years ago

So the keychange is meant to affect the sigchain and gestaltgraph in some what. lets start by asking what these are...

CMCDragonkai commented 2 years ago

Yea key change is a big problem, but that's why I wanted to focus only on maintaining the node connection for now and tackle the key change problem for the other domains separately. Good to keep these in mind for a new issue.

tegefaulkes commented 2 years ago

Ok then, all the connection tests are working except for one case. I'll fix that and then this should be done. The details in the comments above can be part of a new issue tacking this general key change problem.

CMCDragonkai commented 2 years ago

A portion of this work was done in #378, only addressing maintaining node connections while keypair is changed. @tegefaulkes please create a new issue for general keypair change testing.

tegefaulkes commented 2 years ago

Essentially what was addressed in this issue was handling the root cert changing and by extension the local Agent's NodeId. This means that when the root cert is changed we

  1. Update the status file with the new information.
  2. Update the tlsConfig for the grpcServerClient and Proxy
  3. Reset the NodeGraph buckets with NodeManager.resetBuckets().

What hasn't been addressed is handling remote nodes who's NodeId's have changed. We need to be able to

  1. detect when a node's NodeId has changed.
  2. treat the new NodeId as we would with it's old NodeId.
  3. update our internal information having anything related to the old NodeId be updated to use the new NodeId. This process needs to happen lazily. Domains such as ACL, Gestalt Graph and SigChain will need to handle this.

I will create a new issue specifying what still needs to be done.

CMCDragonkai commented 2 years ago

Please reference the files where these edits were made too. Also when you create the new issue referencing this one, close this one.

tegefaulkes commented 2 years ago

I worked on adding tests for this issue. the bulk of the work was done by Emma. As far as I know these changes were made.

I'm writing up the new issue now.

tegefaulkes commented 2 years ago

New issue has been created at #386.