MatrixAI / Polykey

Polykey Core Library
https://polykey.com
GNU General Public License v3.0

Incremental/Progressive Update using the Network Version of Polykey Agents #658

Open CMCDragonkai opened 11 months ago

CMCDragonkai commented 11 months ago

Specification

When updating Polykey we have 3 kinds of versions to deal with:

  1. Source Version in the form of MAJOR.MINOR.PATCH
  2. State version as per #287
  3. Network version currently set as serviceVersion

All 3 of these versions are already configured under src/config.ts:

  /**
   * Version of source code
   * This must match the package.json
   */
  sourceVersion: version,
  /**
   * Version of the state, persisted into the node state
   * It is only incremented on breaking changes
   * Use this to know if you have to do a schema-upgrade
   */
  stateVersion: 1,
  /**
   * Version of the RPC and HTTP service
   * It is only incremented on breaking changes
   * Use this to know if you must upgrade your service client
   */
  serviceVersion: 1,

The serviceVersion should probably be called the networkVersion.

These versions need to be exposed and used to allow "generations" to exist on the network.

The first thing to do is to ensure that this version information can be acquired from the agent status command, and also by an RPC call to the agent. This is important for several reasons:

  1. It allows the PKClient to know whether it is compatible with the PKAgent.
  2. It allows other PKAgents to know whether they are compatible with the PKAgent being contacted.
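
A minimal sketch of what a caller could do with this information once it is exposed. The `VersionInfo` type and `isNetworkCompatible` function are hypothetical names, not part of Polykey's API; only the three version fields come from src/config.ts, with serviceVersion already renamed to networkVersion. It assumes compatibility means equal network versions.

```typescript
// Hypothetical shape of the version data returned by the agent status
// command or by an RPC call to the agent.
type VersionInfo = {
  sourceVersion: string; // MAJOR.MINOR.PATCH, matches package.json
  stateVersion: number; // incremented on breaking state changes
  networkVersion: number; // incremented on breaking network changes
};

// Assumption: two parties are compatible only when their network versions
// are identical; source and state versions are ignored.
function isNetworkCompatible(local: VersionInfo, remote: VersionInfo): boolean {
  return local.networkVersion === remote.networkVersion;
}

const client: VersionInfo = {
  sourceVersion: '1.2.0',
  stateVersion: 1,
  networkVersion: 1,
};
const agent: VersionInfo = {
  sourceVersion: '1.0.0',
  stateVersion: 2,
  networkVersion: 1,
};
console.log(isNetworkCompatible(client, agent)); // true
```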

For both to be achieved, this version information has to be available "earlier", at the transport level.

The idea is to encode this information into the root certificate that is presented at the beginning of the TLS handshake.

We can add the version information as properties to the root certificate. In particular the networkVersion should be present in the root certificate.

The custom TLS verification function can then check this network version for compatibility, and raise a custom TLS error if it is not compatible.

The idea is that the network version number represents a particular "generation" of PK agents. Each network version represents a compatible set of PK agents regardless of the source version or state version. We keep it simple, so that an agent of network version 1 can only talk to other agents presenting network version 1, and an agent of network version 2 can only talk to other agents of network version 2. The same goes for the PK client: a PK client of network version 1 can only talk to a PK agent of network version 1. We won't bother with any sort of backwards compatibility between network versions.
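
The exact-match rule could be enforced in the custom TLS verification step roughly as follows. This is only a sketch: `ErrorNetworkVersionMismatch`, `PeerCertInfo` and `verifyNetworkVersion` are hypothetical names, and how the network version is actually read out of the root certificate is left out. Treating a certificate without a network version as a mismatch is also an assumption.

```typescript
// Hypothetical error raised when the peer is on a different generation.
class ErrorNetworkVersionMismatch extends Error {
  localVersion: number;
  peerVersion: number | undefined;
  constructor(localVersion: number, peerVersion: number | undefined) {
    super(
      `Network version mismatch: local ${localVersion}, peer ${
        peerVersion ?? 'missing'
      }`,
    );
    this.name = 'ErrorNetworkVersionMismatch';
    this.localVersion = localVersion;
    this.peerVersion = peerVersion;
  }
}

// Hypothetical view of the fields extracted from the peer's root certificate.
type PeerCertInfo = {
  nodeId: string;
  networkVersion?: number; // assumed absent on certificates without this property
};

// Throws unless the peer presents exactly our network version.
// There is no backwards compatibility between generations.
function verifyNetworkVersion(localVersion: number, peer: PeerCertInfo): void {
  if (peer.networkVersion !== localVersion) {
    throw new ErrorNetworkVersionMismatch(localVersion, peer.networkVersion);
  }
}

// Usage: a version 1 agent rejecting a version 2 peer.
try {
  verifyNetworkVersion(1, { nodeId: 'peer-a', networkVersion: 2 });
} catch (e) {
  console.error((e as Error).message);
}
```

A dedicated error type lets the caller tell a generation mismatch apart from other TLS failures and report it to the end user.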

This means our software can change while keeping the network generation the same, as long as the versions are truly compatible with each other.

In the network dashboard #599, this version information should also be shown for all the seed nodes. In fact each generation of the network should have a separate dashboard, as they would form entirely separate networks. This allows us to create an upgrade/deprecation policy, where the new network might be version 2, while the old network of version 1 has 3 months to upgrade before its seed nodes get terminated. This doesn't prevent users of version 1 from continuing to operate; they just won't be able to connect to the seed nodes.

This also means when you connect via DNS to testnet.polykey.com and look at the SRV records, you probably want to connect only to the seed nodes that are relevant to your network version. Right now this information isn't in the SRV record metadata, so it should be added there too. However, the agent should still connect gracefully to the nodes of the right network version, and drop connections to nodes of the wrong network version.
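
As a rough illustration of the selection step, assuming each seed node candidate somehow carries a network version (which the SRV records do not provide today). `SeedCandidate` and `selectSeedNodes` are hypothetical names, and the hosts are placeholders.

```typescript
// Hypothetical seed node candidate. The `networkVersion` field is assumed to
// be published alongside the SRV data; it is undefined when unknown.
type SeedCandidate = {
  host: string;
  port: number;
  networkVersion?: number;
};

// Keeps only the candidates on our network generation. Candidates on another
// generation, or with no version, are skipped rather than treated as fatal.
function selectSeedNodes(
  candidates: SeedCandidate[],
  localVersion: number,
): SeedCandidate[] {
  return candidates.filter((c) => c.networkVersion === localVersion);
}

const candidates: SeedCandidate[] = [
  { host: 'seed-a.example', port: 1314, networkVersion: 1 },
  { host: 'seed-b.example', port: 1314, networkVersion: 2 },
  { host: 'seed-c.example', port: 1314 },
];
console.log(selectSeedNodes(candidates, 1).map((c) => c.host));
```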

Now also in the nodes domain, when connecting to another node, we should have custom TLS verification exceptions for different scenarios that we can react to and report to the end user.

I can imagine that for testnet.polykey.com we can create separate versioned names like 1.testnet.polykey.com or testnet.polykey.com/1. Note that the eth network uses special names like "goerli" or "prater"... etc.

Now when updating the PK CLI and creating distributions, it will be important to expose this additional version information in the metadata of the distributions. Perhaps this goes into image metadata, or into an expected name like 1.0.0-alpha.3-3-4, meaning 1.0.0-alpha.3 is the source version, while 3 is the state version, and 4 is the network version. You should probably see 1.0.0-1-1 most of the time.
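
A sketch of how such a combined name could be formatted and parsed, assuming the last two hyphen-separated numeric segments are always the state version and network version. `DistributionVersion`, `formatDistributionVersion` and `parseDistributionVersion` are hypothetical names for illustration.

```typescript
type DistributionVersion = {
  sourceVersion: string;
  stateVersion: number;
  networkVersion: number;
};

// Produces SOURCE-STATE-NETWORK, e.g. 1.0.0-alpha.3-3-4.
function formatDistributionVersion(v: DistributionVersion): string {
  return `${v.sourceVersion}-${v.stateVersion}-${v.networkVersion}`;
}

// Splits off the trailing two numeric segments; everything before them is
// taken as the source version, including any pre-release part.
function parseDistributionVersion(s: string): DistributionVersion {
  const match = /^(.+)-(\d+)-(\d+)$/.exec(s);
  if (match === null) {
    throw new Error(`Invalid distribution version: ${s}`);
  }
  return {
    sourceVersion: match[1],
    stateVersion: Number(match[2]),
    networkVersion: Number(match[3]),
  };
}

const parsed = parseDistributionVersion('1.0.0-alpha.3-3-4');
console.log(parsed.sourceVersion, parsed.stateVersion, parsed.networkVersion);
// prints: 1.0.0-alpha.3 3 4
```

One caveat with this naming: a name that omits the suffix but ends in a numeric pre-release, such as 1.0.0-1-2, would be indistinguishable from a suffixed one, so the suffix would have to be mandatory.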

Additional context

Tasks

  1. Change serviceVersion to networkVersion in src/config.ts
  2. Expose all 3 versions in the agent status command
  3. Create an agentVersion RPC client call and AgentVersion handler
  4. Create an agentVersion RPC agent call and AgentVersion handler - does this make sense, or should we be using NodesVersion instead?
  5. Integrate network version into the root certificate - this means when the version updates - the certificate will need to be renewed with new information - while being signed with the existing key (we don't rekey the root key just because the version has changed)
  6. Add network version incompatibility logic to the custom TLS verification for both client TLS and agent TLS
  7. Ensure that these problems are reported back to the end user on STDERR when it fails, with specialised exception or otherwise
  8. Report this information on our dashboards
  9. Integrate network version information into the SRV records
  10. Integrate network version information into the dashboards

CMCDragonkai commented 11 months ago

It is important to note that when a new network version occurs, there would be new seed nodes, and a new network status service (because the PolykeyClient might also be incompatible) to represent it.

The dashboard frontend is separate and should be capable of receiving data from all of them and showing them. Alternatively it could also be separately deployed like how eth does goerli and prater changes.

User nodes are likely to swap over to the new network version as they update their nodes. Some stragglers will stay on the old version, but eventually the seed nodes will be turned off for the older version.

tegefaulkes commented 11 months ago

I'm thinking that, since we need to include a bunch of information in the certificate to prove we're allowed in a network, we could create a temporary leaf certificate that contains all this information. This certificate won't be used as a link in the chain, but is generated when needed from the leaf certificate in the chain. This way we can cut back on creating a new certificate in the chain every time new information is added.

We can add very temporary information to this temporary leaf cert without bloating the whole chain. This could be our version information, but also any tokens needed to prove access to something. For example: proving we have access to one or more networks, proving permissions using the biscuit system in the future, or proving access to certain secrets. Generally we can include any metadata that should be available when deciding whether to reject a connection.

CMCDragonkai commented 11 months ago

Renewing a certificate is not meant to be a big deal. Keys aren't being changed, just the cert itself. This means everything in the sigchain still works, and all node IDs are still the same. I think it's totally fine to expect our certs to be regularly renewed. In this case, as I mentioned before, the sigchain represents claimed properties, while the root cert represents "identity".

CMCDragonkai commented 10 months ago

There's some commentary about this in relation to the network status endpoint:

https://github.com/MatrixAI/Polykey-CLI/pull/94/files#r1454376892

But we will need RPC extensibility to allow the frontends (CLI, ...etc) to extend the RPC handlers to provide additional data that the library itself is not aware of.