MatrixAI / Polykey

Polykey Core Library
https://polykey.com
GNU General Public License v3.0

Redesign usage of CDSS with `VaultInternal` #305

Closed joshuakarp closed 2 years ago

joshuakarp commented 2 years ago

Specification

CDSS

Currently, VaultInternal uses the create-destroy pattern. We're considering refactoring this into the create-destroy-start-stop (CDSS) pattern instead. This would allow us to align function usage between VaultManager and VaultInternal as follows:

| VaultManager | VaultInternal       |
|--------------|---------------------|
| createVault  | createVaultInternal |
| openVault    | start               |
| closeVault   | stop                |
| destroyVault | destroy             |

The meat of the logic should now be in 2 static methods of VaultInternal, representing "smart asynchronous creators" of VaultInternal. These should be:

VaultInternal.createVaultInternal - creates a new vault from scratch, think `git init`
VaultInternal.cloneVaultInternal - clones a vault, think `git clone`

Creating a VaultInternal instance with new or existing state should use VaultInternal.createVaultInternal, while first-time cloning of a vault should use VaultInternal.cloneVaultInternal. All the isogit-related side-effects (client-side logic) should move into these static methods; the VaultManager would only manage any additional metadata.

The start method now has to set up internal state, specifically git state: it needs to initialize the repository and also perform the initial commit. However it has to be idempotent, so if these actions have already been done, it won't repeat them.

The stop method cannot stop the EFS, because the EFS lifecycle is handled outside by VaultManager.

The destroy method should then destroy the vault's internal state. However, additional destruction management would have to occur inside the VaultManager, which might mean it has to construct (and thus load into memory) the instance before it can destroy it.
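
As a rough sketch of the intended shape (the type aliases, parameters and method bodies below are illustrative assumptions, not the final implementation):

    // Sketch only: `VaultId` and `EFS` below are stand-ins for the real Polykey types.
    type VaultId = string;
    interface EFS {
      exists(path: string): Promise<boolean>;
      mkdir(path: string, options?: { recursive: boolean }): Promise<void>;
    }

    class VaultInternal {
      protected vaultId: VaultId;
      protected efs: EFS;

      // Smart asynchronous creator for new or existing state (think `git init`).
      public static async createVaultInternal(
        vaultId: VaultId,
        efs: EFS,
        fresh: boolean = false,
      ): Promise<VaultInternal> {
        const vault = new VaultInternal(vaultId, efs);
        await vault.start(fresh);
        return vault;
      }

      // Smart asynchronous creator for first-time cloning (think `git clone`).
      public static async cloneVaultInternal(
        vaultId: VaultId,
        efs: EFS,
      ): Promise<VaultInternal> {
        const vault = new VaultInternal(vaultId, efs);
        // the isogit clone would pre-seed the git state here, before starting
        await vault.start(false);
        return vault;
      }

      protected constructor(vaultId: VaultId, efs: EFS) {
        this.vaultId = vaultId;
        this.efs = efs;
      }

      // Idempotent: only initialises the repository and first commit if missing.
      public async start(fresh: boolean = false): Promise<void> {
        const gitDir = `${this.vaultId}/.git`;
        if (fresh) {
          // a real implementation would wipe any existing vault state here
        } else if (await this.efs.exists(gitDir)) {
          // existing state: just load it, nothing to initialise
          return;
        }
        await this.efs.mkdir(gitDir, { recursive: true });
        // git.init and the initial commit would follow here
      }

      // Does not stop the EFS; its lifecycle is owned by VaultManager.
      public async stop(): Promise<void> {}

      // Destroys only the vault's own internal state.
      public async destroy(): Promise<void> {}
    }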

The cloning and pulling code should work using withF and withG as used by NodeConnectionManager. Cloning and pulling will need access to the ACL domain to ensure that permissions are properly set. Because permission checking occurs inside VaultManager (not VaultInternal) and happens before VaultInternal functions are executed, this implies that VaultManager has access to the ACL and is bound to it, which would mean the meat of the logic discussed above should be in VaultManager.

Metadata

The VaultInternal now needs to maintain its own DB domain to store vault-related metadata. Two new properties are relevant:

  1. dirty - this is a Boolean that determines whether the vault is in a dirty state or not
  2. remote - if this contains a string nodeId:vaultId, then this vault should be considered a "remote" vault and must be immutable - see #309 and #253

The internal domain is better than storing a POJO in the VaultManager. That logic should be removed in favor of VaultInternal having its own internal domain, which is more efficient and more flexible.

Additional metadata can be added alongside the dirty boolean to enable us to fix #252. This will be important for ensuring internal consistency of our vaults in the presence of program crashes, inadvertent OS shutdowns, or the program simply being killed.
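
For illustration, the two keys might be handled against the vault's own DB domain roughly like this (the DB method signatures and key names here are assumptions for the sketch):

    // Sketch only: a minimal key-value shape standing in for the vault's DB domain.
    interface VaultDB {
      put(domain: Array<string>, key: string, value: unknown): Promise<void>;
      get<T>(domain: Array<string>, key: string): Promise<T | undefined>;
    }

    // Mark the vault dirty before a mutation, and clean again after a successful commit.
    async function setDirty(
      db: VaultDB,
      vaultDomain: Array<string>,
      dirty: boolean,
    ): Promise<void> {
      await db.put(vaultDomain, 'dirty', dirty);
    }

    // A defined remote (`nodeId:vaultId`) marks the vault as immutable locally.
    async function isImmutable(
      db: VaultDB,
      vaultDomain: Array<string>,
    ): Promise<boolean> {
      const remote = await db.get<string>(vaultDomain, 'remote');
      return remote != null;
    }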

Misc.

The implementations of the functions in VaultManager would internally call the respective functions above in VaultInternal. This would also allow us to add any extra functionality that needs to be separate from the VaultInternal functions into the VaultManager. For example, on creation of a vault, there are a few factors that need to be considered:

Conversely, we should remove any instances where we perform similar logic between the two corresponding functions, and ensure it's centralized in a single place.

Locking will be handled via two RWLock locks: one lock for the object map in VaultManager, and the other for the VaultInternal instance. The VaultManager RWLock will handle the lifecycle of the VaultInternal instance: write locking will be used for the creation/destruction lifecycle, and read locking will be used with withVaults when accessing the vaults. The VaultInternal locking will be used for the respective readF/G and writeF/G functions.
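
A minimal sketch of how the two locks compose (the lock API shape here is assumed for illustration):

    // Sketch only: a generic read/write lock shape is assumed for illustration.
    interface RWLock {
      read<T>(f: () => Promise<T>): Promise<T>;
      write<T>(f: () => Promise<T>): Promise<T>;
    }

    class VaultManagerLockingSketch {
      protected objectMapLock: RWLock;
      protected vaults: Map<string, { lock: RWLock; vault: unknown }> = new Map();

      constructor(objectMapLock: RWLock) {
        this.objectMapLock = objectMapLock;
      }

      // Creation and destruction of VaultInternal instances take the write lock.
      public async createVault(
        vaultId: string,
        make: () => Promise<{ lock: RWLock; vault: unknown }>,
      ): Promise<void> {
        await this.objectMapLock.write(async () => {
          this.vaults.set(vaultId, await make());
        });
      }

      // Accessing vaults takes the read lock; each VaultInternal's own lock is then
      // used by its readF/writeF functions while this read lock is held.
      public async withVaults<T>(
        vaultIds: Array<string>,
        f: (vaults: Array<unknown>) => Promise<T>,
      ): Promise<T> {
        return await this.objectMapLock.read(async () => {
          const vaults = vaultIds.map((id) => this.vaults.get(id)?.vault);
          return await f(vaults);
        });
      }
    }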

Git repository: anything to do with git instantiation for the vault could also be centralised within VaultInternal.createVaultInternal

To ensure the integrity of the vaultName information, we need to apply locking when reading and writing the VaultName -> VaultId mapping in the database. This ensures that when we check whether a vaultName already exists in the mapping, we can trust that this information is correct, and it avoids concurrency issues with vault creation and renaming.

Additional context

A note on usage of the CDSS pattern:

Tasks

  1. [x] Convert VaultInternal to using the CDSS async-init pattern.
    1. [x] create a createVaultInternal constructor for creating a VaultInternal instance on new or existing state.
    2. [x] create a cloneVaultInternal constructor for first time cloning of a vault.
    3. [x] all isogit-related side effects and creation logic should be within these creation methods. Ideally VaultManager shouldn't touch isogit at all.
    4. [x] start method handles the creation of any state. This should be idempotent. It also handles any DB domain creation.
    5. [x] destroy should destroy all VaultInternal state.
  2. [x] Cloning and pulling should use the with functions through the NodeConnectionManager.
  3. [x] permissions will need to be checked before cloning and pulling.
  4. [x] Move handling of metadata to inside of VaultInternal
    1. [x] Implement a dirty field for tracking if the git repo is in a dirty state.
    2. [x] Implement a remote field containing the string nodeId:vaultId for tracking the remote node and vault. This is also used to check if the vault is immutable.
    3. [x] Any other metadata needs to be moved here as well.
    4. [x] Add ability to VaultManager to access the metadata for checking if a vault exists and checking/updating vaultName.
  5. [x] Update VaultManager and VaultInternal to use RWLock locks.
    • [x] Use object map write locking for VaultInternal lifecycle.
    • [x] Use object map read locking for accessing VaultInternal contents.
    • [x] Use VaultInternal write/read locking for the read/write functions.
    • [x] VaultManager git handling functions should be using both read locks.
    • [x] VaultInternal.pullVault should respect the write lock.
    • [x] add locking for vaultName -> vaultId metadata.
  6. [x] Review and update tests for...
    • [x] VaultManager object map life-cycle.
    • [x] VaultManager object map locking.
    • [x] VaultInternal life-cycle.
    • [x] VaultInternal locking.
    • [x] git clone/pull permissions testing.
    • [x] concurrently creating vaults test.
    • [x] test for VaultInternal.pullVault locking
CMCDragonkai commented 2 years ago

Looking into this now.

From what I see, VaultInternal should have at least the static method called createVaultInternal and the destroy method.

However, because we have to differentiate destroying the in-memory instance from the on-disk state, destroy and stop both matter. This way, when we stop the VaultManager, what's most important is that we call stop on the VaultInternal. The VaultManager itself can have additional cleanup operations when this occurs.

So having both VaultManager and VaultInternal have CDSS just enables both sides to manage their side-effect/stateful responsibilities.

However there is one issue with the openVault call. One cannot open a vault that isn't already created, because you cannot start the instance. Therefore one might ask what the purpose of openVault is. I suggest it would be similar to a sort of getter function, but one that should throw an exception.

This also means that there is a public interface to vaults, and this is returned to the caller of VaultManager.

But as we have evolved the design of the vault manager, we also want vaults to be used within a with context, so that transactions can take place during mutation of the vaults. This is a more controlled way of manipulating vaults.

CMCDragonkai commented 2 years ago

I've also moved the new RWLock into #266, so that it can be used for the vault map as well.

CMCDragonkai commented 2 years ago

Currently the fresh parameter is essential to VaultInternal, unlike in other domains where it is an optional property. It is what determines whether the VaultInternal is constructed using git.init, which would try to initialise the git state, or whether it picks up existing state internally.

The fresh flag was meant to mean starting anew, wiping out any existing state there.

Whether to load existing state or not is meant to be automatic, which is to say that VaultInternal upon starting should check the EFS's state for the vault directory and git directory, and then just load the relevant things into memory; only if these did not exist would it then do the git.init.

The only issue is that clone operations, which create state, would mean we have to "pre-seed" the git state prior to running VaultInternal.createVaultInternal so that it is aware the state already exists. I think this is how it is currently working anyway with cloneVault.

Moving to CDSS would also mean that the deliberation over fresh goes into start instead of create.

All of this means that isogit's functionality will be used in VaultManager and VaultInternal, whereas before I thought that git would be isolated to VaultInternal.

It is possible to move git.clone into VaultInternal as well, by creating a second static creation method. These static creation methods are functions that sit outside the class instance; this would be similar to how C++ and other languages have overloaded/multiple constructors. So cloning would just be a separate constructor, and would be an asynchronous method too.

So we may have:

VaultInternal.createVaultInternal()
VaultInternal.cloneVaultInternal()

As two ways of constructing a VaultInternal: one creates it from scratch, which equates to git init, and the other clones from elsewhere, which equates to git clone.

CMCDragonkai commented 2 years ago

This also means EFS state is considered a dependency of VaultInternal, so git.init and git.clone will be placed into createVaultInternal and cloneVaultInternal respectively, and therefore fresh is applied there in particular. This probably means the fresh parameter won't exist in start at all.

tegefaulkes commented 2 years ago

I have updated the spec here. Any discussion of this aspect of the PR should happen here.

tegefaulkes commented 2 years ago

After moving the metadata inside of VaultInternal, the only metadata the VaultManager would need is a mapping of VaultName -> VaultId. This is used by the methods

So instead of accessing the VaultInternal's metadata from the VaultManager, the VaultManager can just maintain a VaultName -> VaultId mapping inside its DB. We can store the VaultName inside the VaultInternal's metadata, but I don't think it's really needed.

Alternatively, if we feel that accessing the VaultInternal's metadata directly is needed, we can replicate how js-encryptedfs works by using the sublevel prefixer method. I will need to explore this some more and test it.

CMCDragonkai commented 2 years ago

Because each VaultInternal has its own DB level, this means it's a prefix.

If you want to know whether a "level" is filled or not, you can use DB.count.

This can return the number of fields in the level. You can use this to know whether a VaultInternal exists or not.

However the VM also needs to be able to look up VaultId by VaultName. Ideally this should be solved by #188 and https://github.com/MatrixAI/js-db/issues/1. But in the meantime you would need a DB level that maps VaultName -> VaultId.
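
For example, an existence check along these lines (the exact count signature is an assumption):

    // Sketch only: assumes a count method that takes the level/domain prefix
    // and returns the number of entries under it.
    async function vaultExistsInDb(
      db: { count(level: Array<string>): Promise<number> },
      vaultLevel: Array<string>,
    ): Promise<boolean> {
      return (await db.count(vaultLevel)) > 0;
    }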

CMCDragonkai commented 2 years ago

Make sure you're locking these operations though!

tegefaulkes commented 2 years ago

I have a small problem. We want to do all the state creation during start() as per the spec, and we need to create the metadata inside the DB at that point. So we need to know what the vaultName and the remote details are at that time.

  1. I could add them as parameters to the start method, but that seems like a bad idea since we won't need to pass them unless we specify fresh, and I don't think we want the ability to set remote details and name when we start fresh.

  2. I can add them as properties to the VaultInternal but then we need to update the values in two places. This only applies when we update the name and remote details.

I think option 2 works better here. But can you think of a better option, @CMCDragonkai?

tegefaulkes commented 2 years ago

Do we even need to clear the whole metadata when starting fresh? Seems like we can just keep the vault name in the metadata while clearing everything else.

CMCDragonkai commented 2 years ago

If --fresh, it really means fresh. So you should clear the data.

You do need to recreate state in start; the problem is that the DB level will not be set up properly if done in the create* static methods.

Then with start you'd have all these parameters like vaultName and such. These should have some meaning behind them, and the various setup methods inside the VaultInternal should do the rest similar to KeyManager.

tegefaulkes commented 2 years ago

So we check for permissions when a node receives a pull or clone request. I think this means we should be checking the permissions in the vaultsGitInfoGet.ts and vaultsGitPackGet.ts services.

I can see one problem with how the permissions are checked though. To confirm that we have permission for a vault, we provide our NodeId as part of the request message or metadata. But since NodeIds are public information, it seems possible to gain access to a vault's information so long as we can provide a NodeId with the correct permissions.

It seems that to prove we can access a vault we only have to provide something akin to a username, when we should be using some secret or token instead: either some proof that we have the access, or proof that we are the NodeId in question.

CMCDragonkai commented 2 years ago

No, the NodeId isn't just supplied to the remote agent. It's checked via MTLS. So if the MTLS is verified, then the NodeId can be trusted.

But remember you're not taking the NodeId from the GRPC request message. You're taking it from the reverse proxy.

CMCDragonkai commented 2 years ago

I think the original idea was that the reverse proxy adds the NodeId to the request metadata that is then sent to the GRPCServerAgent. A sort of Forwarded-For header that can be trusted. https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Forwarded-For and https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Forwarded.

However the reverse proxy doesn't "understand" GRPC frames atm, so I guess it's not able to do this.

Instead, what we had to do was pass the reverse proxy to the GRPCServerAgent, and allow the handlers to easily acquire the currently authenticated connection information.

CMCDragonkai commented 2 years ago

Right now our src/client/utils/utils.ts has a sessionInterceptor.

We can create one for src/agent/utils.ts that does the same thing, but this time fetches information from the reverse proxy and then adds that metadata into the proxied request.

CMCDragonkai commented 2 years ago

@tegefaulkes refer to this diagram https://excalidraw.com/#room=57543403694c95eba2e9,5qJFVQlpWc0vegUgEAQIpA which is regarding the session token usage and https://github.com/MatrixAI/js-polykey/wiki/network. It is implemented through the sessionInterceptor. I think we need to add a new one called connectionInfoInterceptor. It should be calling ReverseProxy.getConnectionInfoByEgress() automatically and augmenting the request message leading metadata. The new metadata should use Forwarded-For or related headers as it's basically the same idea as in HTTP1.1 reverse proxies.

CMCDragonkai commented 2 years ago

I just remembered that these are "client interceptors", they are not interceptors applied on the server side: https://github.com/MatrixAI/js-polykey/pull/296#issuecomment-989740159.

GRPC upstream has discussion about adding server interceptors: https://github.com/grpc/grpc-node/issues/419.

This is why the authenticator function exists. It plugs a function into the handlers for the handlers to authenticate easily.

Then the solution is to adapt the authenticator in client/utils to agent/utils instead. Because what you want here is a function that will give you the connection information.

type ConnectionInfo = {
  nodeId: NodeId;                    // NodeId of the peer, verified via MTLS
  certificates: Array<Certificate>;  // certificate chain presented on the connection
  egressHost: Host;
  egressPort: Port;
  ingressHost: Host;
  ingressPort: Port;
};
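
A handler-side helper could then look roughly like this, using the ConnectionInfo type above (the helper name and the argument shape of getConnectionInfoByEgress are assumptions for the sketch):

    // Sketch only: hypothetical agent-side helper that resolves the authenticated
    // connection info for an incoming call from the reverse proxy.
    function connectionInfoGet(
      revProxy: {
        getConnectionInfoByEgress(host: string, port: number): ConnectionInfo | undefined;
      },
      egressHost: string,
      egressPort: number,
    ): ConnectionInfo {
      const info = revProxy.getConnectionInfoByEgress(egressHost, egressPort);
      if (info == null) {
        throw new Error('No authenticated connection found for this call');
      }
      return info;
    }
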
CMCDragonkai commented 2 years ago

@tegefaulkes the discussion about this authentication problem should be in a different/new issue though.

tegefaulkes commented 2 years ago

I will create a new issue for this.

tegefaulkes commented 2 years ago

New issue created here #342

tegefaulkes commented 2 years ago

It seems unnecessary for the NodeConnectionManager to contain the revProxy. It's only used for the method NCM.holePunchReverse(), which just wraps revProxy.openConnection(), and is only called in the src/agent/service/nodesHolePunchMessageSend.ts service. I feel the logic should be inlined within that service and we can remove revProxy from the NodeConnectionManager altogether.

tegefaulkes commented 2 years ago

Do we want to update the locking in VaultInternal to use a RWLock as well?

CMCDragonkai commented 2 years ago

The same RWLock instance can be shared between VaultManager and VaultInternal.

CMCDragonkai commented 2 years ago

If you're going to have 2 locks, then the duties must be separated here:

VaultManager RWLock:

Vault RWLock (this is now dependent on the VaultManager RWLock read: these locks can only be held if the read lock is also held):

This way, any GRPC service handlers or domain methods can still access the vault instance concurrently without blocking each other. And blocking only occurs when you are creating or destroying the vault.

This means that if any context is holding a Vault RWLock, it is necessarily also holding a VaultManager RWLock read. This would then block any creation or destruction routines of the vault in VaultManager.

tegefaulkes commented 2 years ago

This is mostly addressed now. Just blocked by #342

tegefaulkes commented 2 years ago

When checking the permissions for a vault, I think we don't want to reveal the existence of a vault, only that we don't have permissions for a vault that may or may not exist. That is to say, where permissions are concerned, we should only throw non-existing-vault errors after checking and throwing any permission errors. So if the vault doesn't exist, we should only ever see permission errors. This will hide the existence of vaults unless a node has permissions.

This assumes that if a vault doesn't exist then no permissions exist for it in the ACL. So I'll need to double-check that the ACL/VaultManager cleans up said permissions when we destroy or refresh vaults.
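
A sketch of that ordering, with hypothetical helper and error names:

    // Sketch only: the point is the ordering, with the permission check happening first,
    // so a missing vault is indistinguishable from one the requester cannot access.
    async function checkVaultAccess(
      acl: { hasVaultPerm(nodeId: string, vaultId: string): Promise<boolean> },
      vaults: { exists(vaultId: string): Promise<boolean> },
      nodeId: string,
      vaultId: string,
    ): Promise<void> {
      if (!(await acl.hasVaultPerm(nodeId, vaultId))) {
        throw new Error('permission denied'); // stands in for the real permission error
      }
      if (!(await vaults.exists(vaultId))) {
        throw new Error('vault undefined'); // only reachable with valid permissions
      }
    }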

tegefaulkes commented 2 years ago

Small problem I'll have to look into. It seems that there is no real locking when creating a new vault: it checks if a vault already exists with that name, but nothing will stop us from concurrently creating two vaults with the same name. That will be a problem. Adding a note to the spec.

I think the VaultManager will need to hold a lock on itself when creating new vaults, until it adds the name to the name->id mapping in its database. Likely the lock will be needed every time we write and read that mapping.

CMCDragonkai commented 2 years ago

That's what the object map pattern is for; it's supposed to prevent the situation where 2 vaults get created with the same name... etc.

tegefaulkes commented 2 years ago

Problem is, the vaultId is only created during createVault; it can't exist in the map before that. All other lifecycle operations such as open, close and destroy respect the object map locking.

tegefaulkes commented 2 years ago

A solution to the concurrent vault creation for now is to have an RWLock for the vaultName -> VaultId map in the database. This means that vault creation will be serialised. This is a temporary fix until we get DB indexing working down the line (MatrixAI/js-db#1). Note that all other lifecycle operations will use the object map locking; this lock is specifically to deal with the problem of updating and reading vaultNames at the same time.
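
A rough sketch of that serialisation (the lock shape and the in-memory map stand in for the real RWLock and DB level):

    // Sketch only: serialise name -> id creation under a single write lock.
    async function createVaultSerialised(
      namesLock: { write<T>(f: () => Promise<T>): Promise<T> },
      names: Map<string, string>,
      vaultName: string,
      generateVaultId: () => string,
    ): Promise<string> {
      return await namesLock.write(async () => {
        if (names.has(vaultName)) {
          throw new Error(`vault name ${vaultName} already exists`);
        }
        const vaultId = generateVaultId();
        names.set(vaultName, vaultId);
        return vaultId;
      });
    }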

I'll add this into the spec in a moment.

tegefaulkes commented 2 years ago

I'm getting this very odd bug. When creating 3 or more vaults at a time we get an ErrorEncryptedFSNotRunning error from the EFS. This is happening during the VaultInternal start method when doing:

    // `running` is a status symbol indicating whether the instance is currently running
    console.log('efs', this.efs[running]);
    this.efsVault = await this.efs.chroot(this.vaultDataDir);
    console.log('efsvault', this.efsVault[running]);

We end up with the log

INFO:VaultInternal:Creating VaultInternal - z4yVwcUzsNm1pPQpUZNHjhc
INFO:VaultInternal:Starting VaultInternal - z4yVwcUzsNm1pPQpUZNHjhc
INFO:VaultInternal:Creating VaultInternal - z6DSRigZenUGF7mfub5BvdP
INFO:VaultInternal:Starting VaultInternal - z6DSRigZenUGF7mfub5BvdP
INFO:VaultInternal:Creating VaultInternal - z5XnzXrbxvmJqzxTg3pFYQo
INFO:VaultInternal:Starting VaultInternal - z5XnzXrbxvmJqzxTg3pFYQo
INFO:VaultInternal:Creating VaultInternal - zKZABTgSG8qUrREnipB4ngK
INFO:VaultInternal:Starting VaultInternal - zKZABTgSG8qUrREnipB4ngK
INFO:VaultInternal:Creating VaultInternal - zLdfm3YMNmXSEXeSZzs7B82
INFO:VaultInternal:Starting VaultInternal - zLdfm3YMNmXSEXeSZzs7B82
  console.log
    efs true

      at Object.start (src/vaults/VaultInternal.ts:274:13)
          at runMicrotasks (<anonymous>)

INFO:DB:Stopping DB

ErrorEncryptedFSNotRunning

Why is this happening? The efs is running just before we call efs.chroot so why is that failing with ErrorEncryptedFSNotRunning? I'll have a look at the efs source.

CMCDragonkai commented 2 years ago

The debugger might help in this case too.

tegefaulkes commented 2 years ago

Currently we have an error called ErrorVaultImmutable, which is thrown when trying to mutate a vault that was cloned. Basically it means we can't mutate a vault whose remote is defined. Conversely, we need an error for the complement, where we are trying to pull a vault but no remote is defined. To give these errors some symmetry, I'm renaming ErrorVaultImmutable to ErrorVaultRemoteDefined and adding ErrorVaultRemoteUndefined.
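
Roughly, as a sketch of the intended checks (the assert helpers are hypothetical):

    // Sketch only: symmetric checks around the `remote` metadata field.
    class ErrorVaultRemoteDefined extends Error {}
    class ErrorVaultRemoteUndefined extends Error {}

    // Mutation is only allowed when no remote is defined (the vault is not a clone).
    function assertMutable(remote: string | undefined): void {
      if (remote != null) throw new ErrorVaultRemoteDefined();
    }

    // Pulling is only allowed when a remote is defined to pull from.
    function assertPullable(remote: string | undefined): void {
      if (remote == null) throw new ErrorVaultRemoteUndefined();
    }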

tegefaulkes commented 2 years ago

This is mostly done for now. Some of the remaining tests require functioning vault mutation and committing.

tegefaulkes commented 2 years ago

Finished up the tests now.