MatrixAI / Polykey

Polykey Core Library
https://polykey.com
GNU General Public License v3.0

Make VaultsSecretsMkdir support specifying multiple secret paths #819

Closed aryanjassal closed 1 month ago

aryanjassal commented 1 month ago

Description

Following the behaviour of UNIX's mkdir, VaultsSecretsMkdir is being tweaked to behave more seamlessly.

The VaultsSecretsMkdir handler now supports creating multiple directories across multiple vaults, all under a single commit. If a directory fails to be created, the whole command no longer fails; an error is returned for that path and the remaining directories are still processed.
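
As an illustration of this error semantics, here is a minimal sketch with a local generator standing in for the duplex handler; the message and result shapes (`MkdirMessage`, `MkdirResult`) are hypothetical, not Polykey's actual types.

```ts
type MkdirMessage = { nameOrId: string; dirName: string };
type MkdirResult =
  | { type: 'success'; dirName: string }
  | { type: 'error'; dirName: string; reason: string };

function* vaultsSecretsMkdirSketch(
  messages: Iterable<MkdirMessage>,
): Generator<MkdirResult> {
  const created = new Set<string>();
  for (const { nameOrId, dirName } of messages) {
    const key = `${nameOrId}:${dirName}`;
    if (created.has(key)) {
      // Report the failure and keep processing the remaining messages.
      yield { type: 'error', dirName, reason: 'directory already exists' };
    } else {
      created.add(key);
      yield { type: 'success', dirName };
    }
  }
}

for (const result of vaultsSecretsMkdirSketch([
  { nameOrId: 'v1', dirName: 'dir1' },
  { nameOrId: 'v1', dirName: 'dir1' }, // duplicate fails, stream continues
  { nameOrId: 'v2', dirName: 'dir2' },
])) {
  console.log(result);
}
```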

Issues Fixed

Tasks

Final checklist

linear[bot] commented 1 month ago

ENG-359 Implement `secrets mkdir` command

aryanjassal commented 1 month ago

you could do the grouping on the CLI side and then we can have end to end streaming for this handler.

I believe this was discussed when we were talking about managing this for the secrets rm command. There, you mentioned that you prefer having all the logic in one place, and I moved the grouping logic to the handler side.

If we want to update this, then we can, but it would be inconsistent with the behaviour of VaultsSecretsRemove. The DuplexHandler also expects each streamed object to be "self-contained", so we would need to pass the vault information via the metadata, which would be weird usage-wise. I already think that using the metadata field for options like recursive is weird enough, as it is usually used for auth and provides no autocompletion/LSP integration; using it for the vault name might not be best for development or maintenance, given the lack of intrinsic documentation.

On the other hand, grouping on the CLI side would let us easily implement local file system integration: the client can detect the absence of a vault path, infer that the operation targets the user's file system, and perform the command locally. With the current approach, we would instead need to filter out paths which don't contain a vault path, then send over the rest to be looped over and grouped in the handler.

While both approaches have their own advantages and disadvantages, we can work with either. We need to figure out a "standard" and stick with it for further commands.

tegefaulkes commented 1 month ago

By grouping I really mean sorting. So we apply the constraint that the streamed directory paths must be sorted, which means directories within a single vault are grouped together. Then we can easily create all the directories for a single vault at once within the same writeF. But all that is moot, since creating directories doesn't actually create commits within the version history, so it doesn't even need to be ordered.
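
A tiny sketch of that sorting constraint (the shapes here are hypothetical, not the actual message types): a stable sort by vault name makes each vault's directories contiguous, so a handler reading the stream can batch each run into one writeF-style call.

```ts
type SecretPath = { vault: string; dir: string };

const paths: SecretPath[] = [
  { vault: 'v1', dir: 'dir1' },
  { vault: 'v2', dir: 'dir2' },
  { vault: 'v1', dir: 'dir3' },
];

// Stable sort by vault name; the per-vault order of arrival is preserved.
paths.sort((a, b) => a.vault.localeCompare(b.vault));
// => v1:dir1, v1:dir3, v2:dir2 -- each vault's directories are now contiguous.
```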

I believe this was discussed when we were talking about managing this for the secrets rm command. There, you mentioned that you prefer having all the logic in one place, and I moved the grouping logic to the handler side.

That was more about avoiding composing RPC calls within the CLI command if possible. We prefer to do our stuff within a single RPC call, but that doesn't mean we can't do any CLI-side processing.

Also there's no reason to pass the vault information in the metadata. You can keep it in the message.

aryanjassal commented 1 month ago

But all that is moot since creating directories doesn't actually create commits within the version history so it doesn't even need to be ordered.

As such, I guess I can revert the changes I made to vaultOps.mkdir as ensuring a single write transaction isn't necessary anymore. This would make the command support end-to-end streaming.

Another concern I have is that on UNIX, the touch command also works like this, touching files in the order they were provided. If we perform grouping, it will break UNIX's ordering and might yield unexpected behaviour. On the other hand, if we don't group the commands, the vault will see multiple commits for creating multiple secrets, which isn't optimal. How should the touch command handle this?

That was more about avoiding composing RPC calls within the CLI command if possible. We prefer to do our stuff within a single RPC call but that doesn't mean we can't do any CLI side processing.

I see. However, I still have the question about the "standard" for this. What do we follow? Grouping on the client side, and the handler expects the inputs to be sorted by vaults or doing it on the handler side as we currently are?

In my opinion, we should do the handling on the client side. This would make the most sense, as it would allow the most freedom. For example, in the touch command, the user can specify multiple touched files within the same vault (which would be the use case for most users), and the handler will group the sorted input together into a single writeF command. When the handler encounters a different vault name in the input stream, it can start a new write transaction for the new vault. This matches the behaviour of the UNIX commands very closely, which might be a good tradeoff between consistency and reducing commit redundancy.

[aryanj@matrix-34xx:~]$ pk secrets touch v1:file1 v1:file2 v2:file3
 > making file1 and file2 in v1
 > making file3 in v2

[aryanj@matrix-34xx:~]$ pk secrets touch v1:file1 v1:file2 v2:file3 v1:file4
 > making file1 and file2 in v1
 > making file3 in v2
 > making file4 in v1
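
A sketch of the consecutive-grouping rule these transcripts imply (all names here are hypothetical): walk the stream in order and start a new group, i.e. a new write transaction, whenever the vault name changes.

```ts
type TouchMessage = { vault: string; file: string };
type TouchGroup = { vault: string; files: string[] };

function groupConsecutive(messages: TouchMessage[]): TouchGroup[] {
  const groups: TouchGroup[] = [];
  for (const { vault, file } of messages) {
    const last = groups[groups.length - 1];
    if (last !== undefined && last.vault === vault) {
      last.files.push(file); // same vault as the previous message: reuse group
    } else {
      groups.push({ vault, files: [file] }); // vault changed: new transaction
    }
  }
  return groups;
}

// Mirrors the second transcript above: v1 appears twice, so it commits twice.
console.log(groupConsecutive([
  { vault: 'v1', file: 'file1' },
  { vault: 'v1', file: 'file2' },
  { vault: 'v2', file: 'file3' },
  { vault: 'v1', file: 'file4' },
]));
// => [ { v1: [file1, file2] }, { v2: [file3] }, { v1: [file4] } ]
```
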
CMCDragonkai commented 1 month ago

As such, I guess I can revert the changes I made to vaultOps.mkdir as ensuring a single write transaction isn't necessary anymore. This would make the command support end-to-end streaming.

Did you consider that CLI and agent RPC are concurrent? You have to test under conditions where multiple concurrent calls are made. That's why transactions are still important sometimes.

CMCDragonkai commented 1 month ago

That was more about avoiding composing RPC calls within the CLI command if possible. We prefer to do our stuff within a single RPC call but that doesn't mean we can't do any CLI side processing.

So there is a thing called promise pipelining, which makes RPC composition more efficient.

We don't support it just yet, but it would be possible to do so under js-rpc. See https://capnproto.org/rpc.html and https://gist.github.com/crabmusket/6a63eec5aa9d92649f66614f09777c6a.

Each call right now opens up a stream, but the underlying transport is flexible enough to support this concept.

It also aligns with our ability to do RPC conversations efficiently.

Generally speaking this is kind of important for any sort of functional composition across RPC boundaries.
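
For concreteness, a self-contained sketch of the pipelining idea (this is not the js-rpc or Cap'n Proto API; every name here is hypothetical): dependent calls are queued with placeholder references and shipped as one batch, and the server resolves the references locally, so the dependency costs no extra round trip.

```ts
type Call = { method: string; args: Record<string, unknown> };
type Ref = { __ref: number }; // placeholder for a prior call's result

const isRef = (v: unknown): v is Ref =>
  typeof v === 'object' && v !== null && '__ref' in v;

// Server-side handlers (stand-ins for actual vault operations).
const handlers: Record<string, (args: Record<string, unknown>) => unknown> = {
  vaultsGetId: (args) => `id-of-${args.name}`,
  vaultsMkdir: (args) => `made ${args.path} in ${args.vaultId}`,
};

// One "round trip": run the whole batch, substituting refs with results.
function serverExecute(batch: Call[]): unknown[] {
  const results: unknown[] = [];
  for (const call of batch) {
    const args = Object.fromEntries(
      Object.entries(call.args).map(
        ([k, v]) => [k, isRef(v) ? results[v.__ref] : v] as [string, unknown],
      ),
    );
    results.push(handlers[call.method](args));
  }
  return results;
}

// Client-side: build a pipeline of dependent calls and send it once.
const batch: Call[] = [];
function pipeline(method: string, args: Record<string, unknown>): Ref {
  batch.push({ method, args });
  return { __ref: batch.length - 1 };
}

const vaultId = pipeline('vaultsGetId', { name: 'v1' });
pipeline('vaultsMkdir', { vaultId, path: 'dir1' });
console.log(serverExecute(batch));
// => [ 'id-of-v1', 'made dir1 in id-of-v1' ] -- one round trip, two calls
```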

CMCDragonkai commented 1 month ago

In my opinion, we should do the handling on the client side.

Client IO vs Agent IO. This is an important design decision. In the case of the PK CLI, you must assume that the agent could be remote. That means that if I touch a local file, only the CLI client can be the one doing that job; we cannot expect the agent to do that work, because it might not be on the same filesystem.

CMCDragonkai commented 1 month ago

Another concern I have is that on UNIX, the touch command also works like this, touching files in the order they were provided. If we perform grouping, it will break UNIX's ordering and might yield unexpected behaviour. On the other hand, if we don't group the commands, the vault will see multiple commits for creating multiple secrets, which isn't optimal. How should the touch command handle this?

This requires understanding what an atomic commit is.

The right abstraction here is to define what a "Vault Operation" means. That's why we have a vault ops abstraction.

1 vault op should equal 1 commit.

A vault op however doesn't mean you cannot create multiple files or delete multiple files together.

In fact it appears the right way to do this is to ensure that a vault operation represents a transaction. And that is what a commit is.

Pass the transaction context through however many vault operations, and then when it commits, that's a single commit.

This allows you to do multiple touch operations that are combined together into a single commit, using the same transaction abstraction that we use in js-db.

In fact this is possible across RPC boundaries as well: you create a stream, and as long as the stream is kept open, it represents the transaction context for the commit. Thus even across the RPC boundary you can create a transaction.
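
A minimal sketch of that shape (VaultTx and withVaultTx are hypothetical names, not Polykey's vault ops API): operations stage changes against a shared transaction context, and the commit happens exactly once when the context closes.

```ts
class VaultTx {
  public readonly staged: string[] = [];
  touch(path: string): void {
    this.staged.push(path); // stage a change; nothing is committed yet
  }
}

// Run any number of operations against one transaction; commit exactly once.
async function withVaultTx(
  vault: string,
  f: (tx: VaultTx) => Promise<void>,
): Promise<void> {
  const tx = new VaultTx();
  await f(tx); // all operations inside share this transaction context
  console.log(`commit to ${vault}: [${tx.staged.join(', ')}]`); // one commit
}

async function main(): Promise<void> {
  // Two touches, one commit -- the open context is the transaction.
  await withVaultTx('v1', async (tx) => {
    tx.touch('file1');
    tx.touch('file2');
  });
}

void main();
```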

The other issue is concurrent commits. In the case of js-db we use a combination of OCC (optimistic concurrency control) and PCC (pessimistic concurrency control). OCC is the default, and you have to elect into PCC.

The design of git means that you can't really do OCC, and you're not really taking snapshots of vault state. You're always getting the full state of the vault at any time to make your commit, and a commit just captures whatever state is still dirty.

To simplify this you could state that a vault op must be PCC, forcing a vault-level lock on commit sequencing. This isn't technically required by git: it does allow one to make separate edits to shared files within a repository, and you can in fact commit them separately. However, we don't really do this, because we track the files that we need to commit anyway, and each vault operation takes full mutable control of the vault.
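
A sketch of what vault-level PCC could look like (hypothetical; this is not Polykey's actual locking code): a per-vault promise chain that sequences commits, so two concurrent vault ops on the same vault never interleave.

```ts
const vaultLocks = new Map<string, Promise<void>>();

// Queue f behind whatever is already running against this vault.
function withVaultLock<T>(vault: string, f: () => Promise<T>): Promise<T> {
  const prev = vaultLocks.get(vault) ?? Promise.resolve();
  const run = prev.then(f);
  // Park a settled continuation so one failure doesn't poison the chain.
  vaultLocks.set(vault, run.then(() => undefined, () => undefined));
  return run;
}

// Two concurrent commits against 'v1' now run strictly one after the other.
void withVaultLock('v1', async () => console.log('commit A'));
void withVaultLock('v1', async () => console.log('commit B'));
```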

If this aspect hasn't been specced out, you need to spec out the trade-offs on the approach.

aryanjassal commented 1 month ago

Another concern I have is that on UNIX, the touch command also works like this, touching files in the order they were provided. If we perform grouping, it will break UNIX's ordering and might yield unexpected behaviour. On the other hand, if we don't group the commands, the vault will see multiple commits for creating multiple secrets, which isn't optimal. How should the touch command handle this?

Brian and I had a small discussion regarding this. We were only able to brainstorm ideas, but his idea was the following.

Instead of using the existing writeF commands, we can rely on manually acquiring resource locks. When we encounter a new vault name, we acquire a writable transaction for that vault, then make the file in the vault. This ensures that the files are created in order and that they all come under a single transaction. We can change the behaviour of writeF to return the resource lock instead of managing it internally, which lets us manage the resource manually, unlocking it once the incoming stream of file names has ended. This approach allows us to adhere to both requirements, and also lets us filter for file paths pointing to the local file system.
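
A sketch of this lock-handover idea (acquireWrite and the shapes below are hypothetical stand-ins for a writeF variant that returns the resource lock): the handler keeps one writable transaction open while consecutive messages target the same vault, and commits when the vault changes or the stream ends.

```ts
type TouchStream = AsyncIterable<{ vault: string; file: string }>;
type WriteLock = { vault: string; release: () => void };

// Hypothetical stand-in for a writeF variant that hands back its lock.
function acquireWrite(vault: string): WriteLock {
  console.log(`begin transaction on ${vault}`);
  return { vault, release: () => console.log(`commit ${vault}`) };
}

async function handle(stream: TouchStream): Promise<void> {
  let lock: WriteLock | undefined;
  for await (const { vault, file } of stream) {
    if (lock !== undefined && lock.vault !== vault) {
      lock.release(); // vault changed: commit the previous transaction
      lock = undefined;
    }
    lock = lock ?? acquireWrite(vault);
    console.log(`touch ${file} in ${vault}`); // create the file under the lock
  }
  lock?.release(); // stream ended: commit whatever is still open
}

async function* input(): AsyncGenerator<{ vault: string; file: string }> {
  yield { vault: 'v1', file: 'file1' };
  yield { vault: 'v1', file: 'file2' };
  yield { vault: 'v2', file: 'file3' };
}

void handle(input()); // => one commit for v1 (file1, file2), one for v2
```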

This approach will entail modifications to previous commands like the VaultsSecretsRemove handler, as that groups the paths by vault name to delete all the files in a single transaction, which is not a 1:1 replica of UNIX's rm behaviour.

We will discuss this more and expand on this idea in a separate meeting today.

aryanjassal commented 1 month ago

During development, I have noticed that some tests take exceptionally long to run. The longest-running test by far is client/vaults.test.ts, taking approximately 300 seconds when testing the entire file. Brian mentioned that all the client tests needed to be collapsed into a single file, because otherwise they would cause memory leaks and crash the CI.

There is another test file which also takes quite long to run: vaults/VaultOps.test.ts takes over 100 seconds to run through all its tests. As only tests in separate files can be run in parallel, I believe the tests need to be separated into smaller chunks.

Is there an issue where we are tracking this? Or should I make a new issue to track this?

aryanjassal commented 1 month ago

All the tasks have been completed for this PR, and all the reviews are approving. Merging.

CMCDragonkai commented 1 month ago

Brian mentioned that all the client tests needed to be collapsed into a single file, because otherwise they would cause memory leaks and crash the CI.

Hmm, I'm not sure why the client tests have to be in a single file? It seems better isolation is needed, but it's also possibly due to the concurrent test runner; maybe there's a way to share resources between the client tests, or group them up in some way.

tegefaulkes commented 1 month ago

At the time, they were moved into a single test file because the CI was running out of memory when they were all separated; it was a quick fix.

I suspect the memory leak is some part of the networking code keeping a reference in the global scope, which fails to clean up properly between tests. We'll have to fix that before splitting up the tests again.