Vault and File Schema

CMCDragonkai commented 3 years ago

Specification

TBD

Additional context

https://gitlab.com/MatrixAI/Engineering/Polykey/polykey-design/-/issues/44
https://gitlab.com/MatrixAI/Engineering/Polykey/polykey-design/-/issues/40
76
4
discussion from the vaults refactoring MR https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/205#note_689647018

Tasks

Complete discussion and specification
Develop the JSON structure of schema (any optional parameters, or just basic for now?)
Apply validation logic of schema to the vault contents (when adding secrets)
Integrate into creation of vault
Integrate into a later call (applySchema)

TBD

joshuakarp commented 2 years ago

I've been writing up some of my own thoughts on vault and file schemas in our MR for vaults refactoring https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/205#note_689647018. I'll synthesise my thoughts, and bring them into here for discussion.

joshuakarp commented 2 years ago

When considering vault schemas, I've been thinking about what the "intention" of a vault is. I've thought of a few different approaches to this:

1. A "relational database"-like structure

We store secrets of a specific structure, and enforce that all secrets within this vault follow this specific structure (with the possibility for optional fields).

This would mean that the structure of the secret itself is dependent on the schema of the vault. That is, the individual components of some composite secret are structured at the vault level.

For example, suppose we had a vault schema represented like a JSON as follows:

{
  "label": {
    "/mediatype": "text/plain"
  },
  "url": {
    "/mediatype": "text/plain"
  },
  "username": {
    "/mediatype": "text/plain"
  },
  "password": {
    "/mediatype": "text/plain"
  },
  "note": {
    "/mediatype": "text/plain"
  }
}

Then, with this relational database-like structure of the vault, our vault would appear as follows:

label	url	username	password	note
amazon	amazon.com.au	user1	password1	my amazon login
twitter	twitter.com.au	user1	password1	my twitter login
...	...	...	...	...
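
As a rough illustration of option 1, here is a minimal TypeScript sketch of checking one row (one composite secret) against a vault-level schema. The names FieldSpec, RowSchema, checkRow and the "/optional" attribute are all hypothetical; the thread mentions optional fields but does not define a syntax for them.

```typescript
// Hypothetical names: the thread defines none of these types or helpers.
type FieldSpec = { "/mediatype": string; "/optional"?: boolean };
type RowSchema = { [field: string]: FieldSpec };
type SecretRow = { [field: string]: string };

const loginRowSchema: RowSchema = {
  label: { "/mediatype": "text/plain" },
  url: { "/mediatype": "text/plain" },
  username: { "/mediatype": "text/plain" },
  password: { "/mediatype": "text/plain" },
  // Assumption: "/optional" is an invented attribute for optional fields.
  note: { "/mediatype": "text/plain", "/optional": true },
};

function hasOwn(obj: object, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(obj, key);
}

// Returns a list of problems; an empty list means the row conforms.
function checkRow(schema: RowSchema, row: SecretRow): string[] {
  const problems: string[] = [];
  // Every non-optional field of the schema must be present in the row.
  for (const field of Object.keys(schema)) {
    if (!hasOwn(row, field) && schema[field]["/optional"] !== true) {
      problems.push("missing field: " + field);
    }
  }
  // The row must not carry fields the schema does not know about.
  for (const field of Object.keys(row)) {
    if (!hasOwn(schema, field)) {
      problems.push("unknown field: " + field);
    }
  }
  return problems;
}
```

Under this sketch, every secret added to the vault would pass through checkRow, which is what makes the vault behave like a single table.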

2. A specified collection of secrets

We store a specific set of secrets within our vault. I see this more like a directory of files, where we specify a list of secrets that must be found in the vault (some of these secrets could also be marked as optional).

For example, suppose we wanted a vault that stored all the sensitive information required for onboarding an employee at Matrix AI. We could have a JSON schema as follows:

{
  "toggl-username": {
    "/mediatype": "text/plain"
  },
  "toggl-password": {
    "/mediatype": "text/plain"
  },
  "zoho-email": {
    "/mediatype": "text/plain"
  },
  "zoho-password": {
    "/mediatype": "text/plain"
  },
  "aws-access-key": {
    "/mediatype": "text/plain"
  }
}

Then, our vault would appear as follows:

id	secret
toggl-username	user1
toggl-password	password1
zoho-email	someone@matrix.ai
zoho-password	password1
aws-access-key	abcd1234

3. "Secret" schemas

This third option shies away from the idea of enforcing the structure of the secret at the vault level. Instead, we create schemas that specify the structure of a secret.

For example, a schema for a login secret (same as the vault schema from option 1):

{
  "label": {
    "/mediatype": "text/plain"
  },
  "url": {
    "/mediatype": "text/plain"
  },
  "username": {
    "/mediatype": "text/plain"
  },
  "password": {
    "/mediatype": "text/plain"
  },
  "note": {
    "/mediatype": "text/plain"
  }
}

Or a schema for a credit card secret (if possible, mediatype could potentially be restricted to numerical, etc):

{
  "label": {
    "/mediatype": "text/plain"
  },
  "cardholder-name": {
    "/mediatype": "text/plain"
  },
  "card-number": {
    "/mediatype": "text/plain"
  },
  "ccv": {
    "/mediatype": "text/plain"
  },
  "expiry": {
    "/mediatype": "text/plain"
  }
}

Then, on a vault level, the user chooses which type of secret they'd like to add to the vault (e.g. login, credit card, etc). This could be an unrestricted add, whereby any kind of secret can be added to the vault.

There's also the potential to incorporate vault schemas here as well, where we list the set of secrets that we expect to be stored in a vault. This would work the same way as in option 2, only this time we have rigid schemas for the secrets to be added.

For example, we could then have a vault schema for Matrix AI onboarding:

{
  "toggl": {
    "/secretschema": "login"
  },
  "zoho": {
    "/secretschema": "login"
  },
  "aws": {
    "/secretschema": "aws-credentials"
  }
}

And individual vaults can be created for each team member as deemed fit.
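
To make option 3 concrete, the following TypeScript sketch resolves a vault schema whose entries point at named secret schemas through "/secretschema". It is only an illustration under assumptions: the registry, the expandVaultSchema helper, the "name/field" path layout, and the fields of "aws-credentials" are all invented here (the thread names that schema but never lists its fields).

```typescript
// Hypothetical types; not part of any existing Polykey API.
type SecretSchema = { [field: string]: { "/mediatype": string } };
type VaultSchema = { [secretName: string]: { "/secretschema": string } };
type SchemaRegistry = { [schemaName: string]: SecretSchema };

const secretSchemas: SchemaRegistry = {
  login: {
    label: { "/mediatype": "text/plain" },
    url: { "/mediatype": "text/plain" },
    username: { "/mediatype": "text/plain" },
    password: { "/mediatype": "text/plain" },
    note: { "/mediatype": "text/plain" },
  },
  // Assumption: these two field names are made up for the example.
  "aws-credentials": {
    "access-key-id": { "/mediatype": "text/plain" },
    "secret-access-key": { "/mediatype": "text/plain" },
  },
};

const onboardingVault: VaultSchema = {
  toggl: { "/secretschema": "login" },
  zoho: { "/secretschema": "login" },
  aws: { "/secretschema": "aws-credentials" },
};

// Expands each referenced secret schema into the entries the vault
// is expected to hold, written as "secretName/field".
function expandVaultSchema(vault: VaultSchema, registry: SchemaRegistry): string[] {
  const paths: string[] = [];
  for (const secretName of Object.keys(vault)) {
    const schemaName = vault[secretName]["/secretschema"];
    if (!Object.prototype.hasOwnProperty.call(registry, schemaName)) {
      throw new Error("unknown secret schema: " + schemaName);
    }
    for (const field of Object.keys(registry[schemaName])) {
      paths.push(secretName + "/" + field);
    }
  }
  return paths;
}
```

The point of the indirection is that "login" is defined once and reused by both toggl and zoho, which is the consistency benefit argued for in this option.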

joshuakarp commented 2 years ago

My perspective on these 3 options:

1. This doesn't make a huge amount of sense to me. Given that a big part of Polykey is being able to share vaults, we don't want to share an entire "database" of secrets. This would also mean that we'd need to create and share multiple vaults where we have different kinds of secrets (for example, onboarding an employee), and we'd likely have a conglomerate of vaults for lots of different purposes.
2. I don't feel like this is structured enough. There's too much flexibility for the user. That is, a user shouldn't have to think too much about the structure of the secrets that they need to store. We also have a repetition of structure, where we have username fields that don't share a common type. It also opens up the vault to consistency issues if we introduce optional fields. For example, we shouldn't be able to store a "username" secret without having a corresponding "password" secret. If the optional field setting is left to the user, we have potentially illogical storage.
3. This seems like the most balanced option to me, where we have a balanced degree of flexibility and structure. We no longer have consistency issues (like we do in option 2) because we have specific schemas for the secrets, and the vault can be as flexible or rigid as the user decides. This makes sense, given that users want to be able to share these vaults for different purposes.

CMCDragonkai commented 2 years ago

Vault schemas can be nested.

{
  "dir1": {
    ...
  },
  "dir2": {
    ...
  }
}

We have to differentiate directories from files. This could be done with the / character, since it is not allowed in file names.
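
One possible reading of this, sketched in TypeScript: keys that start with "/" (such as "/mediatype") are attributes of the entry itself, and every other key names a child. An entry that carries "/mediatype" is then a file, and anything else is a directory. This interpretation and the helper names are assumptions, not something the thread settles.

```typescript
// Hypothetical type for a nested schema entry.
type SchemaEntry = { [key: string]: string | SchemaEntry };

// "/" cannot appear in a file name, so a leading "/" marks an attribute.
function isAttributeKey(key: string): boolean {
  return key.charAt(0) === "/";
}

// Assumption: an entry with a "/mediatype" attribute describes a file.
function entryKind(entry: SchemaEntry): "file" | "directory" {
  return typeof entry["/mediatype"] === "string" ? "file" : "directory";
}

// Names of the files and directories directly inside an entry.
function childNames(entry: SchemaEntry): string[] {
  return Object.keys(entry).filter(function (key: string): boolean {
    return !isAttributeKey(key);
  });
}
```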

joshuakarp commented 2 years ago

So this means a directory would also have its own vault schema applied to it? For example, we could have a vault schema which specifies some files and a directory, and this directory would specify another vault schema?

joshuakarp commented 2 years ago

So I found some more discussion hidden away in a comment on one of the mock-ups: https://gitlab.com/MatrixAI/Engineering/Polykey/polykey-design/-/issues/40/designs/Vault_Schema.png?version=163940

Notably, the following example was given for a vault schema for storing a username and password inside a directory:

{
  "dirA": {
    "username": "text/plain",
    "password": "text/plain"
  }
}

This would create a vault with a directory structure like:

/dirA
/dirA/username
/dirA/password
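
The mapping from a nested schema to this directory listing can be sketched as a small recursive walk. This assumes the shorthand form used in the example, where a string value is a file's media type and an object value is a directory; DirSchema and listPaths are hypothetical names.

```typescript
// Hypothetical type: a string leaf is a media type, an object is a directory.
type DirSchema = { [name: string]: string | DirSchema };

// Lists every directory and file path the schema implies, parents first.
function listPaths(schema: DirSchema, prefix: string = ""): string[] {
  const paths: string[] = [];
  for (const name of Object.keys(schema)) {
    const path = prefix + "/" + name;
    paths.push(path);
    const value = schema[name];
    if (typeof value !== "string") {
      // Recurse into the nested directory schema.
      for (const child of listPaths(value, path)) {
        paths.push(child);
      }
    }
  }
  return paths;
}

const dirASchema: DirSchema = {
  dirA: {
    username: "text/plain",
    password: "text/plain",
  },
};

// listPaths(dirASchema) returns ["/dirA", "/dirA/username", "/dirA/password"]
```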

But what if we want to have a vault that just has a username and password in the root directory (with no extra directory)? Then, the user needs to create a brand new schema for this:

{
  "username": "text/plain",
  "password": "text/plain"
}

There's an unnecessary duplication of data here. The "username" and "password" fields between the schemas don't have any relation to each other (they're just labels for a chunk of text). That is, there's no indication (besides the label) that they're both storing the same kind of secret. Similarly, the user now has 2 vault schemas to manage which are doing very similar things.

I feel that option 3 from above is an improvement over this approach, but I'm interested to discuss this.

CMCDragonkai commented 2 years ago

Vault schemas are just directory schemas.

On 30 September 2021 9:45:06 am AEST, Josh @.***> wrote:

So this means a directory would also have its own vault schema applied to it? For example, we could have a vault schema which specifies some files and a directory, and this directory would specify another vault schema?


CMCDragonkai commented 2 years ago

These would be 2 different schemas so they are independent.

In reply to Josh's comment of 30 September 2021, 10:15:00 am AEST (the "dirA" mock-up example and the root-level username/password schema above).

joshuakarp commented 2 years ago

Had a quick discussion with Roger about this. Some clarifications:

We need to remember that a "vault" can essentially be seen as a directory on the filesystem, where we store secrets (files), and embed version control inside it. As such, a directory inside a vault can analogously be seen as a nested vault.

Therefore, a vault schema is just a description for a directory. The vault schema should be minimal and flexible to reflect this filesystem structure.

For example, our username and password vault schema:

{
  "username": "text/plain",
  "password": "text/plain"
}

The vault is then expected to contain exactly these 2 text files.

Note, we can eventually utilise native features from JSON schemas to increase the expressive power of our schemas without requiring a lot of work (such as the strict flag for loosening the schema: providing optional fields, or for specifying that a schema can have additional elements).
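
For reference, standard JSON Schema expresses this strictness with the required and additionalProperties keywords (and contentMediaType for a string's media type). The TypeScript sketch below writes the username/password schema in that style and hand-checks only those two keywords. Treating each file in a vault as a property of one object is an assumption, as are the names ObjectSchema and conforms; a real implementation would more likely use an existing JSON Schema validator.

```typescript
// Only the subset of JSON Schema keywords this sketch needs.
type ObjectSchema = {
  type: "object";
  properties: { [name: string]: { type: "string"; contentMediaType?: string } };
  required?: string[];
  additionalProperties?: boolean;
};

// "Exactly these 2 text files": both required, nothing else allowed.
const strictLogin: ObjectSchema = {
  type: "object",
  properties: {
    username: { type: "string", contentMediaType: "text/plain" },
    password: { type: "string", contentMediaType: "text/plain" },
  },
  required: ["username", "password"],
  additionalProperties: false,
};

// Loosened variant: password is optional and extra files are allowed.
const looseLogin: ObjectSchema = {
  type: "object",
  properties: strictLogin.properties,
  required: ["username"],
  additionalProperties: true,
};

function hasOwn(obj: object, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(obj, key);
}

// files maps a file name to its contents (hypothetical representation).
function conforms(schema: ObjectSchema, files: { [name: string]: string }): boolean {
  for (const name of schema.required || []) {
    if (!hasOwn(files, name)) return false;
  }
  if (schema.additionalProperties === false) {
    for (const name of Object.keys(files)) {
      if (!hasOwn(schema.properties, name)) return false;
    }
  }
  return true;
}
```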

Eventually, from the GUI's perspective, these schemas would be used to generate a form to create a vault. This would also mean we could use the properties from the JSON schema to enforce the validation logic at this level.

Similarly, note that a vault doesn't necessarily need to have a schema applied to it. For example, we could have an "unrestricted" vault (with no schema applied) that contains a collection of directories, with each of these directories having a different kind of schema applied to it.

Additionally, for a cloned vault, we'd need to consider whether we also clone the schema. The answer here is most likely yes.

Finally, schemas should be identified with a name and/or ID.

While vault schemas can be user-defined, we should also have some native schemas for users (for example, login, credit card, etc).

As a side note regarding this, does this mean a "secret" is just one file of this schema? If that's the case, then we'd end up having multiple secrets that are parts of a "composed" secret. For example, for a credit card, we'd need to store 4 different secrets: cardholder name, card number, expiry date, CCV.
Right now on the CLI, we have a secrets add command. This means that to add a credit card, we'd need to make 4 separate calls to the CLI.
I suppose we could have the opportunity here to introduce "porcelain" commands. As a rough example, this could be secrets add credit-card <cardholder name> <card number> <expiry date> <ccv>.
However, we'd also have a vault schema for a credit card. When adding a secret, how do we make this connection to the vault schema? It doesn't make any sense to only add a CCV number without the other card details, for example.
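
One way such a porcelain command could tie back to the schema is to treat the composite secret as all-or-nothing: validate every part against the schema first, and only then write the individual files. The TypeScript sketch below illustrates that idea with an in-memory vault; addComposite, FileSchema, VaultFiles and the "dir/name" layout are hypothetical and do not correspond to an existing Polykey command or API.

```typescript
type FileSchema = { [name: string]: string }; // file name -> media type
type VaultFiles = { [path: string]: string }; // path -> file contents

const creditCardSchema: FileSchema = {
  "cardholder-name": "text/plain",
  "card-number": "text/plain",
  "expiry": "text/plain",
  "ccv": "text/plain",
};

function hasOwn(obj: object, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(obj, key);
}

// Returns new vault contents with every part written under dir.
// Throws before writing anything if a part is missing or unexpected,
// so a lone CCV can never be stored without the other card details.
function addComposite(
  vault: VaultFiles,
  dir: string,
  schema: FileSchema,
  parts: { [name: string]: string }
): VaultFiles {
  for (const name of Object.keys(schema)) {
    if (!hasOwn(parts, name)) throw new Error("missing part: " + name);
  }
  for (const name of Object.keys(parts)) {
    if (!hasOwn(schema, name)) throw new Error("unexpected part: " + name);
  }
  const next: VaultFiles = {};
  for (const path of Object.keys(vault)) {
    next[path] = vault[path];
  }
  for (const name of Object.keys(schema)) {
    next[dir + "/" + name] = parts[name];
  }
  return next;
}
```

With something like this behind it, a single secrets add credit-card call would replace the 4 separate secrets add calls, and the schema check would happen once for the whole card.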

MatrixAI / Polykey

Vault and File Schema #222
