MatrixAI / Polykey

Polykey Core Library
https://polykey.com
GNU General Public License v3.0

Replace JOSE with our own `tokens` domain and specialise tokens for Sigchain, Notifications, Identities and Sessions #481

Closed CMCDragonkai closed 1 year ago

CMCDragonkai commented 1 year ago

Specification

With the recent crypto update in MatrixAI/Polykey#446, we have found that the JOSE library is no longer compatible, not even with the webcrypto polyfill.

Since JOSE is easier to replace than x509, we have decided to remove JOSE entirely.

This affects several domains:

  1. sigchain - claims here are general JWS structures, with multiple signatures
  2. identities - claims are being put on identity providers, but this was developed before we understood JOSE and JWS; these can now use specialised tokens
  3. notifications - uses JWS to send "signed" messages; the encryption of these notification messages relies on P2P E2E encryption. We don't store the messages encrypted (they are stored unencrypted, then disk-encrypted), but we retain the signatures of the messages as they are useful
  4. sessions - using JWS MAC authenticated tokens as a short-lived session token for authentication (remember you always exchange long-lived credentials for short-lived credentials)
  5. Our tokens should be based on the Ed25519 RFC for JWS: https://www.rfc-editor.org/rfc/rfc8037#appendix-A.4

This is the relationship of JOSE to what we are doing:

(Diagram: relationship between JWT, JWS, JWE and JWK.)

Additional context

Tasks

  1. Renaming claims to tokens, we will have a generic "Token" structure
  2. Specialised tokens like TokenClaim or TokenSession can then be used by various domains like sigchain, identities, notifications and sessions
  3. Claims are now just a specific variant of the more general token.
  4. We store our domain types in the DB, but we can choose arbitrary "representations" of our tokens based on the JWS spec as compact, flattened or general representations. Although if we have multiple signatures, compact and flattened representations may not really make any sense.
  5. The sigchain, notifications, sessions and identities will be going through some renovation of their API in accordance with the latest ways of doing things.
  6. At the same time, record all the "reactive" points where property subscription could be useful, so this can be done in a subsequent issue.
CMCDragonkai commented 1 year ago

The hashing of the claims should be using multihash and not just a raw SHA256 hash. We can combine this with our new hashing utilities provided by the keys domain.

CMCDragonkai commented 1 year ago

Ok so we now have:

import * as keysUtils from '../keys/utils';

keysUtils.sha256();
keysUtils.sha512();

These use libsodium to do the hashing. I also have sha256G and sha256I variants for special use cases. Note that sha256G is a "consumer" generator that is pre-primed, meaning you can start with next(Buffer.from('hello')) immediately. See https://stackoverflow.com/questions/74095105/explicit-generator-return-type-in-typescript-demands-that-generator-prototype-r/74095446?noredirect=1#comment130829319_74095446. On a side note, I figured that one could potentially derive a Consumer type from the Iterator type that focuses purely on consumption and returning a result. I tried this, and while it works, I think some more iterations on it are required for actual generic consumption since there could be many kinds of iterators.
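For illustration, here is a minimal sketch of such a pre-primed consumer generator (not the actual keys-domain implementation; Node's crypto module stands in for the libsodium-backed hashing):

```ts
import { createHash } from 'node:crypto';

function sha256G(): Generator<void, Buffer, Buffer | null> {
  const g = (function* (): Generator<void, Buffer, Buffer | null> {
    const state = createHash('sha256');
    while (true) {
      const chunk = yield;
      if (chunk === null) return state.digest();
      state.update(chunk);
    }
  })();
  g.next(); // pre-prime, so the very first `next(chunk)` is actually consumed
  return g;
}

// Usage: feed chunks, then pass `null` to finalise
const g = sha256G();
g.next(Buffer.from('hello'));
const digest = g.next(null).value; // the 32-byte SHA-256 digest
```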

Now the next issue is that multiformats now is only used in 2 places:

claims/utils.ts
client/utils/utils.ts

In client/utils/utils.ts, I believe it should instead be using Buffer directly since it's just doing base64 decoding, and we can just use the Buffer polyfill even in non-node environments.

So that leaves the claims utils and the js-id library using multiformats for base encoding. Base encoding doesn't require C for speed, it's plenty fast by itself in JS, and we can continue using multiformats in that respect.

The main thing now is multihashing. If we want to re-use our multiformats structure, we can actually re-use our keys functions for that purpose. https://github.com/multiformats/js-multiformats#multihash-hashers

import crypto from 'node:crypto';
import * as hasher from 'multiformats/hashes/hasher';

const sha256 = hasher.from({
  // As per multiformats table
  // https://github.com/multiformats/multicodec/blob/master/table.csv#L9
  name: 'sha2-256',
  code: 0x12,
  encode: (input) => new Uint8Array(crypto.createHash('sha256').update(input).digest()),
});

So a question is: should these derivations sit inside the claims or token utilities, or should they be put into more general utilities somewhere? That way it's possible to use them in other places where we may want to do multihashing.
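If they do end up in a more general utilities module, a hedged sketch of wiring the keys functions into a multiformats hasher could look like this (assuming keysUtils.sha256 takes a Uint8Array and returns a 32-byte digest synchronously):

```ts
import { from } from 'multiformats/hashes/hasher';
import * as keysUtils from '../keys/utils';

// Multihash hasher backed by the keys domain's libsodium hashing
const sha256Hasher = from({
  name: 'sha2-256',
  code: 0x12, // multicodec code for sha2-256
  encode: (input: Uint8Array) => keysUtils.sha256(input),
});
```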

CMCDragonkai commented 1 year ago

I've replaced the multiformats usage in src/client/utils/utils.ts with just Buffer. It just used padded base64 encoding, which is what Buffer does by default. This reduces some dependency on third party libraries here.
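For reference, the round trip with just Buffer (standard padded base64, Buffer's default):

```ts
const data = Buffer.from('hello world');
const encoded = data.toString('base64');        // 'aGVsbG8gd29ybGQ='
const decoded = Buffer.from(encoded, 'base64'); // back to the original bytes
console.log(decoded.equals(data));              // true
```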

CMCDragonkai commented 1 year ago

For now I'll put the multihashing derivations into src/tokens/utils.ts. This will be the only place where this is used for now other than js-id.

CMCDragonkai commented 1 year ago

The current hashClaim uses:

  1. Canonicalized JSON encoding - https://www.rfc-editor.org/rfc/rfc8785
  2. Then uses SHA256
  3. Then uses hex encoding

Basically something like:

    hashClaim = hex(sha256(canonicalize(jsonClaim)))

However I want to digress about the JWS spec.

JWS already has some form of algorithm agility through the alg header property. For example we could use HS256 or ES256 or EdDSA. These headers dictate the algorithms used for generating and verifying signatures or MAC codes.

In the general format, it may look like this:

{
  payload: base64url(payload),
  signatures: [
    {
      protected: base64url({
        alg: 'EdDSA',
      }),
      header: {
        kid: '<NodeId>'
      },
      signature: base64url(signature)
    },
    {
      protected: base64url({
        alg: 'HS256',
      }),
      header: {
        kid: '<NodeId>'
      },
      signature: base64url(signature)
    }
  ]
}

Unlike JWE there is no shared header, so headers have to be specific to each signature.

This means it makes the most sense for our hPrev to be in the payload, like it is currently.

Now a question is whether we can use alg to also determine agility of the hashing algo for hPrev.

Given that we cannot put it into a shared header, and that we are using multibase encoding for the <NodeId>, I believe we should use both multibase and multihash for our hPrev.

So anything that is locked down in the JWS spec will be multibase and/or multihash.

So I propose we use the following:

    hashClaim = multibase(multihash(canonicalize(jsonClaim)))

The selected multibase will just be base64url, similar to the rest of the JWS spec, while the multihash can be SHA256 for now.
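A hedged sketch of this proposal, using js-multiformats for the multihash and multibase layers and the canonicalize package for RFC 8785 (the library choices here are assumptions, not a final decision):

```ts
import { sha256 } from 'multiformats/hashes/sha2';
import { base64url } from 'multiformats/bases/base64';
import canonicalize from 'canonicalize';

async function hashPrevClaim(jsonClaim: object): Promise<string> {
  const canonical = Buffer.from(canonicalize(jsonClaim)!, 'utf-8');
  const mhDigest = await sha256.digest(canonical); // multihash-framed SHA-256
  return base64url.encode(mhDigest.bytes);         // multibase (base64url) string
}
```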

Another thing I want to touch on is that JWS has both typ and cty which can be used to disambiguate the JWS by providing media types.

The typ is for the complete JWS. The cty is for just the payload. Both are optional, and both are meant to be processed by the application.

Both are case insensitive. It is recommended to omit the application/ prefix when no other / appears in the value. So JWT should be understood as application/jwt, and example as application/example. But if it is application/example;part="1/2", it should be understood as application/example;part="1/2".

This is something we can use later, like application/claim vs application/somethingelse, to differentiate sigchain tokens from identity tokens, notification tokens and session tokens. Note we would use something like application/vnd.polykey.sigchain-token+json.

There are some recommendations to use JOSE+JSON when using the general or flattened format.

Finally for the alg, in this case we won't need to use a custom alg like we are doing in our JWE usage. But in the future, custom names should be collision resistant:

Collision-Resistant Name A name in a namespace that enables names to be allocated in a manner such that they are highly unlikely to collide with other names. Examples of collision-resistant namespaces include: Domain Names, Object Identifiers (OIDs) as defined in the ITU-T X.660 and X.670 Recommendation series, and Universally Unique IDentifiers (UUIDs) [RFC4122]. When using an administratively delegated namespace, the definer of a name needs to take reasonable precautions to ensure they are in control of the portion of the namespace they use to define the name.

Example possible algs:

CMCDragonkai commented 1 year ago

Ok so I've come up with realisations here.

Consistency of JWS/JWT (token)

All the claims and JWS usage right now are all over the place, and we need to make it more consistent.

Here's an idea.

Let's start with a generic Token type:

/**
 * Token based on JWT specification.
 * All properties are "claims" and they are all optional.
 * The entire POJO is put into the payload for signing.
 */
type Token = {
  iss?: string;
  sub?: string;
  aud?: string | Array<string>;
  exp?: number;
  nbf?: number;
  iat?: number;
  jti?: string;
  [key: string]: any;
};

The above is based on the JWT specification, and specifically it is just the payload. This payload is what gets signed. Everything else is just metadata, although a portion of the metadata is also signed.

Then we can have specialised "tokens" for various usages. I suspect the tokens domain will specifically handle just generic tokens and produce TokenSigned structures.

/**
 * Signed token based on General JWS specification.
 */
type TokenSigned = {
  payload: string;
  signatures: Array<{
    signature: string;
    protected: string;
  }>;
};

In particular we know that the protected header, which is the base64url encoding of the JSON string, will contain something like this:

{
  alg: 'EdDSA' | 'BLAKE2b' | 'HS512256'
}

Atm, sodium-native only exposes the BLAKE2b algorithm and HMAC-SHA512-256 (truncated SHA-512), even though libsodium has HS256 and HS512. So for now, we just focus on what sodium-native has.

The BLAKE2b is provided by crypto_generichash. It supports hashing without a key, and also hashing with a key. In the case of signing, I believe hashing with a key would be useful, where the key is just a 32-byte symmetric key, which is exactly the symmetric key size we use.

HS512256 is just what is exposed by crypto_auth. In the future we can get sodium-native to expose the HS256 and HS512 and then our tokens would actually be compatible.

Since we're going to do this for now, let's just use BLAKE2b.

In that case... our multidigest work needs a bit of a tweak. Atm, we don't have blake2b, and I'm not sure which of these https://github.com/multiformats/multicodec/blob/master/table.csv are actually specific to libsodium. The default output size is 32 bytes or 256 bits. Therefore I think it is this one: blake2b-256 with code 0xb220.
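For illustration, a keyed BLAKE2b-256 MAC via sodium-native's crypto_generichash could look like the following rough sketch (the function name is illustrative, not the actual keys-domain API):

```ts
import sodium from 'sodium-native';

function macWithKeyBlake2b(key: Buffer, data: Buffer): Buffer {
  const out = Buffer.allocUnsafe(sodium.crypto_generichash_BYTES); // 32 bytes
  sodium.crypto_generichash(out, data, key); // keyed BLAKE2b
  return out;
}
```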

Subsequently we then need to create "specialised" tokens for all the different usecases we have for them.

Here are some draft tokens I've synthesised from the current situation in the codebase:

type TokenClaimNode = {
  jti: ClaimIdEncoded;
  iat: number;
  iss: NodeIdEncoded;
  sub: NodeIdEncoded;
  nbf: number;
  prev: string | null;
  seq: number;
};

type TokenClaimIdentity = {
  jti: ClaimIdEncoded;
  iat: number;
  iss: NodeIdEncoded;
  sub: ProviderIdentityId;
  nbf: number;
  prev: string | null;
  seq: number;
};

type TokenNotification<T> = {
  jti: NotificationIdEncoded;
  iat: number;
  iss: NodeIdEncoded;
  sub: NodeIdEncoded;
  data: T;
};

type TokenSession = {
  iss: NodeIdEncoded;
  sub: NodeIdEncoded;
  iat: number;
  nbf: number;
  exp: number;
};

The old types were originally defined in the claims domain. So TokenClaimNode and TokenClaimIdentity are JWT synthesised versions of Claim and ClaimData. Compare the above to these old (current) types:

type ClaimLinkNode = {
  type: 'node';
  node1: NodeIdEncoded;
  node2: NodeIdEncoded;
};

type ClaimLinkIdentity = {
  type: 'identity';
  node: NodeIdEncoded;
  provider: ProviderId;
  identity: IdentityId;
};

type ClaimData = ClaimLinkNode | ClaimLinkIdentity;

type Claim = {
  payload: {
    hPrev: string | null; // Hash of the previous claim (null if first claim)
    seq: number; // Sequence number of the claim
    data: ClaimData; // Our custom payload data
    iat: number; // Timestamp (initialised at JWS field)
  };
  signatures: Record<NodeIdEncoded, SignatureData>; // Signee node ID -> claim signature
};

I believe the new iteration is far clearer and uses standardised terminology.

Note that ProviderIdentityId is meant to be a new JSON.stringify([ProviderId, IdentityId]) type, since it would make sense to do this for node to digital identity claims.
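A small sketch of that composite encoding (plain string aliases stand in for the codebase's opaque ID types):

```ts
type ProviderId = string;
type IdentityId = string;
type ProviderIdentityId = string; // JSON-encoded [ProviderId, IdentityId] tuple

function encodeProviderIdentityId(
  providerId: ProviderId,
  identityId: IdentityId,
): ProviderIdentityId {
  return JSON.stringify([providerId, identityId]);
}

function decodeProviderIdentityId(
  encoded: ProviderIdentityId,
): [ProviderId, IdentityId] {
  const [providerId, identityId] = JSON.parse(encoded);
  return [providerId, identityId];
}
```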

The jti can then be useful to keep track of an ID for these tokens. In the sigchain, this is ClaimIdEncoded. But in notifications, it would be NotificationIdEncoded.

One important thing about the NotificationIdEncoded is that currently it takes additional data. However, I will flatten this when I get to the notifications types. Instead I'll use an intersection type to take the common notification payload properties and combine them with additional properties specific to particular kinds of notifications. We currently have 3 kinds: GestaltInvite, VaultShare and General.

The session token doesn't have a jti at the moment because the session token is not saved in the DB anywhere. It's just handed to the client for persistence.

This is why the generic token interface does not have any required properties.

The tokens domain will expose functionality for signing tokens, encoding them, and decoding them. However, each of the specialised tokens will be declared by their respective domains.

The claim tokens are used by both the identities and sigchain domains. This is because we're going to double-post the claims from sigchain to identity, and from sigchain to sigchain. This means the claims domain may still exist to share this functionality. Some documentation should be written in the claims domain to indicate what this represents. Furthermore, the claims keyword is pretty overloaded; it might be good to disambiguate this "claims" usage with something else. We used to call these "links", and it may be better to change back to calling them "links". Thus IdentityLink and NodeLink could be clearer for both identities and sigchain.

The session token type will still be in sessions. The notifications token type will still be in notifications.

The tokens/utils should have:

Both functions will produce a TokenSigned structure.

Note that doubly signed claims will be made a lot simpler. They will be stored on both sigchains. The iss is always the initiator of the node link.

CMCDragonkai commented 1 year ago

The tokens domain seems to have crossovers with the KeyManager repurposing in MatrixAI/Polykey#472. Both would have implications for secure computation MatrixAI/Polykey#385. I imagine further innovations coming there that include not just JWT, but also take ideas from PASETO, biscuits... etc.

CMCDragonkai commented 1 year ago

Ok I've replaced JSON.stringify in the keys utils with canonicalize, since that ensures deterministic serialisation, which is useful when we later want to re-generate hashes.

So now keys utils exposes hashWithKey and its variants. We've chosen BLAKE2b as the symmetric hash to use here. Furthermore, the token utilities end up with signWithKey and signWithPrivateKey, which take a key, a token payload, and any additional header data, and then produce a TokenSignature.

We don't go straight to producing a signed token structure because it can be mutated. Instead one may combine these things together to produce a TokenSigned, but one can also mutate and add additional signatures to the TokenSigned.

It seems that a safe way of doing this is to construct a rich object representing a token, with the ability to add and remove signatures at will, and then the ability to produce the serialised representation. That way, one would not be able to just mutate the token payload or add arbitrary signatures in. It would encapsulate the mechanics of tokens internally.

CMCDragonkai commented 1 year ago

So I've renamed them to be more clear:

  macWithKey,
  macWithKeyG,
  macWithKeyI,
  authWithKey,
  authWithKeyG,
  authWithKeyI,

MAC is clearer than hash, since we are not just producing a digest, but a MAC, which ensures integrity and authenticity.

CMCDragonkai commented 1 year ago

We now have a Token class. This sits in src/Token.ts.

It exposes several methods:

During verification, the default behaviour is to iterate over the signatures to verify the token. This means if an algorithm or signature doesn't match, it just moves on to the next one. This may not be efficient if there are lots of signatures on a token, but that is not likely to be the case, so it's at most O(2) for all our existing use cases. We may need to apply some operational limits when receiving token data for notifications, sigchain and more.
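A simplified sketch of that verification loop (verifyWithPublicKey stands in for the actual keys-domain verify function):

```ts
function verifyToken(
  signingInput: Buffer,
  signatures: Array<{ publicKey: Buffer; signature: Buffer }>,
  verifyWithPublicKey: (key: Buffer, data: Buffer, sig: Buffer) => boolean,
): boolean {
  for (const { publicKey, signature } of signatures) {
    // Mismatching algorithms or invalid signatures just move on to the next
    if (verifyWithPublicKey(publicKey, signingInput, signature)) return true;
  }
  return false;
}
```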

This Token represents a single token. Note that a "TokenManager" does not exist at this point in time. This is because we instead have many token managers. The Sigchain is one particular token manager, managing specifically identity link and node link tokens, while the NotificationManager manages notification tokens... etc.

So for now there is no "general" token manager. Just like how there is no "general" key manager. Such things would only come into play during secure computation. Both of which would have to be built on top of the vault system to take advantage of high-level secret exchange like git sync MatrixAI/Polykey#385. There are of course limitations here... especially if the FS paradigm is too limiting for such secrets. But generally high level UIs tend to fit into the FS metaphor. But we could imagine a KeyManager and TokenManager sitting on top of the vault system.

CMCDragonkai commented 1 year ago

So therefore hashClaim is really the domain of Sigchain as it is specifically about hashing the previous link token. We are not using the "claim" keyword here to avoid confusion with "claim" properties that go into the token.

CMCDragonkai commented 1 year ago

JWKs are just a particular variant of a Token payload.

The token payload has some "registered claims", but these are all optional. So a payload is ultimately just any structured data that can be represented by JSON.

The token payload can be signed, which turns it into a JWS.

A token payload can be encrypted, which turns it into a JWE.

A token payload can be signed then encrypted, which means it becomes a JWS, then becomes a JWE. This allows arbitrary nesting of signing and encryption.

The expected "payload" when signing a JWT was meant to be a base64url of the JSON stringification.

When it's a JWE though, we also have different representations for the JWS if we want to use it as a payload, like compact, general and flattened. So I'm not sure if it matters which representation you are encrypting, because JWE only requires that the plaintext is some octet-sequence string.

This could mean that it depends on the cty, which tells the end user what the content type of the plaintext is.

The spec seems to recommend that when nesting a JWT, the cty should be application/jwt or just JWT. I find it strange that application/jose appears to conflict with application/jwt.

CMCDragonkai commented 1 year ago

Ok I've actually worked out how all these JW* specs relate to each other, and how we should map these to the domains we have.

(Diagram: relationship between JWT, JWS, JWE and JWK.)

So for now we will have the keys and tokens domains cover all these use cases. However in the future, we could expand this to be more flexible, especially for JWE.

CMCDragonkai commented 1 year ago

The Token class has been fleshed out.

The next step is testing for the Token.

We also need to adapt the claims domain to produce specialised token subtypes.

In particular the sigchain shouldn't actually be storing the TokenSignedEncoded, because it's better that we have the data available. It's only when we are about to show the end user, or when the data is being transferred, that the encoded form is worth producing.

Inside the sigchain, it makes sense to leave the payload unencoded, so it will be just JSON stringified.

However if the JSON data contains binary data, then this becomes complicated. We know that TokenHeaderSignature contains binary signature data.

Another idea is that the sigchain can store the signatures separately. After all, the token format isn't really designed for efficient storage. It's possible that the payload can be just JSON, while the signatures are stored more efficiently elsewhere.

Anyway the internal storage format of sigchain just needs to be flexible and optimised, it doesn't have to be the same as the token format.

CMCDragonkai commented 1 year ago

The difference between tokens and claims.

Not all tokens are claims. But all claims are tokens.

Claims are signed. But the base type of claims is just the structured data. The signed version of the data is ClaimSigned.

interface Claim extends TokenPayload {
  jti: ClaimIdEncoded;
  iat: number;
  nbf: number;
  prev: string | null;
  seq: number;
}

interface ClaimSigned<T extends Claim> extends TokenSigned {
  payload: T;
  signatures: Array<{
    protected: {
      alg: 'EdDSA';
      kid: NodeIdEncoded;
      [key: string]: any;
    };
    signature: Signature;
  }>;
}

And the payloads:

interface ClaimLinkIdentity extends Claim {
  iss: NodeIdEncoded;
  sub: ProviderIdentityId;
}

The claims domain is shared by many domains. Like identities, discovery, nodes... etc. This is why it is outside at the top level, and not inside the sigchain.

CMCDragonkai commented 1 year ago

So the sigchain doesn't have to store things exactly as a JWT. That was a source of complexity before, where the data being stored was also encoded, or there was an encoding/decoding process for the entire JWT spec.

Instead the Sigchain can store claims and signatures in a normalised fashion. These are also the un-encoded versions.

Sigchain/claims/{ClaimId} -> {Claim}
Sigchain/signatures/{ClaimId}/{SignatureIndex} -> {ClaimHeaderSignature}

Regarding the SignatureIndex, the only thing we need to do is use lexicographic encoding for an "array index", that is 0, 1... etc.
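A sketch of that index encoding; the lexicographic-integer package is one option here (an assumption, not necessarily what the Sigchain ends up using):

```ts
import lexi from 'lexicographic-integer';

function encodeSignatureIndex(index: number): string {
  // Packed integers sort lexicographically in the same order as their values
  return lexi.pack(index, 'hex');
}
```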

However we won't exactly know what the "oldest" index is; it would require looking up the previous number first, before then making a put. This is not some global counter like our ID generators or sequences.

The alternative is to store the entire array in one go. But as we saw before with the other domains, we are storing things in a more normalised way to lift the data structure to the DB's knowledge.

This does mean it's possible to create new claims, and then sign claims from the past. It's also possible to remove signatures... but this should be disallowed by the interface of the sigchain.

What does it mean to add a new signature to a past claim? Does it change the semantics of the claim? I don't think so. It's ok to add new claims, and add new signatures to existing claims. Things still work.

However it should not be allowed to remove claims or remove signatures. Only forward movement is allowed.

Of course, one can always destroy the data as well; that changes things a bit. No replication consensus just yet.

If we do later want to do a blockchain consensus on this, I think signing past claims would then not be allowed.

Yes, because the hash is calculated over the entire signed claim, not just the claim payload.

CMCDragonkai commented 1 year ago

In the sigchain, I'm currently reviewing the addClaim method.

I can see there's a challenge here in terms of managing intermediate claims, doubly signed claims, or claims that need to be signed by multiple parties.

In the current staging branch, this involves several additional methods and types for incrementally processing the claim. And this translates to having a method like addExistingClaim.

I don't like the way this is structured. We can simplify this procedure.

The first thing to understand is that adding claims to the sigchain is a serialised process. Only one can be done at a time.

The second thing to realise is that one cannot sign or manipulate signatures of existing claims in the sigchain. The sigchain is "immutable". It's append only. This is enforced through the hashing property, which contains the hash of the previous signed claim including all the signatures.

This means adding a claim must be transactional, and therefore any doubly or multi-party signed claim must be completed before the claim is entered into the Sigchain.

With the availability of transactions and locks we can do this now with just one method. This method can be addClaim, but it must now take an additional callback. This callback provides a signed claim that isn't yet put into the database, but allows further signing on the same claim structure.

Once the callback is executed and finishes we can then put the claim in the DB and commit the transaction.

The transaction will also lock to ensure serialised usage here.

Doubly signed processes will need timed cancellable deadlines so they can cancel the operation. Similarly, one can cancel the transaction by throwing an exception while inside it.

However if one just calls sigchain.addClaim() without a transaction, a transaction is created, and there is an assumption that the call will just succeed. But if an exception is thrown inside the callback, then that will also cancel the operation too.

CMCDragonkai commented 1 year ago

The callback is called signingHook.

    signingHook?: (token: Token<Claim>) => Promise<void>,

This takes the token, which is now parameterised to Claim, and it is expected that it may make mutations on the token. The Token interface limits these to append-only mutations, such as adding new signatures... etc.
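A hedged usage sketch of this flow; the exact addClaim parameter shape and the helper names (requestRemoteSignature, addSignature, toSigned) are assumptions for illustration:

```ts
await sigchain.addClaim(
  { typ: 'ClaimLinkNode', iss: localNodeIdEncoded, sub: remoteNodeIdEncoded },
  async (token: Token<Claim>) => {
    // Our own signature is already on the token; ask the other party to
    // countersign before the claim is committed to the DB
    const remoteSignature = await requestRemoteSignature(token.toSigned());
    token.addSignature(remoteSignature);
  },
);
```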

But now by exposing the Token, we are allowing potentially other operations like symmetric MAC codes.

This means, our ClaimProtectedHeader can't actually be guaranteed to be digital signatures. In such a case, we may remove ClaimHeaderSignature requirement... and make it more generic.

CMCDragonkai commented 1 year ago

I'm removing the ChainData and ChainDataEncoded types and associated functions; none of these need to be used, since verifying claims should be done directly over each individual claim. It doesn't make sense to take the entire sigchain data.

Will need to add indexes into the sigchain now based on usecase to know what kind of information we need to operate over.

CMCDragonkai commented 1 year ago

One of the weird things that was done was that ChainData, which is a serialised version of an entire node's sigchain data, was put into the NodeInfo type.

/**
 * Serialized version of a node's sigchain.
 * Currently used for storage in the gestalt graph.
 */
type ChainData = Record<ClaimIdEncoded, Claim>;

/**
 * Serialized version of a node's sigchain, but with the claims as
 * Should be used when needing to transport ChainData, such that the claims can
 * be verified without having to be re-encoded as ClaimEncoded types.
 */
type ChainDataEncoded = Record<ClaimIdEncoded, ClaimEncoded>;

You can see here that this would not be scalable. The chain data could grow forever. Why do we need the entire chain data in the node info? Surely only some of the information is relevant...?

Apparently this information is then stored in the gestalt graph. So I have a feeling this information is being used for the gestalt linking. However this is of course not efficient.

CMCDragonkai commented 1 year ago

Ok we need to do some intermediate testing of the new sigchain for now before we proceed.

This would give us the ability to know whether our tokens, claims and sigchain domains are working with the current refactoring.

CMCDragonkai commented 1 year ago

Then we have to figure out the sigchain integration into the NodeInfo and all the downstream effects of all the changes. And probably circle back to sigchain to add in the necessary indexing. Finally apply the token changes to notifications and sessions.

CMCDragonkai commented 1 year ago

Ok sigchain has now fastcheck doing property tests on all of its methods.

We can now proceed with sigchain integration into NodeInfo, and then bring indexing into the sigchain to deal with MatrixAI/Polykey#327, in particular the ability to figure out what the current state of a gestalt is.

We have the ability to link node to node, and link node to identity.

Things we want to ask the sigchain:

  1. What are all the nodes you are currently linked to?
  2. What are all the identities you are currently linked to?

Without revocations, and assuming all claims are only link tokens, it's sufficient to just download the entire sigchain data and use that. That might explain why the entire chain data was passed around.

However with revocations, it makes sense that we would want to index the sigchain by connections to other nodes or identities. Then one can look up the index instead.

Should our indexing be hardcoded or something configurable by the user? In the tasks system, tasks can be arbitrarily indexed by paths. However this makes less sense if we have specific operations on the sigchain for looking up these identity claims.

Right now the sigchain can take any input claim data. But when indexing, it only makes sense to do that for specific claims, and ignore the other kinds of claim types.

We may need to change the ClaimInput type so that it instead chooses specific kinds of claims, and such claims must have a type key so we can disambiguate what kind of claim they are, and then appropriately work them into the index.

Automatic indexing on the DB was meant to allow us to easily create indexes instead of doing it manually, but even more automatic would be something that creates indexes over the entire structure of the data. That might look like indexing every key.

Note that indexing something right now makes the key and value unencrypted because keys cannot be encrypted atm.

Ok I think we start with a set of allowed claim types. Then index them appropriately. The sigchain can provide domain specific operations, as we expect for identity operations, and in the future other kinds of operations for other kinds of claims.

CMCDragonkai commented 1 year ago

The NodeInfo type is this:

/**
 * Data structure containing the sigchain data of some node.
 * chain: maps ClaimId (lexicographic integer of sequence number) -> Claim
 */
type NodeInfo = {
  id: NodeIdEncoded;
  chain: ChainData;
};

It is used by GestaltGraph and Discovery.

It is in fact not used in the nodes domain.

There's a part of the code in NodeManager that calls some data closesNodeInfo. But that's really incorrect, since it is just [NodeId, NodeData].

I'm going to change the name of that variable to ensure that we can see that this is mostly a gestalt graph sort of thing.

It ends up being used as part of:

type GestaltNodes = Record<GestaltNodeKey, NodeInfo>;

Which is then placed into:

type Gestalt = {
  matrix: GestaltMatrix;
  nodes: GestaltNodes;
  identities: GestaltIdentities;
};

This would imply that our GG currently stores a static ChainData type, which I guess is acquired when the node is discovered and entered into the GG.

By comparison, IdentityInfo is used in a similar way; it looks something like:

/**
 * Data related to a particular identity on an identity provider.
 * claims: a map of IdentityClaimId to an (identity -> keynode) claim
 */
type IdentityInfo = IdentityData & {
  claims: IdentityClaims;
};

So the problem here is that the GG is storing static data that was discovered.

This data should be considered "stale" as soon as the data was acquired. And we should also figure out exactly what kind of data we actually need here.

Given that IdentityClaims maps the identity claim ID to the identity claim.

I think the main idea here is that the GG stores IdentityInfo and NodeInfo into these 2 paths:

  protected gestaltGraphNodesDbPath: LevelPath = [
    this.constructor.name,
    'nodes',
  ];
  protected gestaltGraphIdentitiesDbPath: LevelPath = [
    this.constructor.name,
    'identities',
  ];

The identity info is meant to be a record mapping an "identity claim ID" to the identity claim.

The node info is meant to contain the claims that the node is also claiming.

I think these 2 types make more sense to be put into the gestalts domain.

Both types are only used by gestalts and discovery.

These types are not relevant to identities and nodes, since they are more relevant to gestalts and subsequently discovery, which is built on top of gestalts.

CMCDragonkai commented 1 year ago

The IdentityClaimId and IdentityClaim do make sense to be used in identities.

However the IdentityClaim does seem a bit confusing.

The reason is because a Claim now is a token that is saved in the sigchain. And IdentityClaim is simply an augmented type, it augments the claim with id and url properties.

These properties are not part of the claim itself, because the claim would not have this information before it is posted.

This means anything actually returning IdentityClaim is providing a structure that can be hard to decipher.

For example, the methods Provider.publishClaim and Provider.getClaim both return IdentityClaim.

Furthermore the Claim type has changed, it used to be the entire signed token. It now focuses solely on the token payload. The full information posted to the identity would be the signed claim.

Therefore these methods would have to change.

Firstly they should be returning something that contains the SignedClaim.

Secondly the additional metadata shouldn't be part of the same structure being called a claim, it's not actually part of the claim. It's just metadata returned as part of the response that can be useful.

So what I'm doing is this:

  1. Changed IdentityClaimId to ProviderIdentityClaimId, moved this to ids/types.
  2. Changed IdentityClaim to IdentitySignedClaim to indicate a change of wrapping SignedClaim.
  3. Change all of the identities to now return IdentitySignedClaim.

This has now resulted in the GH provider using claims/schema to validate the claims, which is currently using claimIdentitySchema.

This schema verification is a bit of a problem, because it should be defined as part of the payloads in this case ClaimLinkIdentity.

So next thing is to update the claims domain with the relevant runtime validators for each relevant payload.

One thing that is different is that we are not publishing SignedTokenEncoded or SignedClaimEncoded. This is because such encoding is not human readable. The encoded version is literally a General JWS JSON format. But it contains base64url content for the payload. So instead we are now publishing the non-encoded version SignedToken instead of SignedTokenEncoded. This has meant that validation routines can actually be done on a per-payload basis.

At the same time, we would expect any published token to be the signed token, so we would not bother validating non-signed tokens. So all the schemas would all be verifying "signed tokens".

CMCDragonkai commented 1 year ago

There's a problem with publishing the SignedClaim<ClaimLinkIdentity>.

It's the same problem that the sigchain has with JSON encoding of the binary signatures (which is why I created ClaimHeaderSignatureJSON). It's also why I created this issue: https://github.com/MatrixAI/js-db/issues/58

In the sigchain we ended up storing the JSON encoding of the buffer.

So the problem is this, we have SignedToken and SignedTokenEncoded. The encoded version is a general JWS JSON. But general JWS JSON is not human readable, the payload is encoded.

So what we really want is something in between. Something where the payload isn't encoded into base64url, but where the signatures are encoded into base64url.

This means JWS is just not well designed for this use case - look, this is just not human readable:

```
{
  "payload": "SXTigJlzIGEgZGFuZ2Vyb3VzIGJ1c2luZXNzLCBGcm9kbywgZ29pbmcgb3V0IHlvdXIgZG9vci4gWW91IHN0ZXAgb250byB0aGUgcm9hZCwgYW5kIGlmIHlvdSBkb24ndCBrZWVwIHlvdXIgZmVldCwgdGhlcmXigJlzIG5vIGtub3dpbmcgd2hlcmUgeW91IG1pZ2h0IGJlIHN3ZXB0IG9mZiB0by4",
  "signatures": [
    {
      "protected": "eyJhbGciOiJSUzI1NiJ9",
      "header": {
        "kid": "bilbo.baggins@hobbiton.example"
      },
      "signature": "MIsjqtVlOpa71KE-Mss8_Nq2YH4FGhiocsqrgi5NvyG53uoimic1tcMdSg-qptrzZc7CG6Svw2Y13TDIqHzTUrL_lR2ZFcryNFiHkSw129EghGpwkpxaTn_THJTCglNbADko1MZBCdwzJxwqZc-1RlpO2HibUYyXSwO97BSe0_evZKdjvvKSgsIqjytKSeAMbhMBdMma622_BG5t4sdbuCHtFjp9iJmkio47AIwqkZV1aIZsv33uPUqBBCXbYoQJwt7mxPftHmNlGoOSMxR_3thmXTCm4US-xiNOyhbm8afKK64jU6_TPtQHiJeQJxz9G3Tx-083B745_AfYOnlC9w"
    },
    {
      "header": {
        "alg": "ES512",
        "kid": "bilbo.baggins@hobbiton.example"
      },
      "signature": "ARcVLnaJJaUWG8fG-8t5BREVAuTY8n8YHjwDO1muhcdCoFZFFjfISu0Cdkn9Ybdlmi54ho0x924DUz8sK7ZXkhc7AFM8ObLfTvNCrqcI3Jkl2U5IX3utNhODH6v7xgy1Qahsn0fyb4zSAkje8bAWz4vIfj5pCMYxxm4fgV3q7ZYhm5eD"
    },
    {
      "protected": "eyJhbGciOiJIUzI1NiIsImtpZCI6IjAxOGMwYWU1LTRkOWItNDcxYi1iZmQ2LWVlZjMxNGJjNzAzNyJ9",
      "signature": "s0h6KThzkfBBBkLspW1h84VsJZFTsPPqMDA7g1Md7p0"
    }
  ]
}
```

We're already augmenting the JWS spec, so I guess at this point we might as well go one step further. We will have a "human readable" format.

The main reason why the JWS general/JSON format still base64url encodes the payload is due to this https://github.com/MatrixAI/Polykey/issues/481#issuecomment-1286527334. As in, JWS's payload could be non-JSON, it could be anything else. But for our use case, the payload is JSON and in fact it's meant to be a proper JWT. But the end result is that we get a non-human readable payload. There should have been a JWT format (in JWS) that had human readable messages. Maybe something like a "Super-General" format, not just the General format.

So... we go from compact, to flattened, to general, to a human readable format. In such a format, one would argue that neither the payload nor the protected headers should be encoded. Nothing should be encoded EXCEPT the binary signatures and MAC codes.

Let's create a new format for JWT-JWS: the human readable format. This format doesn't have anything base64url encoded except for signatures.
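A sketch of what that could look like (the type name is an assumption):

```ts
type SignedTokenHumanReadable = {
  payload: Record<string, unknown>; // plain JSON claims, not base64url
  signatures: Array<{
    protected: { alg: string; kid?: string } & Record<string, unknown>; // plain JSON header
    signature: string; // base64url of the binary signature or MAC only
  }>;
};
```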

CMCDragonkai commented 1 year ago

I'm thinking that the identity provider plugin can decide what the SignedClaim<ClaimLinkIdentity> will look like on the system. So for a GitHub gist, it makes sense to create a markdown file to be readable, and either keep the claim in a separate file or just embed it within the markdown file.

In that case, it actually means we are likely to have a "human readable" portion dictated by the identity provider, meaning the token itself doesn't have to be that human readable.

It is possible for the identities domain to make use of the claims utilities for separate encoding operations on the signature while leaving other parts as is. Or they just end up posting the SignedTokenEncoded.

I was also thinking that it could take the Token structure... but that's more for an object representing a live token; at the point of the identity provider, they should just be taking the realised structure.

CMCDragonkai commented 1 year ago

Ok I actually tried to solve this in a couple ways.

  1. The first thing I wanted to do is to ensure that upon publishing the signed claim, the returned signed claim should be equal to the actual structure of the claim that is published, even if it is not equal to the input signed claim. This ended up calling decodeSignedClaim afterwards, but it seems kind of inefficient and easy to forget to do.
  2. The second thing was attempting to create a sort of JSONValue type which only allows values that can be cleanly encoded to JSON. This seemed to work until I had to deal with objects.
interface ToJSON {
  toJSON: (key?: any) => string;
}

type JSONValue =
  { [key: string]: JSONValue } |
  Array<JSONValue> |
  ToJSON |
  string |
  number |
  boolean |
  null;

The inability to enforce that undefined is not allowed on the ToJSON objects is problematic.

It just ends up being a very strict JSON input; it's just types limiting what JSON.stringify can take, and at the end of the day, we're still not really banning objects.

It seems what we need is a function to do what stringify would do, but not actually turn it into a string.

It's either we do it at runtime, meaning we dry run stringify and return that structure (like filterForJSON), or we do it statically by preventing certain data.

If we don't do this, we run the risk of potentially returning a SignedClaim that isn't actually equal to the SignedClaim when read from either sigchain or identity provider.

This could be remedied by providing a special equality function that discards any undefined properties and converts null... etc.

I guess the issue is that SignedClaim isn't really a pure POJO, especially with the signature buffers, and there isn't any type-level constraint on what values can be in the payload since it's any.

A way to limit it to primitives is something like:

type JSONValue =
  { [key: string]: JSONValue } |
  Array<JSONValue> |
  string |
  number |
  boolean |
  null;

type TokenPayload = {
  iss?: string;
  sub?: string;
  aud?: string | Array<string>;
  exp?: number;
  nbf?: number;
  iat?: number;
  jti?: string;
  [key: string]: JSONValue;
};
CMCDragonkai commented 1 year ago

But we do hit this problem https://stackoverflow.com/questions/59896317/typescript-key-index-signature-only-for-all-properties-that-are-not-defined.

So the workaround is this:

type TokenPayload = {
  iss?: string;
  sub?: string;
  aud?: string | Array<string>;
  exp?: number;
  nbf?: number;
  iat?: number;
  jti?: string;
  [key: string]: JSONValue | undefined;
};

This is because ? allows undefined as a value, and we cannot express "rest index signatures" in TS yet.

Seems like it would be nice to have a type that is strict on optional properties, that is if they are optional, then they must not exist. But here we allow top level properties to be undefined.

CMCDragonkai commented 1 year ago

So this ensures that addClaim using ClaimInput is also enforcing that the values must be JSONValue.

However runtime-wise, this may still be a concern, which is why at the end of addClaim we read it out from the DB to return the actual claim stored.

The tests now have to be updated to force the unknown data being generated to be ClaimInput even though we know that they may not be abiding by the JSONValue type.

One issue is that doing this, doesn't work well with DeepReadonly. So at this point we just do a @ts-ignore for that.

CMCDragonkai commented 1 year ago

I think I'm deciding not to do any round-tripping or JSON normalisation here. This is because in the sigchain, you pass in a ClaimInput and you get back a SignedClaim. But in the case of identities, it's already a SignedClaim, so with our JSONValue type it should be expected that this data is already normalised with respect to JSON requirements.

That is, we normalise the inputs, not the outputs. And one doesn't create a signed claim from scratch... at least not from the sigchain.

CMCDragonkai commented 1 year ago

The major change is that Provider.publishClaim uses tokensUtils.encodeSigned in order to turn it into a proper JWS-JWT for publishing. The GitHub gist will be less human readable until we bring in a human readable schema that encompasses that information.

CMCDragonkai commented 1 year ago

Ok so the tokens and claims both now have JSON schema files. These schema files are here because it is expected there would be external validation of this structured data.

A long time ago, when we first started, there was an assumption that we could use JSON schema completely for validation. However now we sort of have 3 validation mechanisms:

  1. JSON schema - this is like using a declarative system like a parser-generator
  2. The validation domain and their parse functions and using matchSync for poor man's pattern matching - these provide a sort of parse-tree kind of errors, you can think of these as top down parsers
  3. The decoding utility functions - these are atomic all or nothing functions, you either get the final datum or undefined

The realisation was that JSON schema is not capable of validating everything we could possibly want. There was a brief exploration of io-ts as an alternative, but it didn't seem to interoperate nicely.

At this point since we end up having lower-level validation routines, one might ask what's the utility of having the schemas at all?

The main benefits were:

  1. Standardised validation routines
  2. Standardised validation documentation
  3. Integration into API (like OpenAPI) assuming a REST API were available

But we aren't really making use of these benefits significantly.

Furthermore, JSON schema isn't that simple. There's the issue of an additional layer of modules: JSON schemas have to be able to reference other JSON schemas, and this referencing relies on the JSON schema compiler providing a module system. So far, with ajv you either have a static mechanism or a dynamic one, but dynamic means the schemas are only available after asynchronous loading. The only way to do this elegantly is to be able to use top level await.

At this point, our JSON schemas are kind of useless. We can't really expose them via any URL endpoint, they aren't actually used in any API (our RPC is still GRPC using protobufs, and moving forward we may use RPC with something other than JSON), and they are hard to maintain.

The linkage between TS and JSON schema is difficult to maintain; the JSONSchemaType utility type is just too complex to use. So I'm removing it in claims/schema.ts.

Should we even be maintaining JSON schema files at all? We could just write procedural validation routines like we have been doing with the validation domain, parsers, and decoding functions.

In a way, we can continue to use all 3 while understanding their roles:

  1. Using a declarative parser generator - JSON schema
  2. Manual top down parser - validation domain
  3. Lexical tokens - the decoding utility functions

But I get the feeling we should consolidate, especially if we aren't really going to benefit from having a standardised schema to present to the end consumer who depends on the RPC API.

CMCDragonkai commented 1 year ago

I think for the tokens and claims, these should be rolled into the validation routines. We have no real need to keep them as JSON schema, since these are never going to be exposed at an API level (like I have done before with OpenAPI). JSON schema can be useful once a JSON RPC is provided (or a variant of JSON like CBOR to enable binary data RPC).

CMCDragonkai commented 1 year ago

I think top-level await will be needed before we can really make good use of the JSON schemas, mostly at the highest-level validation routines involving the API, where we get a good RoI for doing something like OpenAPI.

This means all the JSON schemas in tokens, claims, notifications and status should just be removed in favour of top-down parsers in the validation domain.

This can be done for a later work though as I'm continuing down the path of fixing up token usage.

If we proceed with this, we may want to either centralise all the parsers into the validation domain (they currently make use of all the decoding functions anyway), or spread the parsers out across the domains. It depends... In many cases we want to be able to do validationUtils.parseX where X could be anything, that is, centralised access. But in other cases, we just want to validate a single object for a given domain, and we are usually just using decoding functions, though that may not be suitable.

They shouldn't be importing the validation utils, as validation is a highly connected module, which means any breakage in any part of the code will break everything that depends on the validation utilities.

So replacing JSON schemas with validation parsers should also end up decentralising the parsers across the codebase, while the validation domain can re-export those parsers so they can be used in one place.


The ids is centralised because it doesn't depend on anything else. So it is a good common factor.

The validation should be decentralised (but it can re-export in a centralised way) because it actually ends up depending on everything else. It ends up being a schwerpunkt, a sort of nexus that creates a Single Point of Cascading Failure (a keystone? a crux?). We want to avoid centres of gravity here in our codebase, and validation is one of them. The ids and utils by comparison are not a problem, because they are self-contained.

MatrixAI/MatrixAI-Graph#44 has been created to address input validation refactoring and the decentralising of the validation domain.

CMCDragonkai commented 1 year ago

It looks like the identities domain has been fixed for now. It's time to move to the gestalts and discovery and notifications.

CMCDragonkai commented 1 year ago

With regards to MatrixAI/MatrixAI-Graph#44, one important thing to realise is that JSON schemas are limited to JSON data, not to arbitrary JS data. Arbitrary JS POJOs have to use validation parsers, not the JSON schema.

This is another reason it only makes sense to have JSON schemas at the program boundaries, on the actual serialised JSON data going in and out, rather than on arbitrary data.

Thus SignedToken is not something that can be validated with JSON schema, because it can contain non-JSON values; in particular the signature there is a buffer.

Only SignedTokenEncoded can be validated with JSON schema, because that's the only thing that is actual input/output JSON. The SignedToken is an internal type within PK, not an external type.

CMCDragonkai commented 1 year ago

I've started on this idiom:

  1. encodeX/decodeX is used for "lexing" all-or-nothing types
  2. generateX/parseX is for generating and parsing structured types with exceptions as parse errors
  3. validateX - is used by JSON schema, and only for "external types" and only JSON

The X is always the internal type name. So generateTokenPayload and parseTokenPayload means to generate a serialised representation and to parse the serialised representation respectively.

I'm using the unknown type for the parsing functions, and I'm also waiting on TS 4.9.x to be released so we can get proper usage of 'x' in o type guards: https://github.com/microsoft/TypeScript/issues/21732
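An illustrative sketch of the generateX/parseX half of the idiom for TokenPayload (bodies simplified; the real implementations would be stricter):

```ts
function generateTokenPayload(payload: TokenPayload): string {
  // Generate the serialised representation
  return JSON.stringify(payload);
}

function parseTokenPayload(data: unknown): TokenPayload {
  // Parse with exceptions as parse errors
  if (typeof data === 'string') data = JSON.parse(data);
  if (typeof data !== 'object' || data === null || Array.isArray(data)) {
    throw new TypeError('TokenPayload must be a JSON object');
  }
  return data as TokenPayload;
}
```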

CMCDragonkai commented 1 year ago

The tokens domain is now fully tested with fast-check. The only schema left is the SignedTokenEncoded, as this is the only structure that is an external type. All other types are internal types, so the parseX functions are used to parse data into them.

CMCDragonkai commented 1 year ago

I had to change to using Readonly instead of DeepReadonly, the DeepReadonly just resulted in infinite type recursion.

CMCDragonkai commented 1 year ago

All the JSON schema validations for singly signed or doubly signed claims are removed. Instead the Token class will be used for verifying things by Sigchain, identities or gestalts.

It's also the case that we aren't special-casing any of these claims. Instead claims are built up on top of tokens, and all token operations still work on claims.

CMCDragonkai commented 1 year ago

If we have types SignedClaimLinkIdentity and SignedClaimLinkNode, we can enforce that a link identity has 1 signature and a link node has 2 signatures. But I'm not sure if we really need these special types. Right now claims are just specialised payloads extending TokenPayload. The constraint of being a signed claim should just be enforced on entry into the PK system.

CMCDragonkai commented 1 year ago

The normal Claim and TokenPayload, or even SignedClaim objects, cannot be sent over the wire because they contain unserialised data, like signatures. They must be turned into the encoded versions before being sent, as the encoded versions actually align with the JWS general format.

CMCDragonkai commented 1 year ago

These GRPC related utilities are left here for posterity until I can adapt them to the new structure:

```ts
/**
 * Constructs a CrossSignMessage (for GRPC transfer) from a singly-signed claim
 * and/or a doubly-signed claim.
 */
function createCrossSignMessage({
  singlySignedClaim = undefined,
  doublySignedClaim = undefined,
}: {
  singlySignedClaim?: ClaimIntermediary;
  doublySignedClaim?: ClaimEncoded;
}): nodesPB.CrossSign {
  const crossSignMessage = new nodesPB.CrossSign();
  // Construct the singly signed claim message
  if (singlySignedClaim != null) {
    // Should never be reached, but for type safety
    if (singlySignedClaim.payload == null) {
      throw new claimsErrors.ErrorClaimsUndefinedClaimPayload();
    }
    const singlyMessage = new nodesPB.ClaimIntermediary();
    singlyMessage.setPayload(singlySignedClaim.payload);
    const singlySignatureMessage = new nodesPB.Signature();
    singlySignatureMessage.setProtected(singlySignedClaim.signature.protected!);
    singlySignatureMessage.setSignature(singlySignedClaim.signature.signature);
    singlyMessage.setSignature(singlySignatureMessage);
    crossSignMessage.setSinglySignedClaim(singlyMessage);
  }
  // Construct the doubly signed claim message
  if (doublySignedClaim != null) {
    // Should never be reached, but for type safety
    if (doublySignedClaim.payload == null) {
      throw new claimsErrors.ErrorClaimsUndefinedClaimPayload();
    }
    const doublyMessage = new nodesPB.AgentClaim();
    doublyMessage.setPayload(doublySignedClaim.payload);
    for (const s of doublySignedClaim.signatures) {
      const signatureMessage = new nodesPB.Signature();
      signatureMessage.setProtected(s.protected!);
      signatureMessage.setSignature(s.signature);
      doublyMessage.getSignaturesList().push(signatureMessage);
    }
    crossSignMessage.setDoublySignedClaim(doublyMessage);
  }
  return crossSignMessage;
}

/**
 * Reconstructs a ClaimIntermediary object from a ClaimIntermediaryMessage (i.e.
 * after GRPC transport).
 */
function reconstructClaimIntermediary(
  intermediaryMsg: nodesPB.ClaimIntermediary,
): ClaimIntermediary {
  const signatureMsg = intermediaryMsg.getSignature();
  if (signatureMsg == null) {
    throw claimsErrors.ErrorUndefinedSignature;
  }
  const claim: ClaimIntermediary = {
    payload: intermediaryMsg.getPayload(),
    signature: {
      protected: signatureMsg.getProtected(),
      signature: signatureMsg.getSignature(),
    },
  };
  return claim;
}

/**
 * Reconstructs a ClaimEncoded object from a ClaimMessage (i.e. after GRPC
 * transport).
 */
function reconstructClaimEncoded(claimMsg: nodesPB.AgentClaim): ClaimEncoded {
  const claim: ClaimEncoded = {
    payload: claimMsg.getPayload(),
    signatures: claimMsg.getSignaturesList().map((signatureMsg) => {
      return {
        protected: signatureMsg.getProtected(),
        signature: signatureMsg.getSignature(),
      };
    }),
  };
  return claim;
}
```
CMCDragonkai commented 1 year ago

The claims/errors is left over. The new claims domain doesn't use it at all. These are probably going to be removed when we track down all references.

CMCDragonkai commented 1 year ago

The gestalts domain can use some documentation.

  /**
   * Gestalt adjacency matrix represented as a collection of each vertex
   * mapping to the set of adjacent vertexes.
   * Kind of like: `{ a: { b, c }, b: { a, c }, c: { a, b } }`.
   * Each vertex can be `GestaltNodeKey` or `GestaltIdentityKey`.
   * `GestaltGraph/matrix/{GestaltKey} -> {json(GestaltKeySet)}`
   */
  protected gestaltGraphMatrixDbPath: LevelPath = [
    this.constructor.name,
    'matrix',
  ];

  /**
   * Node information
   * `GestaltGraph/nodes/{GestaltNodeKey} -> {json(GestaltNodeInfo)}`
   */
  protected gestaltGraphNodesDbPath: LevelPath = [
    this.constructor.name,
    'nodes',
  ];

  /**
   * Identity information
   * `GestaltGraph/identities/{GestaltIdentityKey} -> {json(GestaltIdentityInfo)}`
   */
  protected gestaltGraphIdentitiesDbPath: LevelPath = [
    this.constructor.name,
    'identities',
  ];

Now it is possible to actually make the gestaltGraphMatrixDbPath more efficient.

We could instead make use of the multilevels now and do something like this:

GestaltGraph/matrix/{GestaltKey}/{GestaltKey} -> null

This would enable us to manipulate each gestalt without having to load the entire JSON structure. The GestaltKeySet could still be used if we need to return the set...

It could also be applied to the gestaltGraphNodesDbPath and gestaltGraphIdentitiesDbPath as it's possible to store each property from GestaltNodeInfo and GestaltIdentityInfo directly into the DB.

CMCDragonkai commented 1 year ago

The purpose of the GestaltNodeInfo and GestaltIdentityInfo is to actually store the information required by social discovery. Discovery by itself is stateless; as it discovers information, this gets put into the gestalts DB.

The GestaltGraph also abstracts over the ACL in that it also changes the permissions as things are linked up. This is why it has things like setGestaltActionByNode and related.

In a way, this is because the ACL has to apply to whole gestalts, and the ACL isn't aware of the gestalts, but the gestalts domain is aware of the ACL. This dataflow relationship was always a bit iffy, and it could be reversed if gestalt changes could be observed and ACL permissions could subscribe to them... But that is to be solved later.

CMCDragonkai commented 1 year ago

Assuming we rename NodeInfo to GestaltNodeInfo and IdentityInfo to GestaltIdentityInfo...

These methods:

GestaltGraph.setNode
GestaltGraph.linkNodeAndNode

All take the NodeInfo.

One thing I don't like about this is that there's nothing ensuring consistency between the information in NodeInfo and the linkages or breakages in the GestaltGraph.

If the NodeInfo and IdentityInfo are meant to store information about their cryptolinks, then it is easy for these to become inconsistent. The NodeInfo stored may have no cryptolinks (no link claims) and yet still be linked up in the gestalt graph.

I think the original idea, is that for any link between the gestalt graph, there must be a corresponding cryptolink/claim that is also recorded by the gestalt graph.

This means rather than just storing the vertexes (and associating adjacency by position), we actually need to store their edges too. The edge information must be first-class and be derived or equal to the cryptolink claims.

CMCDragonkai commented 1 year ago

If we move to using:

GestaltGraph/matrix/{GestaltKey}/{GestaltKey} -> null

It would then be possible to store edge information where that null is.

However this information may be duplicated.

GestaltGraph/matrix/A/B -> EdgeAB
GestaltGraph/matrix/B/A -> EdgeAB

We've solved this before with a point of indirection, like in the ACL. A common ID where these vertex pairs map to an EdgeId that then provides additional information:

GestaltGraph/matrix/A/B -> 1
GestaltGraph/matrix/B/A -> 1
GestaltGraph/edges/1 -> EdgeInfo

Then it's also possible to GC the edges; the edges are bidirectional anyway, so deleting any one vertex pair also deletes the opposite pair.

This now gives us an opportunity to store information about each vertex, and also information about each edge.

This means we don't store an entire copy of the sigchain in each vertex's info. In fact, I'm not entirely sure what should be put into the vertex info at this point in time. The NodeId or ProviderIdentityId provides what we would need to look up information. However we could also cache other details in the vertexes.

Now the edges themselves can store information, and we can derive a structure from ClaimLinkNode and ClaimLinkIdentity.

We could store the entire claim, since they just contain information like:

{
  jti: ClaimIdEncoded;
  iat: number;
  nbf: number;
  seq: number;
  prevClaimId: ClaimIdEncoded | null;
  prevDigest: string | null;
  iss: NodeIdEncoded;
  sub: ProviderIdentityIdEncoded;
}

However, no signature data is available here. Should we store the signatures as well? I feel like we should be doing this.

That would then look like this:

{
  payload: {
    jti: ClaimIdEncoded;
    iat: number;
    nbf: number;
    seq: number;
    prevClaimId: ClaimIdEncoded | null;
    prevDigest: string | null;
    iss: NodeIdEncoded;
    sub: ProviderIdentityIdEncoded;
  },
  signatures: Array<TokenHeaderSignature>;
}

Now there is a problem similar to the Sigchain's: storing these signed claims has an issue with the signature component. Unless we were storing them as SignedClaimEncoded, the TokenSignature is a buffer.

In the Sigchain, this is solved by specially encoding the signature component into a JSON representation and back. Buffers do have a JSON representation, it's just a bit inefficient, since each byte is stored as a number rather than using base encoding. And we are doing this until the DB one day supports binary JSON like BSON or COSE, which would obviate the need for this, since the DB itself could recognise buffers and deal with them specially.
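For reference, the Buffer JSON round trip being described (each byte stored as a number in the data array):

```ts
const sig = Buffer.from([0xde, 0xad, 0xbe, 0xef]);
const json = JSON.stringify(sig); // '{"type":"Buffer","data":[222,173,190,239]}'
const restored = Buffer.from(JSON.parse(json).data);
console.log(restored.equals(sig)); // true
```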

CMCDragonkai commented 1 year ago

If the edge info were to be stored directly into the DB without JSON encoding, it would be quite complex due to the nested structure of the claims. So I think we should stick with JSON in that case.

As for the node info and identity info, they can include IdentityData and NodeData information. Would the lastUpdated property of NodeData be the same as for the node graph? We don't know. The gestalt graph may need to store TTLs for the node vertexes, TTLs for identity vertexes, and TTLs for all the edges.

One thing about the edge info does require a bit of a change.

One problem is that link identity claims have additional metadata that cannot be in the claim itself. This is reflected by a new type:

/**
 * Identity claims wraps `SignedClaim<ClaimLinkIdentity>`.
 * The signed `claim` is what is published and also stored in the `Sigchain`.
 * Additional metadata `id` and `url` is provided by the identity provider.
 * These metadata properties would not be part of the signed claim.
 */
type IdentitySignedClaim = {
  id: ProviderIdentityClaimId;
  url?: string;
  claim: SignedClaim<ClaimLinkIdentity>;
};

This means the edge info isn't just a SignedClaim<ClaimLinkIdentity>. It may need to be some additional structure like IdentitySignedClaim.

We may need to specialise edge types like we do with vertex types. So it could be something like:

type GestaltLinkId = Opaque<'GestaltLinkId', Id>;
type GestaltLinkIdString = Opaque<'GestaltLinkIdString', string>;

type GestaltLink = GestaltLinkNode | GestaltLinkIdentity;
type GestaltLinkNode = ...;
type GestaltLinkIdentity = ...;

We would have to ensure that we can differentiate the 2 kinds of links as well.