Closed SmithSamuelM closed 3 years ago
Unlike schema.org and JSON-LD, JSON Schema provides convenient support for completely self-contained custom schema and embedded custom schema via schema metadata fields. The emphasis is on the word convenient. Any field in a JSON Schema that is not included in the properties field is metadata. This allows a given schema definition to be self-describing. More importantly, it allows a schema definition to include metadata that may be cryptographically committed to via a cryptographic digest of the whole schema, including its metadata fields. When the schema identifier used externally is a digest of all or part of the schema, that identifier provides a way to verify the integrity of any copy of the schema, thereby ensuring immutability.
A version field in the metadata allows an immutable commitment to a given semantic version, but the actual version is the digest. A different digest means a different version.
For example:
{
  "$id": "did-of-schema-itself",
  "$schema": "did-of-schema-repository",
  "title": "Human Friendly title of schema",
  "version": "1.0.0",
  "type": "object",
  "properties": {}
}
The advantage of a content addressable identifier is that it is cryptographically bound to the contents. It provides a secure root-of-trust. Any cryptographic commitment to a content addressable identifier is functionally equivalent (given comparable cryptographic strength) to a cryptographic commitment to the content itself.
A self-addressing identifier is a special class of content-addressable identifier that is also self-referential. The special class is distinguished by a special derivation method or process used to generate the self-addressing identifier. This derivation method is determined by the combination of a derivation code prefix included in the identifier and the context in which the identifier appears. The reason for a special derivation method is that a naive cryptographic content-addressable identifier must not be self-referential, i.e. the identifier must not appear within the contents that it identifies. This is because the naive derivation process for a content-addressable identifier is a cryptographic digest of the serialized content. Changing one bit of the serialized content results in a different digest, so embedding the digest in the content would change the very digest being embedded. A special derivation method or process is therefore required.
The process assumes that the fields in a JSON serialization are ordered in a stable, round-trippable, reproducible order. This is property creation order, otherwise known as field insertion order. This logical ordering has been supported by the JSON.stringify() method in JavaScript with a custom reordering function using Reflect.ownKeys() since ES6, and natively without a reordering function by JSON.stringify() since ES11. Property creation order is a natural canonicalization because it supports any predefined logical ordering: one simply creates the properties in that predefined order. Other languages have long supported ordered dictionaries or mappings that preserve the insertion order of the fields when serializing the mapping to and from JSON. The canonical serialization problem for JSON is now a solved problem.
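As a minimal sketch of insertion-order serialization in another language, the Python snippet below relies on the standard json module and on Python dicts preserving insertion order. The helper name serialize and the compact separators are assumptions chosen for illustration, not something the issue prescribes.

```python
import json

def serialize(content: dict) -> bytes:
    # Python dicts preserve insertion order (guaranteed since 3.7), and
    # json.dumps emits keys in that order unless sort_keys=True is passed.
    # Compact separators drop optional whitespace so the bytes are stable.
    return json.dumps(content, separators=(",", ":"), ensure_ascii=False).encode("utf-8")

# Fields are created in the intended logical order.
schema = {"$id": "", "title": "Example", "version": "1.0.0"}
raw = serialize(schema)

# Round trip: parsing and re-serializing reproduces exactly the same bytes.
assert serialize(json.loads(raw)) == raw
```

Because the digest is computed over these bytes, every party has to agree on the same serialization rules (field order, whitespace, encoding) for the round trip to be reproducible.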
The self-addressing identifier derivation process is as follows:
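- Replace the value of the id field in the content that will hold the self-addressing identifier with a dummy string of the same length as the eventually derived self-addressing identifier.
- Compute the digest of the serialized content, including the dummy value for the id field.
- Replace the dummy value in the id field with the self-addressing identifier derived from that digest.

As long as any verifier recognizes the derivation method, the self-addressing identifier is a cryptographically secure commitment to the contents in which it is embedded. It is a cryptographically verifiable self-referential content-addressable identifier.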
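The derivation process can be sketched in Python as follows. This is an illustration under stated assumptions, not the normative algorithm: SHA-256, unpadded base64url encoding, the one-character prefix "X" standing in for a derivation code, and the "#" dummy character are all placeholders chosen for the example.

```python
import base64
import hashlib
import json

CODE = "X"        # hypothetical derivation code prefix (placeholder)
DUMMY_CHAR = "#"  # assumed filler character for the dummy string
SAID_LEN = 44     # 1 prefix char + 43 unpadded base64url chars (32-byte digest)

def serialize(content: dict) -> bytes:
    # Insertion-order, whitespace-free JSON serialization.
    return json.dumps(content, separators=(",", ":"), ensure_ascii=False).encode("utf-8")

def derive(content: dict, field: str = "id") -> dict:
    # Fill the id field with a dummy string of the final identifier length.
    # dict(content) copies the mapping and keeps its insertion order.
    draft = dict(content)
    draft[field] = DUMMY_CHAR * SAID_LEN
    # Digest the serialization that contains the dummy value.
    digest = hashlib.sha256(serialize(draft)).digest()
    encoded = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    # Substitute the derived identifier for the dummy value.
    draft[field] = CODE + encoded
    return draft

doc = derive({"id": "", "name": "example"})
assert len(doc["id"]) == SAID_LEN
# Re-running the derivation on the finished content yields the same identifier.
assert derive(doc) == doc
```

The dummy string must have the same length as the final identifier so that the substitution in the last step does not change the size of the serialized content.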
Because a self-addressing identifier is both self-referential and cryptographically bound to the contents it identifies, anyone can validate this binding if they follow the binding protocol outlined above.
To elaborate, we call this approach of deriving self-referential identifiers from the contents they identify self-addressing. It allows a verifier to verify or re-derive the self-referential identifier given the contents it identifies. To clarify, a self-addressing identifier is different from a standard content address or content-addressable identifier in that a standard content-addressable identifier may not be included inside the contents it addresses. The standard content-addressable identifier is computed on the finished immutable contents and therefore is not self-referential.
Schema typically include within their contents a self-referential identifier that is not derived from the contents. This poses a problem because there may now be two different identifiers for the same content. The first is the self-referential identifier that is included in the content but not cryptographically bound to it. The second is the cryptographically bound identifier that is not included in the content. When reasoning about the content, the existence of a non-cryptographically bound identifier is a security vulnerability. The self-referential identifier certainly may not be used on its own to securely reason about the content, because it is not bound to the content. Anyone can place such an identifier inside a different schema and claim that the different schema is the correct schema for that identifier. A standard content-addressable identifier, on the other hand, may not be included in the schema itself and must be tracked independently. A self-addressing identifier, by contrast, is included in the content, which makes the schema fully self-contained with a cryptographically bound identifier.
We may apply the derivation process above to the $id field in a JSON Schema. Thus the schema identifier itself will be derived from all the content of the schema except for the $id field itself. To replay the process: first replace the value of the $id field with a string filled with dummy characters of the same length as the eventual derived value for $id. Then make a digest of the serialized schema contents that include the dummy value for $id. Then substitute the derived identifier for the dummy identifier into the schema contents. Thus the $id of the schema will be strongly bound to the contents of everything in the schema except for the identifier itself in a universally verifiable way. This derivation approach allows a self-referential identifier embedded in the schema to also be cryptographically bound to the contents of that schema.
Thus only one identifier need ever be used to securely reason about the schema, and that is the self-addressing one. This also makes the schema immutable with respect to its self-addressing identifier. After any change to the schema, the identifier will no longer be derivable from that schema. This provides a secure root-of-trust for reasoning about immutable schema.
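The following sketch applies the same idea to the $id field of the example schema and shows how a verifier could detect a change. As before, SHA-256, unpadded base64url, the "X" prefix, the "#" dummy character, and the helper names said_of and verify are assumptions made for illustration only.

```python
import base64
import hashlib
import json

CODE = "X"        # hypothetical derivation code prefix (placeholder)
DUMMY_CHAR = "#"  # assumed filler character for the dummy string
SAID_LEN = 44     # 1 prefix char + 43 unpadded base64url chars (32-byte digest)

def said_of(schema: dict, field: str = "$id") -> str:
    # Digest the schema with the $id value replaced by the dummy string.
    draft = dict(schema)
    draft[field] = DUMMY_CHAR * SAID_LEN
    raw = json.dumps(draft, separators=(",", ":")).encode("utf-8")
    digest = hashlib.sha256(raw).digest()
    return CODE + base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")

def verify(schema: dict, field: str = "$id") -> bool:
    # Re-derive the identifier and compare it with the embedded one.
    return schema.get(field) == said_of(schema, field)

schema = {
    "$id": "",
    "$schema": "did-of-schema-repository",
    "title": "Human Friendly title of schema",
    "version": "1.0.0",
    "type": "object",
    "properties": {},
}
schema["$id"] = said_of(schema)
assert verify(schema)

# Any change to the content breaks the binding to the embedded identifier.
changed = dict(schema, version="1.0.1")
assert not verify(changed)

# The changed schema needs its own, different identifier.
changed["$id"] = said_of(changed)
assert verify(changed) and changed["$id"] != schema["$id"]
```

In this sketch the version field is ordinary content, so bumping it necessarily produces a new identifier, which matches the statement that a different digest means a different version.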
This issue was addressed in PR #4.
JSON with Immutable JSON Schema
The current VC examples include an @context. This will be removed. Either the @context is ignored, which makes it pointless, or it is not ignored, which makes the VC insecure. In either case @context should not be used.
Problems with Linked Data based VCs
Linked data, i.e. JSON-LD based, verifiable credentials require an @context. But this requirement makes them inherently insecure because the @context points to mutable schema. In general, @context schema are derivatives of or extensions to schema.org schema. Schema.org schema are mutable. The identifiers of schema.org schema do not have any inherent way to enforce immutability. Given such externally referenced mutable schema, a verifiable signature on a linked data verifiable credential ensures neither the integrity nor the authenticity of the credential. This means that the over-the-wire VC may be subject to exploits that do not require compromise of the signing keys or the signature verification code. In other words, merely protecting private keys and signature verification code may be insufficient to guarantee security and secure verifiability. This may greatly expand the attack surface and is therefore a serious design flaw for any linked data based verifiable credential system.
There are a couple of workarounds that one might suggest to address the mutability of the schema.org derived schema. One is to cache the schema and then use a hash link to reference the cached schema. A hash link includes a cryptographic digest of the associated resource or data. But a hash link by itself is insufficient to ensure verifiability of the schema. The schema itself may mutate, and the hash link merely makes that mutation detectable. The credential itself is still not verifiable unless the un-mutated schema is available someplace. This then imposes a forever constraint on the availability of the cached schema. An attacker merely needs to attack schema.org, or any local caches, to invalidate all linked data based VCs. We believe local caches to be an unreasonably cumbersome approach, and reliance on schema.org to be an unreasonably insecure approach. The root problem is using schema.org in the first place, which was never designed to provide immutable schema for secure verifiable credentials but was designed for the now largely obsolete semantic web.
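To illustrate why a hash link only detects mutation, here is a simplified Python sketch. It uses a plain SHA-256 hex digest as the link's digest, which is an assumption for brevity and not the encoding of any particular hash link specification; the function name matches_hash_link and the sample schema bytes are hypothetical.

```python
import hashlib

def matches_hash_link(resource: bytes, expected_hex: str) -> bool:
    # True only if the fetched bytes still hash to the digest in the link.
    return hashlib.sha256(resource).hexdigest() == expected_hex

cached = b'{"title":"Person","version":"1.0.0"}'
link_digest = hashlib.sha256(cached).hexdigest()

# The remote copy has since been changed.
mutated = b'{"title":"Person","version":"1.0.1"}'

assert matches_hash_link(cached, link_digest)
assert not matches_hash_link(mutated, link_digest)
```

The failed check tells the verifier that the schema changed, but it cannot recover the original bytes. Verification still depends on someone keeping the un-mutated copy available.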
Another workaround is to not sign the over-the-wire schema but to sign the serialized, fully expanded RDF graph that results from processing the VC with its schema. In this case the signature is signing an immutable representation of the VC. However, this inverts the security layering such that one must process the payload (VC) before verifying the integrity and authenticity of that data. The RDF expansion stack now becomes part of the attack surface, not to mention the performance hit associated with the expansion process. Moreover, there are better, more recent, more widely adopted technologies for creating semantic graphs, which makes RDF expansion a poor excuse for the weakened security and performance degradation that result from signing the expanded VC rather than the over-the-wire VC. RDF was designed for the largely obsolete semantic web. Given that VCs are not meant for the semantic web, it seems a poor technology choice to burden VCs with the largely obsolete technology that is linked data, especially when there are more secure, widely adopted, powerful and performant alternatives.
GLEIF Approach
Consequently, GLEIF is building its ecosystem on secure immutable schema using the universally adopted JSON Schema system, but where those schema are secured by using content-addressable cryptographic digest schema identifiers. Those schema may be provided in one of two secure ways. The first is to embed the schema in the verifiable credential itself, thereby removing any availability issues. The second is to provide the schema via one or more highly available registries that ensure the immutable schema are indefinitely available. These verifiable data registries may be the same registries that host the issuance and revocation registries for the verifiable credentials themselves. The issuance and revocation registries must be both highly and indefinitely available. Thus no new meaningfully different burden is placed on the VC system to also support highly available immutable schema. The immutable schema registries also provide explicit extensible interoperability of schema type definitions, as is the industry standard approach used by IANA class registries. The difference is that the schema type definitions for VCs must be immutable, and any change to a type definition requires a new identifier for that new type definition. Anything less exposes the VC system to security vulnerabilities.
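A registry of immutable schema can be thought of as a store keyed by the self-addressing identifier. The sketch below is a toy in-memory model, not GLEIF's implementation: the SchemaRegistry class, its publish and resolve methods, SHA-256, unpadded base64url, the "X" prefix, and the "#" dummy character are all hypothetical choices for illustration.

```python
import base64
import hashlib
import json

CODE, DUMMY_CHAR, SAID_LEN = "X", "#", 44  # placeholders, see lead-in

def serialize(schema: dict) -> bytes:
    return json.dumps(schema, separators=(",", ":")).encode("utf-8")

def said_of(schema: dict, field: str = "$id") -> str:
    # Digest the schema with the $id value replaced by the dummy string.
    draft = dict(schema)
    draft[field] = DUMMY_CHAR * SAID_LEN
    digest = hashlib.sha256(serialize(draft)).digest()
    return CODE + base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")

class SchemaRegistry:
    """Toy in-memory registry keyed by self-addressing identifier."""

    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}

    def publish(self, schema: dict) -> str:
        # Accept only schema whose embedded $id is derivable from the content.
        said = said_of(schema)
        if schema.get("$id") != said:
            raise ValueError("embedded $id does not match schema content")
        # An existing entry is never overwritten.
        self._store.setdefault(said, serialize(schema))
        return said

    def resolve(self, said: str) -> dict:
        # Re-verify on the way out so a corrupted store is detected.
        schema = json.loads(self._store[said])
        if said_of(schema) != said:
            raise ValueError("stored schema no longer matches its identifier")
        return schema

registry = SchemaRegistry()
v1 = {"$id": "", "title": "Example", "version": "1.0.0", "type": "object", "properties": {}}
v1["$id"] = said_of(v1)
v2 = dict(v1, version="1.1.0")
v2["$id"] = said_of(v2)

id1, id2 = registry.publish(v1), registry.publish(v2)
assert id1 != id2                    # a changed definition gets a new identifier
assert registry.resolve(id1) == v1   # the original definition stays retrievable
```

Since the identifier is embedded in the schema, the same verification works whether the schema is fetched from such a registry or embedded directly in the credential.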