SIP 1 - Efficient Operation Mapping

csuwildcat commented 4 years ago

  SIP: 1
  Upgrade-Type: Hard Fork (to guarantee outcomes)
  Title: Efficient DID Operation Mapping
  Author: Daniel Buchner <daniel.buchner@microsoft.com>
  Comments-Summary: No comments yet.
  Comments-URI: https://github.com/decentralized-identity/sidetree/sips/1.md
  Status: Draft
  Created: 2020-06-23

Summary

By segregating the proving data contained in the operation entries currently housed in the Anchor File and Map File (for Recovery, Deactivate, and Update operations), it is possible to realize a rather dramatic ~75% reduction in the minimum dataset required to trustlessly resolve DIDs.

The effect of moving this data to a segregated Proving File is that the Anchor File and Map Files become lightweight, spam-protected operation indexes, allowing for deferred acquisition of Proving Data in a JIT fashion, for nodes of various configurations.

Motivation

These changes would make initialization of many node types faster, more efficient, and most importantly: operationally feasible for the average user-operator. Sustainable operation of nodes across consumer hardware is a key requirement for any decentralized network of this class, thus keeping network storage growth comfortably 'under the line' of the commodity storage technology cost curve and bandwidth growth curves is essential. While such curves lack precision, when one examines the trajectory of storage and bandwidth in reference to the waning cadence of the Kryder's Law and Edholm's Law doubling conjectures, it appears that a 2-3 terabytes per annum growth in the size of the minimum required dataset for a network is the top end of sustainability for a system that features peer-based replication of data and deferral of CPU intensive tasks until a JIT compilation/resolution phase.

Requirements

Target a minimum required dataset upper limit of 2-3 terabytes, assuming one year at a rate of 1000 operations per second.
Push as much data out of the primary indexing files (Anchor and Map Files) as possible.

Technical Proposal

The primary technical changes center around moving proving data out of the Anchor File and Map File, leaving those files to act as bare minimum indexes that enable a node to have global awareness of possible operations for any DID in the system. The proposed changes include the addition of two new intermediary files between the Anchor and Chunk Files. All changes to the existing Anchor and Map Files, as well as the new Proving Files, are as follows:

Anchor File

The Anchor File would be modified in the following ways:

Add a new CAS URI link to a Retained Proving File, which contains the signed operation data that used to exist in the recover and deactivate operation entries.
Add a new CAS URI link to a Transient Proving File, which contains the signed operation data that used to exist in the update operation entries of the Map File.
Modify the create operation across the spec to reflect the fact that the reveal_value is the hash of the hash of the JWK value that is being committed to.
Modify the recover and deactivate operation entries to only include the did_suffix and reveal_value properties. The reveal_value is the hash of the hash of the JWK in the signed_data object that was relocated to the Retained Proving File.

{
  "retained_proving_file": CAS_URI,
  "transient_proving_file": CAS_URI,
  "map_file": CAS_URI,
  "writer_lock_id": OPTIONAL_LOCKING_VALUE,
  "operations": {
    "create": [
      {
        "suffix_data": { // Base64URL encoded
          "delta_hash": DELTA_HASH,
          "recovery_commitment": COMMITMENT_HASH
        }
      },
      {...}
    ],
    "recover": [
      {
        "did_suffix": SUFFIX_STRING,
        "reveal_value": MULTIHASH_OF_JWK
      },
      {...}
    ],
    "deactivate": [
      {
        "did_suffix": SUFFIX_STRING,
        "reveal_value": MULTIHASH_OF_JWK
      },
      {...}
    ]
  }
}

Map File

The Map File would be modified in the following ways:

Modify the update operation entries to only include the did_suffix and reveal_value properties. The reveal_value is the hash of the hash of the JWK in the signed data object that was relocated to the Transient Proving File.

{
  "chunks": [
    { "chunk_file_uri": CHUNK_HASH },
    {...}
  ],
  "operations": {
    "update": [
      {
        "did_suffix": DID_SUFFIX,
        "reveal_value": MULTIHASH_OF_JWK
      },
      {...}
    ]
  }
}

Retained Proving File

The Retained Proving File will contain the following:

The signed_data portion of the recover and deactivate operation entries that used to live in the Anchor File are now present in the operations object under their respective properties, and MUST be ordered in the same index order their corresponding entries are present in the Anchor File.

{
  "operations": {
    "recover": [
      {
        "signed_data": { // Base64URL encoded, compact JWS
          "protected": {...},
          "payload": {
            "recovery_commitment": COMMITMENT_HASH,
            "recovery_key": JWK_OBJECT,
            "delta_hash": DELTA_HASH
          },
          "signature": SIGNATURE_STRING
        }
      },
      {...}
    ],
    "deactivate": [
      {
        "signed_data": { // Base64URL encoded, compact JWS
          "protected": {...},
          "payload": {
            "did_suffix": SUFFIX_STRING,
            "recovery_key": JWK_OBJECT
          },
          "signature": SIGNATURE_STRING
        }
      },
      {...}
    ]
  }
}

Transient Proving File

The Transient Proving File will contain the following:

The signed_data portion of the update operation entries that used to live in the Map File are now present in the operations object under their respective properties, and MUST be ordered in the same index order their corresponding entries are present in the Map File.

{
  "operations": {
    "update": [
      {
        "did_suffix": DID_SUFFIX,
        "signed_data": { // Base64URL encoded, compact JWS
          "protected": {...},
          "payload": {
            "update_key": JWK_OBJECT,
            "delta_hash": DELTA_HASH
          },
          "signature": SIGNATURE_STRING
        }   
      },
      {...}
    ]
  }
}

Operation Data Changes

Commitments are now the hash of the hash of the JWK revealed values, vs just the hash, as they are currently.
The revealed values in the Anchor and Map Files are the hash of the JWK, not the JWK itself, as they are currently.

OR13 commented 4 years ago

@thehenrytsai @Therecanbeonlyone1969 any idea how this growth rate stacks up to bitcoin/ethereum growth rate? obiviously those ledgers do stuff other than DIDs as well, but would be interesting to put "requirements" in the context of other real world production systems.

OR13 commented 4 years ago

should we consider eliminating the base64url encoding at the same time to stretch the storage gain to the limit?

troyronda commented 4 years ago

Suggest renaming "transient" - as the eventual meaning is that it is could be pruneable after checkpoints rather than transient at the current time.

tplooker commented 4 years ago

Suggested alternative syntax for anchor file

{
  "map_file": CAS_URI,
  "writer_lock_id": OPTIONAL_LOCKING_VALUE,
  "operations": {
    "create": [
      {
        "file_ref": CAS_URI,
        "suffix_data": { // Base64URL encoded
          "delta_hash": DELTA_HASH,
          "recovery_commitment": COMMITMENT_HASH
        }
      },
      {...}
    ],
    "recover": [
      {
        "file_ref": CAS_URI,
        "did_suffix": SUFFIX_STRING,
        "reveal_value": MULTIHASH_OF_JWK
      },
      {...}
    ],
    "deactivate": [
      {
        "file_ref": CAS_URI,
        "did_suffix": SUFFIX_STRING,
        "reveal_value": MULTIHASH_OF_JWK
      },
      {...}
    ]
  }
}

file_ref could actually be a JSON pointer CAS_URI

*Feedback

Does not achieve the space saving goals, the same CAS_URI would be repeated per operation

troyronda commented 4 years ago

I think enabling checkpoints and pruning is important, so I think a structure that enables that aspect is useful.

csuwildcat commented 4 years ago

Just want to note that the current file structures already implicitly support the addition of a checkpoint/pruning mechanism. This is about reducing the minimum dataset required to run a light node by ~75+%.

OR13 commented 4 years ago

I'm generally in favor of this proposal, but I'm a bit worried about how we go about implementing it.

Here is my proposal:

We inventory the set of features for which we believe we are shipping support for in spec v1.

We determine what level of testing is required to believe that the feature is supported in spec v1.

We create issues to ensure those tests exist in the reference implementation.

We close those issues when the tests exist.

We publish spec v1 and reference implementation and we bump to v1.1.

We open issues for the core set of features in v1.1 ( probably the same as v1).

We close those issues when we have tests that prove that they work.

We publish spec v1.1 and reference implementation.

Vendors that don't have production customers can choose to skip spec v1, and jump to v1.1... vendors who can't "wipe their production database" can use spec v1, until spec v1.1 is ready to migrate too.

We target SIP-1 to spec v1.1.

OR13 commented 4 years ago

We need to be careful to have a stable, rigorous, and confidence building release process, and versioning system, and I think its dangerously confidence destroying to rewrite versions and refuse to publish, vs choosing to publish regular versions with clear changes, tests and documentation to support each release. (our reference implementation does a good job of this... we need to ensure the spec does as well).

csuwildcat commented 4 years ago

@OR13 how about we cut an official version of the spec, as it stands now, to 0.1.0, and use this change as an opportunity to do a proper minor version bump of the spec in accordance with the version descriptions in the spec.

OR13 commented 4 years ago

I'm fine as long as we cut a version before we attempt to implement a sip. ideally, we try and make it as clean a version as we can, by closing out any low hanging fruit before the cut.

OR13 commented 4 years ago

it can be v0.1.0 and SIP-1 can target v0.2.0 or whatever... features should be planned to target versions...

csuwildcat commented 4 years ago

Aside: are folks here OK if I do a PR to add this general SIP template as a start for that sort of thing? I was thinking to create a SIP directory with MD files in it that would render just like our specs do.

csuwildcat commented 4 years ago

@tplooker I don't think the pointer URI to a place inside the linked file is worth it if we can do the same thing via a 0-byte alternative, given it degrades the primary goal of SIP 1. However, if we changed our mind about, we could always add it later in a way that Sidetree-based implementations could push out via a rather straightforward upgrade.

csuwildcat commented 4 years ago

@troyronda and others: if we don't want to go with Transient, what are some names for the files that will be cyclically eliminated after checkpoint pruning occurs?

tplooker commented 4 years ago

To further optimize the above proposal, we could remove an additional base64 encoding of suffix_data if we instead relied on using JCS to canonicalize the structure

OR13 commented 4 years ago

lets take the encoding performance debate to https://github.com/decentralized-identity/sidetree/issues/781

any tests / proof for the "75%" reduction claim being made here?

csuwildcat commented 4 years ago

@OR13 here's the test: the entries with proving data was 275 bytes, and the new size of the entries without proving data is 65 bytes, which is a reduction of 76.5% in the minimum required for a node to boot up and have a global index of all op entries.

OR13 commented 4 years ago

^ nice test, you must code a lot ; )

troyronda commented 4 years ago

I noticed the new file fields end with _file, but chunk ends with _file_uri e.g., retained_proving_file vs chunk_file_uri

Should the chunk field be chunk_file?

sandrask commented 4 years ago

@csuwildcat Should reveal_value in recover, deactivate and update operations be just the hash of JWK instead of the multihash of the hash of the JWK? This way our check for operation commitment would stay the same: multihash(reveal_value) == operation commitment

thehenrytsai commented 4 years ago

Fully implemented.

decentralized-identity / sidetree