decentralized-identity / confidential-storage

Confidential Storage Specification and Implementation
https://identity.foundation/confidential-storage/
Apache License 2.0

Object de-duplication for storage and replication efficiency #98

Open tplooker opened 3 years ago

tplooker commented 3 years ago

To replicate objects efficiently between EDV instances, a core optimisation would be enabling an EDV to identify duplicate objects. For solutions that deal with unencrypted data this is relatively straightforward using techniques like hash-based content addressing, which is at the core of systems like git. However, when the data being replicated is encrypted, things get more complicated.

Why? For security reasons, most encryption schemes deliberately do not produce deterministic ciphertext for the same payload. In layman's terms, the exact same payload encrypted twice will yield two different ciphertexts. For EDVs this means that if the same object is encrypted on two separate occasions and inserted into two separate EDVs that are connected via replication, the encrypted payload will be persisted on disk twice in both vaults. The vaults will also unnecessarily send a copy of the same encrypted object to one another. The diagram below hopefully captures this.

[Image: Screen Shot 2020-08-14 at 8 21 46 PM]
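
For comparison, the hash-based content addressing mentioned above for unencrypted data can be sketched roughly as follows. This is a minimal illustration, not part of any EDV implementation; `ContentAddressedStore` and its methods are hypothetical names.

```python
import hashlib


class ContentAddressedStore:
    """Toy in-memory store that keys each object by the SHA-256 of its bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # The address depends only on the content, so identical payloads
        # always map to the same key and are stored once.
        address = hashlib.sha256(data).hexdigest()
        self._objects.setdefault(address, data)
        return address

    def __len__(self) -> int:
        return len(self._objects)


store = ContentAddressedStore()
first = store.put(b'{"name": "My Entropy"}')
second = store.put(b'{"name": "My Entropy"}')  # same bytes as the first call
third = store.put(b'{"name": "Something else"}')
```

The same trick does not carry over directly to randomised encryption, because two ciphertexts of the same payload hash to different addresses.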

Potential Solutions

  1. Deterministic Encryption Schemes - Some approaches use Synthetic Initialization Vectors, deriving the IV from an HMAC of the content to be encrypted, which yields a stable ciphertext for the same content under the same key. See https://github.com/AGWA/git-crypt#security as an example.

  2. Using indexes - @OR13 raised this as a possible solution, @OR13 could you elaborate?

  3. Not caring - Essentially opting to live with the cases where duplication will occur and accepting the trade-offs that it creates.

I'm sure there are other possible solutions for this problem. @dlongley, when we spoke briefly about Synthetic Initialization Vectors you raised some good questions around the security of content encryption ciphers when used in this mode.

Another observation w.r.t. a deterministic encryption scheme is that if the content encryption key is rotated (i.e. the object is re-encrypted) then the ciphertext will change anyway.
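
The synthetic IV idea in option 1, and the key rotation caveat, can be sketched as below. This only shows the IV derivation, not a full cipher. The 16-byte IV length, the use of HMAC-SHA-256, and the function names are assumptions made for illustration, and a real scheme should come from a vetted construction rather than this sketch.

```python
import hashlib
import hmac
import os

IV_LENGTH = 16  # bytes; an assumed IV size for illustration


def synthetic_iv(mac_key: bytes, plaintext: bytes) -> bytes:
    """Derive the IV from an HMAC of the plaintext instead of from randomness."""
    return hmac.new(mac_key, plaintext, hashlib.sha256).digest()[:IV_LENGTH]


def random_iv() -> bytes:
    """Conventional approach: a fresh random IV for every encryption."""
    return os.urandom(IV_LENGTH)


payload = b'{"type": "Entropy", "name": "My Entropy"}'
key_v1 = os.urandom(32)
key_v2 = os.urandom(32)  # stands in for a rotated key

# Same key and same content give the same IV, so a cipher that is
# deterministic in (key, IV, plaintext) would emit the same ciphertext.
same_key_a = synthetic_iv(key_v1, payload)
same_key_b = synthetic_iv(key_v1, payload)

# After rotation the IV changes, and so would the ciphertext.
rotated = synthetic_iv(key_v2, payload)
```

Because the IV is a function of the key and the content, an observer can tell when the same content is stored twice under the same key, which is exactly the property that enables de-duplication and also the property that needs security review.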

OR13 commented 3 years ago

If you create an index on a property you use to handle deduplication... you can see if you have duplicates.... for example:

{
 contentIdOfDocContent: '...',
 docContent:{}
}

index on contentIdOfDocContent equals.... select * from edv where contentIdOfDocContent = 'thing I want to make sure there is only one of'... QED.
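
A rough sketch of that lookup-before-insert convention follows. It ignores encryption entirely and uses an in-memory dictionary in place of a real vault. `Vault`, `content_id_of`, and the choice of sorted-key JSON hashed with SHA-256 as the content identifier are all assumptions for illustration.

```python
import hashlib
import json


def content_id_of(doc_content: dict) -> str:
    # Assumption: canonicalise as sorted-key JSON, then hash with SHA-256.
    canonical = json.dumps(doc_content, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class Vault:
    """Toy vault with an equality index on contentIdOfDocContent."""

    def __init__(self):
        self._by_content_id = {}

    def find(self, content_id: str):
        # Equivalent of: select * from edv where contentIdOfDocContent = ...
        return self._by_content_id.get(content_id)

    def insert_if_absent(self, doc_content: dict) -> bool:
        content_id = content_id_of(doc_content)
        if self.find(content_id) is not None:
            return False  # duplicate, nothing stored
        self._by_content_id[content_id] = {
            "contentIdOfDocContent": content_id,
            "docContent": doc_content,
        }
        return True


vault = Vault()
stored_first = vault.insert_if_absent({"name": "My Entropy", "type": "Entropy"})
stored_again = vault.insert_if_absent({"type": "Entropy", "name": "My Entropy"})
```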

OR13 commented 3 years ago

^ this approach implies app level data conventions... vs platform ones... which has both pros and cons.

agropper commented 3 years ago

I was thinking along the same lines as Orie. From a privacy perspective, I presume that any index other than those intended to support scoped access into a resource would be in a separate secure data store with separate access control. I see little benefit to co-locating metadata with the data itself and many impediments to privacy by default.

By this logic, metadata that drives replication and deduplication could be in a separate secure data store. Synchronization, as in backup and restore, might use a different mechanism, tied primarily to what we’re calling the client rather than an index.


OR13 commented 3 years ago

@agropper the indexes are managed by the client today, and they are encrypted, here is an example

Encrypted

{
  "id": "z1A4JXqwz4wM1L3RrKG341Fou",
  "edvId": "6301871c-ae31-4d00-a4e3-73854baadfb7",
  "sequence": 0,
  "indexed": [
    {
      "hmac": {
        "id": "urn:digest:57965c6f3f0ce806884e2cfa7d539e4046b1e1c719b2510d947d3c2b57b4388a",
        "type": "Sha256HmacKey2019"
      },
      "sequence": 0,
      "attributes": [
        {
          "name": "2TBum3bOxJ3pREZiY5LD7UQMviNmMK-ow6Qn2IHmUSA",
          "value": "7d2PciAP7QFmJkZnLnGQll3lI6kiEgY7-8paGRZ3C1c",
          "unique": false
        },
        {
          "name": "3TCmOiJGFnsXlAaPjoZnt2mIPIVq6xiSeVJropbYDwU",
          "value": "dqe8AR_VPtd2vzwbNQRA3G7ya05DSHCQEMHylgOiB40",
          "unique": true
        }
      ]
    }
  ],
  "jwe": {
    "protected": "eyJlbmMiOiJYQzIwUCJ9",
    "recipients": [
      {
        "header": {
          "kid": "did:key:z6Mkf8unjmyqsnDtZAjZkdNhw3LZWm5x9u3bbHCEdenD1Agq#z6LShX3PmBwYHGh8JL82zm3x8uT3bWEbLmfos66McREoEfvo",
          "alg": "ECDH-ES+A256KW",
          "epk": {
            "kty": "OKP",
            "crv": "X25519",
            "x": "pK5QE4-dwpPdjejlB3VERU9XCy1t4xfa-JNUDVa9iVs"
          },
          "apu": "pK5QE4-dwpPdjejlB3VERU9XCy1t4xfa-JNUDVa9iVs",
          "apv": "ZGlkOmtleTp6Nk1rZjh1bmpteXFzbkR0WkFqWmtkTmh3M0xaV201eDl1M2JiSENFZGVuRDFBZ3EjejZMU2hYM1BtQndZSEdoOEpMODJ6bTN4OHVUM2JXRWJMbWZvczY2TWNSRW9FZnZv"
        },
        "encrypted_key": "DwOEbW0OvtnQaqL4gc6_9Za1vzHrrLptI_UsPsGWFoBlUcASWP5qWQ"
      }
    ],
    "iv": "Et_yCe5BAWtSiAm2H3GEh192zNQiNA4d",
    "ciphertext": "WC9zeH_Q90Z34VvX7Vsb2nK42qjZch2n-x2RweSjmVyVOxu__yAY870u5sRtaOjTPSNtxKoxHFNTbsVW2M5vlXTPStNtxdcGK8s2qPI_diR8E3E3pzqKr8iShZ2c3wuywILcgZWrdYlmzW9tcdBjLAnBdxbWdhqxwZNKLIu-11edpXA0KOra8qhK55mI8k_WUDTudV1w7aYVPFtngCwNy1hN4JsAGm1_NtB_WpXtua10oQ-PpP6d18i7c3jYCMZ56oaGCn5I1hf3yCO2OKgVJhxCsA2LzAu9gKxSm9ZPjhqrK5iRXUaE4lLWZNahgf_MRiNn5MDp7sN0GJ4IJFTs2On0_W6llwWgttkNiqtcsx48PiwlKgO2oimB0L7Y-bVpcinCpfDCK-UG6FGKaw7f1HsjWo4uthHdnCOm_Hw8dsSc7IPh0cORg4qbtAS4l_HDbPQroMlJIuLeOZqwMT55Ux32f3IfeVP5_1qitnigamOgHsfjuAV6ttKEgsEiDoAqa7kOQy_pB5jXkkJ57FURfKSG__hkbzm2L88djfaDAFFAz-7W0LvaEM4Dwew_-kAnoDJCBPa5MPG4W7MpXZhiafIuZsaD_Xk9OprHxFV_nXU8ztl0NKoc_H3Qg3l1D00wJQI_TWPqRfSqc5qHyRrh_TLRuTXpK2I2Hh3v-N_HrNotWN8p-McFnaV3cRtOMvLq44kF_X4_NPH8s7wYQ4yFkd2ffiFD",
    "tag": "TSiAPqHT6t0wT1rWJppicQ"
  }
}

Decrypted

{
  "id": "z1A4JXqwz4wM1L3RrKG341Fou",
  "edvId": "6301871c-ae31-4d00-a4e3-73854baadfb7",
  "sequence": 0,
  "content": {
    "schema": "https://schema.org/UniversalWallet",
    "data": {
      "@context": [
        "https://transmute-industries.github.io/universal-wallet/contexts/wallet-v1.json"
      ],
      "id": "urn:digest:57965c6f3f0ce806884e2cfa7d539e4046b1e1c719b2510d947d3c2b57b4388a",
      "name": "My Entropy",
      "image": "https://via.placeholder.com/150",
      "description": "For testing only.",
      "tags": [
        "inception"
      ],
      "correlation": [
        "urn:digest:57965c6f3f0ce806884e2cfa7d539e4046b1e1c719b2510d947d3c2b57b4388a"
      ],
      "type": "Entropy",
      "value": "b048e065450ddf6e5fe5db9fe0cd48e5215237e8cff0bcffc3eb0e0d7727c584"
    }
  }
}

In this example, the index is built on "schema": "https://schema.org/UniversalWallet", so that wallets can use an EDV client to easily get all the content that is for them in a given vault. There is no need to split the index / metadata from the vault document; see the encrypted representation. Even if the same resource were uploaded by 2 different parties, the storage provider would not know, because the indexes are built with keyed hash functions and the keys are unique per client. This design is privacy by default.
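
To illustrate why per-client keyed hashing hides duplicates from the storage provider, here is a minimal sketch of blinding an index attribute with HMAC-SHA-256 and base64url encoding. The exact inputs that real EDV clients feed to the HMAC may differ; `blind` and `blinded_attribute` are hypothetical helpers.

```python
import base64
import hashlib
import hmac
import os


def blind(hmac_key: bytes, text: str) -> str:
    """HMAC-SHA-256 of the text, encoded as unpadded base64url."""
    digest = hmac.new(hmac_key, text.encode("utf-8"), hashlib.sha256).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def blinded_attribute(hmac_key: bytes, name: str, value: str, unique: bool = False) -> dict:
    # The server only ever sees these opaque strings, never name or value.
    return {
        "name": blind(hmac_key, name),
        "value": blind(hmac_key, value),
        "unique": unique,
    }


alice_key = os.urandom(32)
bob_key = os.urandom(32)

schema = "https://schema.org/UniversalWallet"
alice_attr = blinded_attribute(alice_key, "content.schema", schema)
alice_again = blinded_attribute(alice_key, "content.schema", schema)
bob_attr = blinded_attribute(bob_key, "content.schema", schema)
```

The same client can query its own index because its blinded values are stable, while two clients indexing the same cleartext produce unrelated strings.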

dlongley commented 3 years ago

@OR13 -- Side note regarding the data model above: Did you consider putting schema in meta at the top-level and removing the need for data like this?:

{
  "id": "z1A4JXqwz4wM1L3RrKG341Fou",
  "edvId": "6301871c-ae31-4d00-a4e3-73854baadfb7",
  "sequence": 0,
  "meta": {
    "schema": "https://schema.org/UniversalWallet"
  },
  "content": {
    "@context": [
      "https://transmute-industries.github.io/universal-wallet/contexts/wallet-v1.json"
    ],
    "id": "urn:digest:57965c6f3f0ce806884e2cfa7d539e4046b1e1c719b2510d947d3c2b57b4388a",
    "name": "My Entropy",
    "image": "https://via.placeholder.com/150",
    "description": "For testing only.",
    "tags": [
      "inception"
    ],
    "correlation": [
      "urn:digest:57965c6f3f0ce806884e2cfa7d539e4046b1e1c719b2510d947d3c2b57b4388a"
    ],
    "type": "Entropy",
    "value": "b048e065450ddf6e5fe5db9fe0cd48e5215237e8cff0bcffc3eb0e0d7727c584"
  }
}

agropper commented 3 years ago

@OR13 - I have a negligible understanding of encrypted index technology, but I am completely happy to stipulate that encrypted indexes are of great value for many use-cases. I am not arguing against encrypted indexes.

What I am proposing is that indexes, whether encrypted, unencrypted, or hybrid, SHOULD be hosted separately from the secure data stores and should have a separate access control mechanism. For example, I can think of numerous use-cases where one secure data store might want to be indexed in 20 different places. Imagine a cancer patient with their tumor genome and health record and 20 different research communities around the world, each with their own grants and Institutional Review Board to satisfy.

I'm glad we are working on encrypted data vaults for reasons that Datashards and others articulate quite well. The same reasons probably apply to the associated indexes. I just don't see the value in presuming the two are co-located under the same access control domain.


tplooker commented 3 years ago

@OR13 just so I am clear, you are proposing to create an encrypted index that's essentially a hash of the schema attribute in a document. Does this mean I could only store one instance of a https://schema.org/UniversalWallet object?

+1 to @dlongley's tweaks to your example

dlongley commented 3 years ago

@tplooker,

just so I am clear, you are proposing to create an encrypted index that's essentially a hash of the schema attribute in a document. Does this mean I could only store one instance of a https://schema.org/UniversalWallet object?

I believe what he's showing isn't a proposal but is actually a presently functional implementation. But that aside -- the encrypted indexes in EDVs today use HMACs on cleartext attributes and values. What you're seeing above in the encrypted document is the output of that process (and is what the server sees). These indexes can be unique or non-unique (and compound or simple). With respect to the schema, that would be in a non-unique index. But you could have an identifier in your data that is unique, and the server can use that to ensure there is only one copy of it within a given EDV. You may want to model that unique index as a compound index with schema + the unique ID field depending on your use case.
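
The unique versus non-unique distinction described above might look roughly like this from the server's side. This is a sketch under assumptions: the attribute names, the JSON-array encoding of the compound value, and the `ServerIndex` class are invented for illustration and are not taken from an EDV implementation.

```python
import base64
import hashlib
import hmac
import json
import os


def blind(hmac_key: bytes, text: str) -> str:
    digest = hmac.new(hmac_key, text.encode("utf-8"), hashlib.sha256).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


class UniqueConstraintError(Exception):
    """A unique blinded attribute is already held by another document."""


class ServerIndex:
    """Server-side view: it compares opaque strings and never sees cleartext."""

    def __init__(self):
        self._unique = {}  # (blinded name, blinded value) -> document id

    def insert(self, doc_id: str, attributes: list) -> None:
        unique_keys = [(a["name"], a["value"]) for a in attributes if a["unique"]]
        for key in unique_keys:
            if self._unique.get(key, doc_id) != doc_id:
                raise UniqueConstraintError(doc_id)
        for key in unique_keys:
            self._unique[key] = doc_id


def index_entries(hmac_key: bytes, schema: str, content_id: str) -> list:
    # Hypothetical compound value: schema and id encoded together.
    compound = json.dumps([schema, content_id])
    return [
        {"name": blind(hmac_key, "schema"), "value": blind(hmac_key, schema), "unique": False},
        {"name": blind(hmac_key, "schema+id"), "value": blind(hmac_key, compound), "unique": True},
    ]


key = os.urandom(32)
server = ServerIndex()
schema = "https://schema.org/UniversalWallet"
server.insert("doc-1", index_entries(key, schema, "urn:digest:aaa"))
server.insert("doc-2", index_entries(key, schema, "urn:digest:bbb"))  # same schema is fine
try:
    server.insert("doc-3", index_entries(key, schema, "urn:digest:aaa"))
    duplicate_rejected = False
except UniqueConstraintError:
    duplicate_rejected = True
```

Many documents can share the non-unique schema attribute, while the unique compound attribute lets the server refuse a second copy without learning what the identifier is.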

OR13 commented 3 years ago

@dlongley +1 to your proposal, I will go one further:

{
  "id": "z1A4JXqwz4wM1L3RrKG341Fou",
  "edvId": "6301871c-ae31-4d00-a4e3-73854baadfb7",
  "sequence": 0,
  "meta": {
    "schema": "https://schema.org/UniversalWallet",
    "CID": "...",
    "VECTOR_CLOCK": "..."
  }, 
  "content": {
    "@context": [
      "https://transmute-industries.github.io/universal-wallet/contexts/wallet-v1.json"
    ],
    "id": "urn:digest:57965c6f3f0ce806884e2cfa7d539e4046b1e1c719b2510d947d3c2b57b4388a",
    "name": "My Entropy",
    "image": "https://via.placeholder.com/150",
    "description": "For testing only.",
    "tags": [
      "inception"
    ],
    "correlation": [
      "urn:digest:57965c6f3f0ce806884e2cfa7d539e4046b1e1c719b2510d947d3c2b57b4388a"
    ],
    "type": "Entropy",
    "value": "b048e065450ddf6e5fe5db9fe0cd48e5215237e8cff0bcffc3eb0e0d7727c584"
  }
}

Where:

    "CID": "...",
    "VECTOR_CLOCK": "..."

Might also be exposed in the JWE header for hubs.... but where they can remain private for EDVs....

OR13 commented 3 years ago

Relevant to JOSE / Content ID... https://github.com/ceramicnetwork/CIP/issues/59#issuecomment-674226615

dlongley commented 3 years ago

See this http://pl.atyp.us/wordpress/index.php/2010/03/conflict-resolution/ about how just having the vector clock (without the rest of the update/patch information) may not be sufficient for many use cases. I'm not sure how much "..." is meant to cover for the value of "VECTOR_CLOCK" in the example above. We may have to accept limitations of this approach given how much/little information we can express in the clear. I didn't fully understand this:

Might also be exposed in the JWE header for hubs.... but where they can remain private for EDVs....

I'm not quite sure what "private for EDVs" means. If something is exposed in the JWE headers -- how does it "remain private"? Private from whom? Could you elaborate a bit? I couldn't follow this last bit.
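
On the vector clock point, a small sketch may help show the limitation: comparing two clocks can tell you that two updates are concurrent, but not how to merge them. The dictionary representation and the `compare` function are assumptions for illustration, since the thread leaves the value of "VECTOR_CLOCK" unspecified.

```python
def compare(a: dict, b: dict) -> str:
    """Compare two vector clocks given as {node_id: counter} mappings."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a happened before b
    if b_le_a:
        return "after"       # a happened after b
    return "concurrent"      # neither dominates: a conflict to resolve elsewhere


ordered = compare({"vault-1": 1}, {"vault-1": 2})
conflict = compare({"vault-1": 2, "vault-2": 0}, {"vault-1": 1, "vault-2": 1})
```

In the "concurrent" case the clocks alone carry no information about what changed, which is why additional update or patch data may be needed.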

OR13 commented 3 years ago

@dlongley the question is whether hubs need plaintext metadata or not... if they do, it can only go in the JWE header... the meta and content properties are encrypted, and so are indexes on them... see https://github.com/decentralized-identity/secure-data-store/issues/97
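
To make the "plaintext in the JWE header" point concrete: the protected header is only base64url-encoded JSON, so anyone holding the encrypted document can read it. The sketch below decodes the `protected` value from the encrypted example earlier in this thread; `decode_protected_header` is a hypothetical helper name.

```python
import base64
import json


def decode_protected_header(protected: str) -> dict:
    """Decode a JWE protected header; no key is needed to do this."""
    padding = "=" * (-len(protected) % 4)  # restore stripped base64 padding
    raw = base64.urlsafe_b64decode(protected + padding)
    return json.loads(raw)


header = decode_protected_header("eyJlbmMiOiJYQzIwUCJ9")
print(header)  # {'enc': 'XC20P'}
```

Anything added to that header, such as a content identifier, would be just as readable to the storage provider, unlike the meta and content properties inside the ciphertext.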