decentralized-identity / confidential-storage

Confidential Storage Specification and Implementation
https://identity.foundation/confidential-storage/
Apache License 2.0

Add replication related terminology definitions. #94

Closed dmitrizagidulin closed 4 years ago

dmitrizagidulin commented 4 years ago

Partly addresses issue #21, will help support replication related use cases.

dmitrizagidulin commented 4 years ago

I'm concerned that we're introducing too much terminology here.

Always a valid concern. Although I would certainly ask - when was the last time you as a developer complained that a spec had too much terminology definition, versus not enough? I only hear complaints about stuff left undefined.

But anyways, would it help if I removed the standalone definition of unidirectional vs bidirectional, and full vs realtime sync, and just pulled those explanations into the overall definition of replication?

msporny commented 4 years ago

Although I would certainly ask - when was the last time you as a developer complained that a spec had too much terminology definition?

All the time!!! I actively scan specs for terminology that we can eliminate. As a general rule, if we use the term less than 5 times in the spec, it needs to be eliminated. :)

Look at JSON-LD 1.1's terminology list as an example of "too much":

https://www.w3.org/TR/json-ld11/#json-ld-specific-term-definitions

Compare with HTML5, which is way more complex but contains fewer terms:

https://www.w3.org/TR/html52/infrastructure.html#infrastructure-terminology

and also compare to RDF (which does a pretty good job of limiting terminology used):

https://www.w3.org/TR/rdf11-concepts/#section-rdf-graph

We should have just enough to make the spec work, nothing more. Boo to defining terminology that's only used in a tiny subsection.
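The "used less than 5 times" rule of thumb above is easy to check mechanically. Here is a minimal Python sketch of such a check; the function names, the sample text, and the matching rule (case-insensitive whole-phrase matches) are assumptions for illustration, not part of any spec tooling.

```python
import re
from collections import Counter


def count_term_uses(spec_text, terms):
    """Count case-insensitive whole-phrase occurrences of each term."""
    counts = Counter()
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[term] = len(re.findall(pattern, spec_text, flags=re.IGNORECASE))
    return counts


def rarely_used(spec_text, terms, threshold=5):
    """Return the terms that appear fewer than `threshold` times, sorted."""
    counts = count_term_uses(spec_text, terms)
    return sorted(t for t in terms if counts[t] < threshold)


# Hypothetical sample text; a real run would read the spec source instead.
sample = (
    "Replication copies documents. Replication is one-way here. "
    "Synchronization resolves edit conflicts."
)
print(rarely_used(sample, ["replication", "synchronization"], threshold=2))
```

Note that the count includes the definition itself, so a term that is defined once and never used again still scores 1.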

But anyways, would it help if I removed the standalone definition of unidirectional vs bidirectional, and full vs realtime sync, and just pulled those explanations into the overall definition of replication?

Yes, that would be a start.

dmitrizagidulin commented 4 years ago

@msporny

All the time!!! I actively scan specs for terminology that we can eliminate.

Heh, that's an answer from a spec writer's perspective, not a developer's. I have never heard developers complain about too much spec. Only not enough.

msporny commented 4 years ago

Heh, that's an answer from a spec writer's perspective, not a developer's. I have never heard developers complain about too much spec. Only not enough.

No, this is from a developer's perspective. As a developer, I don't like entering into ecosystems that:

  1. Have their own tribal language that you have to learn, or even worse,
  2. Repurpose common English definitions with subtly different technical definitions (e.g. "public key") that, when misunderstood, cause damage to the ecosystem.

So, from a developer perspective... explain it like I'm a freshman in college, and if you can't do that, you don't have your act together and haven't really figured out how to explain your technology to the masses.

dmitrizagidulin commented 4 years ago

@msporny Ok -- having only clear explanations in the spec is a laudable goal, I can totally get behind that!

Have their own tribal language that you have to learn, or even worse, Repurposes common English definitions with subtly different technical definitions

Ok, so this is definitely a red flag. For one, you agreed to that definition of synchronization, on the editors call. (Not that it's a binding contract, I'm just curious what changed.)

Secondly, and much more importantly, do you feel that we're using either super technical definitions (our language says stuff like 'copying' and 'resolving edit conflicts' -- as basic as you can get), or overriding existing technical terms?

msporny commented 4 years ago

Ok, so this is definitely a red flag. For one, you agreed to that definition of synchronization, on the editors call. (Not that it's a binding contract, I'm just curious what changed.)

Ah! You think I'm arguing against the PR... I'm not -- merge it.

I started off by saying "I'm concerned...", which is not "I object" :)

I'm hand wringing in general, over an opportunity to reduce the number of terms in the terminology section. Maybe we can't do it and it's best for all 3 terms to exist. I doubt it, but am willing to defer to you and see where it goes. :)

Secondly, and much more importantly, do you feel that we're using either super technical definitions (our language says stuff like 'copying' and 'resolving edit conflicts' -- as basic as you can get), or overriding existing technical terms?

This was just me countering your assertion that I was speaking from a spec writer's perspective rather than a developer's. It doesn't have a ton of relevance to the terminology we're using. It may have relevance if we choose to define synchronization as something like: "Using a CRDT data structure to ensure..." <-- that's overloading the English language to make the word synchronization mean something different from what most folks reading the word would understand.

We're good here, merge if you feel that 3 terms is better than 2 at this point.

agropper commented 4 years ago

I like the perspective on conflict resolution and hope to understand it in terms of SSI. Conflict can arise from:

From a user perspective, the distinction between documents and resources seems unhelpful unless I define documents to be strictly immutable as in an executed contract or a film. Documents can then be either encrypted or not and replication is just about reliability and read-only access.

Adrian

On Fri, Aug 7, 2020 at 11:24 AM Orie Steele notifications@github.com wrote:

@OR13 requested changes on this pull request.

I object to this PR in absence of language referring to the existing data models which are known.

There is no concept of "conflict resolution" because there is no data model for encoding deltas in documents.

In the interest of merging this as WIP, I suggest we note that Documents and "Resources" can be replicated and synchronized, and we attempt to define Resource in this PR, to avoid perpetuating confusion over hub data model vs edv data model.


OR13 commented 4 years ago

@agropper the distinction between documents and resources is fundamental to understanding Hubs vs EDVs... and there are security and privacy concerns related to the distinction.

EDV Concepts

  - A Document has a known data model: https://identity.foundation/secure-data-store/#structureddocument
  - A Document has a known authorization framework: ZCAP + HTTP Signatures (with the possibility to support others via HTTP Headers....)
  - A Document exposes no meta data to the storage provider.

Hub Concepts

  - A Resource has no known data model....
  - A Resource is made of CRDT deltas (which also have no known structure).
  - A Resource has no known authorization framework.
  - A Resource has no known interfaces.
  - A Resource MAY expose meta data related to the CRDT functioning that MAY trade privacy for performance....

WG Proposals

As a WG we want to define replication for EDVs... personal opinion... we don't care to describe "synchronization" for EDVs...

As a WG we want to define "synchronization" for Hubs... and again, I think hubs don't actually care about "replication" of encrypted data... since they are privileged, in the sense that they can see some plaintext and are liable under GDPR because of this.... so again, synchronization has several differences in privacy and security concerns than "replication"... this PR starts us down the path of defining those differences.
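To make the contrast above concrete, here is a minimal Python sketch, assuming the plain-language reading used earlier in this thread: replication is copying, synchronization involves resolving edit conflicts. Everything in it is hypothetical (the `StoredDoc` type, the `sequence` counter, both function names); neither the spec nor this PR defines such an interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredDoc:
    doc_id: str
    sequence: int   # hypothetical monotonically increasing version counter
    payload: bytes  # opaque to the store (for example, ciphertext)


def replicate(source, target):
    """One-way copy: target receives every doc it is missing or holds an
    older version of. Payloads are never inspected."""
    copied = []
    for doc_id, doc in source.items():
        existing = target.get(doc_id)
        if existing is None or existing.sequence < doc.sequence:
            target[doc_id] = doc
            copied.append(doc_id)
    return copied


def synchronize(a, b, resolve):
    """Two-way merge: when both sides hold different versions of a doc,
    resolve(doc_a, doc_b) picks or builds the winner. Any resolver smarter
    than the one below needs to understand the content or its deltas."""
    for doc_id in set(a) | set(b):
        doc_a, doc_b = a.get(doc_id), b.get(doc_id)
        if doc_a is None or doc_b is None:
            winner = doc_b if doc_a is None else doc_a
        elif doc_a == doc_b:
            continue
        else:
            winner = resolve(doc_a, doc_b)
        a[doc_id] = b[doc_id] = winner


def last_writer_wins(doc_a, doc_b):
    """Simplest possible resolver: highest sequence wins, ties broken by payload."""
    return max(doc_a, doc_b, key=lambda d: (d.sequence, d.payload))


primary = {"d1": StoredDoc("d1", 2, b"v2"), "d2": StoredDoc("d2", 1, b"x")}
backup = {"d1": StoredDoc("d1", 1, b"v1")}
print(sorted(replicate(primary, backup)))  # ['d1', 'd2']
```

The design point is that `replicate` works on opaque payloads, while `synchronize` needs a `resolve` step, which is where a data model for deltas (and possibly access to plaintext or metadata) comes in.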

msporny commented 4 years ago

As a WG we want to define replication for EDVs... we don't care to describe "synchronization" for EDVs.

+1

OR13 commented 4 years ago

Here is the beginning of a proposal that would allow us to formally describe a "Resource": https://github.com/decentralized-identity/secure-data-store/issues/97

agropper commented 4 years ago

On Fri, Aug 7, 2020 at 12:30 PM Orie Steele notifications@github.com wrote:

@agropper the distinction between documents and resources is fundamental to understanding Hubs vs EDVs... and there are security and privacy concerns related to the distinction.

EDV Concepts

  - A Document has a known data model: https://identity.foundation/secure-data-store/#structureddocument
  - A Document has a known authorization framework: ZCAP + HTTP Signatures (with the possibility to support others via HTTP Headers....)
  - A Document exposes no meta data to the storage provider.

This I can maybe understand. It describes a policy enforcement point (PEP). As @Manu says, it benefits from replication. Calling it an EDV is a matter of taste, or bikeshedding, but the concepts seem clear except for the "no metadata" point.

In the absence of metadata, policy enforcement is restricted. I can understand why blinding the PEP to the contents may be useful (security, censorship, liability), but the inclusion of metadata is a choice, and some metadata, such as an index into the data model or the last time replication succeeded, could be useful for policy enforcement, billing, etc... without compromising security, censorship, or liability.

Hub Concepts

  - A Resource has no known data model....
  - A Resource is made of CRDT deltas (which also have no known structure).
  - A Resource has no known authorization framework.
  - A Resource has no known interfaces.
  - A Resource MAY expose meta data related to the CRDT functioning that MAY trade privacy for performance....

This is totally confusing to me. A resource could be storage or agent, in the sense that either can benefit from access to plaintext and metadata. An agent is just policy storage + policy decision code designed to control access to a document (as above).

I don't understand what anything with "no known authorization framework" is. Can you give an example?

I don't understand why we're talking about "no known interfaces". Are you using "known" in the biblical sense?

CRDT has value. So does separation of PEP from PDP. Per issue #97, I control an agent acting as a PDP and it can synchronize using CRDT. The result is a modified document or a new document in one or another PEPoints.

The metadata issue is confusing. It seems to me that metadata is essential to the PDP and helpful to the PEP. From a privacy perspective, as long as the PDP controls access to the metadata we're set. Of course, I may be misunderstanding the definition of metadata in this context.
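A minimal Python sketch of the PEP/PDP separation described above: the enforcement point stores opaque blobs and metadata, and defers every access decision, including access to metadata, to a separate decision point. All class, method, and action names here are hypothetical; nothing in the spec or this thread defines them.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class PolicyDecisionPoint:
    """Hypothetical PDP: policy storage plus decision code."""
    # doc_id -> requester -> allowed actions
    grants: Dict[str, Dict[str, Set[str]]] = field(default_factory=dict)

    def decide(self, requester: str, action: str, doc_id: str) -> bool:
        return action in self.grants.get(doc_id, {}).get(requester, set())


@dataclass
class PolicyEnforcementPoint:
    """Hypothetical PEP: holds opaque data and enforces the PDP's decisions."""
    pdp: PolicyDecisionPoint
    blobs: Dict[str, bytes] = field(default_factory=dict)    # ciphertext only
    metadata: Dict[str, dict] = field(default_factory=dict)  # e.g. last replication time

    def read(self, requester: str, doc_id: str) -> bytes:
        if not self.pdp.decide(requester, "read", doc_id):
            raise PermissionError(f"{requester} may not read {doc_id}")
        return self.blobs[doc_id]

    def read_metadata(self, requester: str, doc_id: str) -> dict:
        # Metadata access is a separate action, so the PDP controls it too.
        if not self.pdp.decide(requester, "read-metadata", doc_id):
            raise PermissionError(f"{requester} may not read metadata of {doc_id}")
        return self.metadata[doc_id]


pdp = PolicyDecisionPoint(grants={"doc1": {"alice": {"read"}}})
pep = PolicyEnforcementPoint(
    pdp=pdp,
    blobs={"doc1": b"ciphertext"},
    metadata={"doc1": {"last_replicated": "2020-08-07"}},
)
print(pep.read("alice", "doc1"))  # b'ciphertext'
```

In this sketch "alice" can fetch the encrypted blob but not its metadata, because the PDP has not granted the separate "read-metadata" action.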

WG Proposals

As a WG we want to define replication for EDVs... personal opinion... we don't care to describe "synchronization" for EDVs...

As a WG we want to define "synchronization" for Hubs... and again, I think hubs don't actually care about "replication" of encrypted data... since they are privileged, in the sense that they can see some plaintext and are liable under GDPR because of this.... so again, synchronization has several differences in privacy and security concerns than "replication"... this PR starts us down the path of defining those differences.


OR13 commented 4 years ago

@agropper I think your point is that hubs are not well documented... that's my point as well.

When I say no "known interface" I mean the answer to any question about the behavior from me will be: "undefined / what do you want it to be ; )" ?

So in short, I can't easily answer your questions, because for Hubs I don't know the answers.

I think your clarification regarding PDP and PEP and meta data is exceedingly helpful... let me try and put it into the terminology we are attempting to define....

Replication meta data will exist for EDVs and Hubs... we don't know where exactly, but the Vault Config / Document Objects have places where we could put it for EDVs... Hubs don't have any defined objects currently.

Synchronization meta data might not get used by EDVs, it will get used by Hubs... and it might go in the Vault Config / Document Object...

Authorization meta data will exist for EDVs and Hubs... we know where it goes in Vault Configs / Documents (ZCAPs / keyAgreement / JWE recipients).... we don't know how this will work for hubs....

Both replication and synchronization require some concept of "authorization"... so it's dangerous to describe them without considering this... and especially dangerous to describe them given how undefined the hubs side of the spec is.... what we are trying to work towards is terminology that can cover the conceptual desires described by hub use cases, and then to make more concrete proposals for how to achieve them... one such technical proposal is here: https://github.com/decentralized-identity/secure-data-store/issues/97

OR13 commented 4 years ago

@dmitrizagidulin any last-minute changes you want to approve before we merge this?

dmitrizagidulin commented 4 years ago

@OR13 let's merge it; I'll prep the next PR to incorporate some of the feedback from this one, next.