decentralized-identity / confidential-storage

Confidential Storage Specification and Implementation
https://identity.foundation/confidential-storage/
Apache License 2.0

Add application-level use cases to drive Layer conversation #60

Closed: csuwildcat closed this issue 1 year ago

csuwildcat commented 4 years ago

Proposed expansion of application-level use cases:

Alice wants to publish items for sale to her SDS that anyone can view

Alice is moving and wants to sell some of her stuff. She uses an app to enter items, including pictures, descriptions, and prices, which she posts as objects in her personal datastore. These objects intentionally expose their semantic type, such as Offers, as well as their title and all the other details she added to them that possible buyers would want to know.

Bob wants to buy some items, some of which Alice could be selling. He can find Alice's items in a few different ways:

  1. Bob can learn of Alice's specific DID, and can find all of the items she is selling by using the nifty Dansdecentralist app (it's like Craigslist, but totes cypherpunk), which sends a GET query to fetch all the publicly exposed Offers Alice has published. Bob's app then uses the Offers Alice's SDS instance returned to display a for-sale item list so Bob can see her wares.
  2. Countless apps like Dansdecentralist, dappyBay, etc. maintain a list of DIDs they periodically resolve for SDS instance URIs and crawl to check for new items they may have posted (or are made aware quickly via a crawl request from the DID owner). These apps try to build an index specific to Offers by crawling the open DID/SDS object web. This way the apps can present a searchable corpus of all the current offers they find when Bob opens the app and just wants to browse a bunch of different items people all over the world are offering for sale.
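The fetch-and-filter pattern in step 1 might look something like the following sketch. Everything here is an assumption for illustration: the `SdsObject` shape, the field names, and the `Offer` type value are not part of any spec, and in a real app the object list would come from a GET request to Alice's SDS instance rather than a local array.

```typescript
// Hypothetical shape of objects an SDS instance might expose publicly.
// Field names (type, title, price) are illustrative assumptions.
interface SdsObject {
  id: string;
  type: string;           // semantic type, e.g. "Offer"
  title?: string;
  price?: number;
}

// Client-side step: given everything Alice's SDS exposed publicly,
// keep only the Offers so the app can render a for-sale list.
function selectOffers(objects: SdsObject[]): SdsObject[] {
  return objects.filter((o) => o.type === "Offer");
}

// In a real app this list would come from something like
//   GET https://sds.alice.example/objects?type=Offer
// (URL and query parameter are invented for illustration).
const publicObjects: SdsObject[] = [
  { id: "1", type: "Offer", title: "Used couch", price: 40 },
  { id: "2", type: "SocialMediaPosting", title: "Moving day!" },
  { id: "3", type: "Offer", title: "Bookshelf", price: 15 },
];

const offers = selectOffers(publicObjects);
console.log(offers.map((o) => o.title)); // ["Used couch", "Bookshelf"]
```

The crawling apps in step 2 would run the same filter over objects gathered from many DIDs' SDS instances, accumulating the results into a searchable index.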

Alice wants to publish short social messages to her SDS that anyone can view

Alice loves to share terse messages that convey her thoughts and ideas to the world. She uses an app to craft amazing 280 character nuggets of pure knowledge-gold, which she posts as objects in her personal datastore. These objects intentionally expose their semantic type, such as SocialMediaPosting, which includes all the data that you would expect to see in legacy, centralized, 280-character social media posts.

Bob wants to view some short social posts from people around the world, some of which could be Alice's. He can find Alice's posts in a few different ways:

  1. Bob can learn of Alice's specific DID, and can find all of the short social posts she is sharing with the world by using CockADapperDoo (it's like Twitter, but, like, decentralized, man (read with The Dude's voice)), which sends a GET query to fetch all the SocialMediaPostings Alice has published to her SDS. Bob's app then uses the SocialMediaPostings Alice's SDS instance returned to display a list of her tweets he can read. Bob can then follow her, wherein the app retains a reference to her DID and periodically looks for new short social posts she publishes.
  2. Countless apps like CockADapperDoo, FreeBird, etc. maintain a list of DIDs they periodically resolve for SDS instance URIs and crawl to check for new short social posts they may have (or are made aware of). These apps try to build an index specific to SocialMediaPostings by crawling the open DID/SDS object web. This way the apps can present browsable feeds of the social posts that anyone can view to see what folks all across the world are talking about.
csuwildcat commented 4 years ago

^ you'll notice both of the use cases above exhibit the same pattern, which I could repeat for the vast majority of common app use cases in the world. Hopefully, if we're successful, we create a substrate for a fundamentally new class of decentralized application.

bumblefudge commented 4 years ago

I very much welcome this use case -- particularly if some space travelers from the ActivityPub metaverse have feedback on how FreeMammoth differs (if at all) in its topological assumptions :D

csuwildcat commented 4 years ago

@bumblefudge there are quite a few differences beyond the same basic idea that data has types and can be fetched. Most of these differences are related to ActivityPub's fundamental structure, for example: it's not a masterless system, it doesn't feature a convergent, conflict-free replicated data model, it doesn't do active-active replication of data objects, and it lacks support for permissioned access based on multi-recipient encryption.

csuwildcat commented 4 years ago

By the way: 3 years ago at an RWoT I asked the ActivityPub lead if he would consider rearchitecting the protocol to include masterless multi-instance operation with active-active replication of CRDT data and permissioned multi-recipient encryption + DID support, but the guy blew me off, so I stopped caring about something that fundamentally is not structured to achieve what I am interested in long-term.

agropper commented 4 years ago

The value of these two use-cases is clear. I look forward to how they inform our layers discussion.

Apps that maintain lists of subjects' DIDs and then crawl metadata are not obviously decentralized. "Bob can learn of Alice's specific DID" is unclear. Is Bob using some kind of crawler across all possible DIDs?

However, I do see an impact on our layering discussion if we separate the data store from the metadata store through a standard protocol. This assumes that Alice's data mostly stays in one place (ideally, the place that created it provides storage for a while, the way the Costco VHS-to-DVD service automatically stores a copy for me to stream).

The metadata however can be shared by Alice with various directories, social networks, and other aggregators that compete for that privilege. Alice benefits if the aggregators compete and if they are separate from the data store itself.

Alice also benefits if her access control service is separate from both the metadata aggregators and the data store. Not only is there more competition via our standards, but also Alice's privacy is protected if the storage operator is blind to all of the unsuccessful requests for the metadata. This is simple data minimization. The storage provider only knows of successful Bob requests.

Further data minimization occurs if the storage provider doesn't even know about the successful Bobs. In that case, Alice uses a proxy. The layering and standards should protect Alice's ability to select her proxy.

In summary, these use-cases suggest that Alice should be able to choose four separate roles:

  1. the secure data store (storage provider) itself
  2. the metadata aggregator(s)
  3. the access control / authorization service
  4. the proxy that shields request traffic from the storage provider

Decentralization, separation of concerns, and privacy are joined at the hip by appropriate layers and standards. Alice MUST be able to choose each of the four separate roles independently of each other. In cases where a single entity chooses to offer two or more of the roles, Alice should still be able to keep them honest by retaining the right to detach one of the roles to elsewhere because of our standards.

agropper commented 4 years ago

One more thing about this search use-case. Bob, as the requesting party, has a privacy interest as well. Bob may choose to present different credentials depending on whether he's seeking authorization at:

#36 suggests Access Request as the name for Bob's triplet. Whatever we name it, our layers and standards can help drive Bob's request in a privacy-preserving direction.

csuwildcat commented 4 years ago

@agropper PWA = Progressive Web App, the W3C international standards around locally installable, offline-capable web apps. Your app must have an App Manifest (https://www.w3.org/TR/appmanifest/), which allows the browser to natively install the web app on your local machine. The app then has access to Service Workers (https://www.w3.org/TR/service-workers/), which allow for robust offline use, HTTP wire intercepts that serve your app's requests from local cache, etc. You can install PWAs today if you use Chrome, Edge, or Firefox on most platforms. Many sites already take advantage of this, so do check it out.
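For a sense of what the manifest piece looks like, here is a minimal sketch, expressed as a typed object for illustration; in practice it would be a standalone manifest file linked from the page. The member names (name, short_name, start_url, display, icons) come from the W3C App Manifest spec, but the app name and icon values are invented.

```typescript
// Minimal Web App Manifest sketch. The interface covers only a few
// of the spec's members; the values below are illustrative.
interface WebAppManifest {
  name: string;
  short_name?: string;
  start_url: string;
  display: "fullscreen" | "standalone" | "minimal-ui" | "browser";
  icons: { src: string; sizes: string; type: string }[];
}

const manifest: WebAppManifest = {
  name: "Dansdecentralist",
  short_name: "Dans",
  start_url: "/",
  display: "standalone",          // install as a standalone window
  icons: [{ src: "/icon-192.png", sizes: "192x192", type: "image/png" }],
};

console.log(manifest.display); // "standalone"
```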

The goal of the Secure Data Storage work should be to create a robust, universal, replicated datastore that can back a new standard, native Web API that PWA authors can use to achieve a truly serverless app model. In this model, an app dev does not need to know where the user's personal datastore resides at all, they just need to request permission to write data to a subset of it. This allows for apps that have no backend (from a developer's perspective), which is basically the ultimate holy grail for all web devs who have ever lived, as well as a truly scalable model for doing serious decentralized applications (dapps).

csuwildcat commented 4 years ago

Apps that maintain lists of subjects' DIDs and then crawl metadata are not obviously decentralized. "Bob can learn of Alice's specific DID" is unclear. Is Bob using some kind of crawler across all possible DIDs?

There are many ways Bob can learn of Alice's DID:

  1. Meets her at a conference, and they exchange DIDs via NFC/QR
  2. Finds a reference to Alice's DID online, via some blog article or mention of her in an app
  3. Bob runs a DID crawler locally, or uses data from an external one, to traverse the DID-space (some DID methods allow for a flat, globally iterable list of all DIDs in the PKI substrate, while others do not; those that don't will be at a distinct disadvantage in serving many use cases users will come to rely on)
csuwildcat commented 4 years ago

What I hope we can agree on is that we're creating a global, uninterdictable substrate of IDs that allows users to exchange DIDs as peers and can be crawled to ping personal datastores for all sorts of data their owners want to offer either publicly to all, or more secretively to one or more select parties. This DID substrate + datastore combo can be used to decentralize many of the applications that are reliant on the centralized app models of today. My hope with these use cases, and the overarching concepts they represent, is that people understand the goal is not just a DID encrypted Dropbox clone that allows for super double secret probation exchange of your driver's license, which may only represent 1% (or probably less) of the data that will be flying through 'SDS' instances. I would encourage you to read this blog post to more fully understand my position: http://www.backalleycoder.com/2018/01/25/identity-is-the-dark-matter-energy-of-our-world/

cwebber commented 4 years ago

By the way: 3 years ago at an RWoT I asked the ActivityPub lead if he would consider rearchitecting the protocol to include masterless multi-instance operation with active-active replication of CRDT data and permissioned multi-recipient encryption + DID support, but the guy blew me off, so I stopped caring about something that fundamentally is not structured to achieve what I am interested in long-term.

. o O (Am I the ActivityPub guy? Did I blow @csuwildcat off? I don't remember this interaction... I've been thinking about ActivityPub and DID integration and etc for a few years now... it was the first RWoT paper I submitted......)

agropper commented 4 years ago

Thanks, @csuwildcat for the PWA perspective.

I think your goal is to create a DID-compatible online service endpoint to complement and support the PWA ecosystem while also supporting the decentralization and data minimization principles of the SSI community. If so, we agree and share the goal.

The difference between us is probably in the DID-service endpoint approach. I hope that Alice's DID service endpoint points to her online agent. You seem to hope the service endpoint points to her data store. These are vastly different visions but maybe we can find a way to support both as co-equals.

msporny commented 4 years ago

I asked the ActivityPub lead if he would consider rearchitecting the protocol to include masterless multi-instance operation with active-active replication of CRDT data and permissioned multi-recipient encryption + DID support, but the guy blew me off

Haha, that would be @cwebber, who I have never known to blow anyone off... in fact, quite the opposite, his general mode of operation is engaging people that he may not agree with for years at a time and being very thoughtful of the problem space and its solutions.

csuwildcat commented 4 years ago

Perhaps "blow off" is too strong - more like: spent time imploring folks leading the project to work toward significant changes that would tick the required boxes, but folks were not receptive (at least at that time), so discussion ceased. I never really received any reason, besides the desire not to change direction from what they were doing. It's certainly not something I'm mad about, I just didn't give ActivityPub a second thought after that because it lacked the required foundational structure to fit the requirements. I suppose I'm too blunt/empirical sometimes, so no offense intended.

csuwildcat commented 4 years ago

I hope that Alice's DID service endpoint points to her online agent. You seem to hope the service endpoint points to her data store. These are vastly different visions but maybe we can find a way to support both as co-equals.

Why does Alice need an intermediary server to go through for people to access the tweets she is publicly publishing for all to see in plaintext? I have a blog site today, and if I wanted to do it via the DID model, are you saying I would need people to needlessly traverse some online Agent thing to read my blog posts? I see it this way: the data and message storage and relay node (SDS) should be listed in the DID Doc, and you should absolutely be able to talk directly to it. It can't decrypt encrypted data; all it can do is replication and light sorting/indexing, if any metadata to do so is provided for a given object. It also houses some totally public data, and for that data, it will serve it to anyone, because that is the most common case for the vast majority of data a user generates. Believe it or not, the bulk of data most users generate is intended to be consumed by all - just like this very string of identity messages we're posting here, on GitHub. If an Agent is present in the equation, imo, it should act on data/messages the datastore is sent that the datastore itself cannot decrypt (if encrypted). The Agent is basically a web hook-esque consumer service that can hook into messages/data that arrive in the SDS that it needs to do things with, like decrypt and act on.
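The "listed in the DID Doc" arrangement above might be sketched as follows. The `service` array structure (id/type/serviceEndpoint) follows DID Core, but the specific type values ("SecureDataStore", "Agent") and the URLs are illustrative assumptions, not registered or specified values.

```typescript
// Minimal sketch of a DID Document listing both an SDS and an Agent
// as services. Structure follows DID Core's `service` property;
// the type strings and endpoints here are invented for illustration.
interface DidService {
  id: string;
  type: string;
  serviceEndpoint: string;
}

interface DidDocument {
  id: string;
  service: DidService[];
}

const aliceDoc: DidDocument = {
  id: "did:example:alice",
  service: [
    {
      id: "did:example:alice#sds",
      type: "SecureDataStore",                   // illustrative type value
      serviceEndpoint: "https://sds.alice.example",
    },
    {
      id: "did:example:alice#agent",
      type: "Agent",                             // illustrative type value
      serviceEndpoint: "https://agent.alice.example",
    },
  ],
};

// A reader who only wants Alice's public objects resolves her DID and
// talks straight to the SDS endpoint; encrypted messages the SDS
// cannot act on would instead be handled by the Agent.
function endpointFor(doc: DidDocument, type: string): string | undefined {
  return doc.service.find((s) => s.type === type)?.serviceEndpoint;
}

console.log(endpointFor(aliceDoc, "SecureDataStore")); // "https://sds.alice.example"
```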

OR13 commented 4 years ago

Please add comments regarding likes / dislikes for this use case... needs additional opinions...

Consider how this discussion affects layers... and how we can get our high-level interface on a safe lower-level interface...

agropper commented 4 years ago

It's not "needless", it's essential. For privacy reasons, Alice must process the requests for access rather than delegate that to the data store. We need to make that "needless" trip through Alice's (self-sovereign) authorization server so that Alice gets to decide how much her storage provider gets to learn about her through traffic analysis.

Just like we want some content encrypted to keep it away from use by the storage provider, we also have to protect Alice from misuse based on surveillance of Alice's activity, especially the Access Requests by Bobs.

The "sand in the gears" argument is not an excuse for locking in Alice OAuth2-style. It's up to us to enable this separation of concerns as efficient as possible. Arguing for computational efficiency by bundling services into a platform is brings to mind 20th Century clearing houses, not the redecentralization of the Web.

Do not hesitate to ask me how I really feel...

csuwildcat commented 4 years ago

@agropper I don't have to 'process' requests to access my public blog posts, tweets, resume data, and tons of other things I hope the SDS eats alive, and I don't think people want that. What they want is a place to stick all the stuff they publicly publish now, but in a decentralized datastore, not some centralized application silo. I feel as though you may be forcing a very specific paradigm on the SDS that the SDS owner would only want to have in place for a subset (probably a small subset) of the data it houses. The reality is that the vast majority of data you generate in a given day is posted to entirely public places where the express desire is that there is nothing standing in the way of people anonymously viewing it without needing to ask your permission. We are literally doing this right now by posting identity data transmissions (our comments on this Issue) to each other in full public view. I simply want SDS to eat literally everything on the planet, and most things are public.

agropper commented 4 years ago

@csuwildcat I agree with you that public things don't need an authorization server because they are public. They may not need a DID either depending on how authorship and licensing are done. A public posting does not require a secured Access Request #36. Bob can preserve his privacy by using a VPN or alias.

In this case, the only secure thing about a secure data store is write permission, which should go through Alice's authorization server for many good reasons.

It doesn't look like we differ on Access Request processing. What remains to work out is search - the way some requesting parties discover Alice's secure data store. I maintain that the metadata aggregators must be separate from the secure data store by default (privacy by default, separation of concerns) and that Alice might choose to spread metadata among multiple aggregators at any one time.

I see no benefit to bundling search with storage and many downsides in terms of decentralization. Bob's Access Request to the entity that controls the aggregator will typically be different from the Access Request that controls the secure data store. Bundling the aggregator with the data store just increases the privacy risks for both Bob and Alice for very little if any efficiency benefit.

jmandel commented 4 years ago

Reviewing the discussion here, I'm not seeing objections to the proposed use cases. It might be good to split other aspects of discussion into distinct issues.

A key question in this discussion seems to be whether negotiating access is in-scope for the job of a Secure Data Store, or belongs to a different layer in the stack.

I don't think this is a "hubs vs agents" situation. Alice's DID Document can list her data stores and agents. If Bob needs access to Alice's non-public data, it's possible that Alice can just set this up for him (e.g., Alice meets Bob at a party, learns his DID, and directly configures her SDS with a policy to share photos from that day). Alternatively if Bob wants to request / negotiate for access, he can go talk to Alice's agent; the agent can arrange things with the SDS on Bob's behalf, and then Bob can talk to the SDS to read or write data. (To be clear, the agent can do fancy things like checking in with Alice, awaiting payment, etc — and can also do things to help Bob keep his identity private from the SDS, like reviewing some of Bob's credentials and then letting Bob set up an interaction-specific DID talking with the SDS.) None of this obviates Alice's ability to directly configure policies in her SDS.
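The "Alice directly configures her SDS" path described above could be sketched as a simple per-object access policy keyed by DID. The policy shape, the tag field, and the DIDs here are assumptions for illustration only; an agent-negotiated grant would end in the same kind of policy entry being written to the SDS.

```typescript
// Hypothetical access policy an SDS might enforce for non-public
// objects. Shape and field names are invented for illustration.
interface AccessPolicy {
  objectTag: string;        // e.g. a label grouping the party photos
  allowedDids: Set<string>; // DIDs permitted to read those objects
}

// The SDS-side check: is this requesting DID allowed to read
// objects carrying this tag?
function canRead(policies: AccessPolicy[], tag: string, did: string): boolean {
  return policies.some((p) => p.objectTag === tag && p.allowedDids.has(did));
}

// Alice meets Bob at a party, learns his DID, and configures a policy
// sharing that day's photos with him.
const policies: AccessPolicy[] = [
  { objectTag: "party-photos", allowedDids: new Set(["did:example:bob"]) },
];

console.log(canRead(policies, "party-photos", "did:example:bob"));   // true
console.log(canRead(policies, "party-photos", "did:example:carol")); // false
```

In the negotiated variant, Bob's request would first go to Alice's agent, which (after whatever checking or payment it requires) appends the corresponding entry, possibly for an interaction-specific DID Bob minted for privacy.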

I'm going to catch up with @agropper offline next week to try to understand his perspective and concerns better.

venu2062 commented 4 years ago

I also think that there is nothing to suggest that these use cases are not in the scope of SDS. A lot of discussion here seems to be about how the stored data may be discovered, crawled, copied and used.

Going through all these, I have a more fundamental question on the charter of this group (I can create another issue to debate this):

Positive vs Notional control over data

In my mind, all applications and services we use so far give us a notional control over our data in the sense that someone else (most of the time, the providers of the service) can override our control. My point is that this exists in many forms already.

So, I thought that the primary objective of this group is to arrive at proposals and mechanisms for a positive control over the data - control that cannot be overridden (exception, of course, is that it is copied outside the secure store).

msporny commented 4 years ago

I thought that the primary objective of this group is to arrive at proposals and mechanisms for a positive control over the data - control that cannot be overridden (exception, of course, is that it is copied outside the secure store).

Yes, @venu2062 -- I think you are exactly right! It's worth making this point and distinction in the next call.

agropper commented 4 years ago

That's a fine requirement but doesn't it have to be evaluated in terms of real use-cases or is it an issue with the scope of our charter?

venu2062 commented 4 years ago

That's a fine requirement but doesn't it have to be evaluated in terms of real use-cases or is it an issue with the scope of our charter?

It goes to both clarity of scope and the use cases. This group started as part of the Decentralized Identity community, and many use cases that we want to support are relevant here. But the use cases being discussed seem to highlight the need to provide access to others or the public, while missing the control side of it. So I thought it should be pointed out.

venu2062 commented 4 years ago

There may be multiple aspects of positive control to be considered:

  1. Ensure that positive control is retained over private data
  2. Ensure that data exposed to the public is not tampered with

It may not be possible to guarantee either of the above solely by the SDS server layers, but there should be recommendations or data formats that strongly drive toward those goals.

bumblefudge commented 4 years ago

I think the question is not whether granular control over private data is out of scope-- it seems to me we all agree certain kinds of data require the "triplet" Adrian keeps bringing up, and granular external authorization. (If so, I should probably chat with Adrian about putting triplet and that IETF terminology for authorization that Nikos brought up two calls ago in the glossary!). I have seen few comments on Adrian's medical data use case that would lead me to believe this group wants that kind of data pushed out of scope or relegated to a second-class of use case.

The question instead seems to be how to balance the other use cases, where separation of concerns is less important, against the private-data use case, where at least the option of external authorization mechanisms seems essential. If we're trying to serve all four use cases, it might well be that <1% of PWA developers avail themselves of the external authorization option, but if we agree that such an option is necessary for that other use case, it's at least worth specifying how it would look to extend that option (however unappealing most of the time) to the others, right?

OR13 commented 4 years ago

@csuwildcat @dmitrizagidulin to make sure there is a PR for use cases and close this.

bumblefudge commented 3 years ago

Happy Canada Day everyone! Almost the one-year anniversary of this "ready for PR" tag.

agropper commented 3 years ago

My concerns in this issue have resurfaced as preamble to the broader SSI authorization discussion https://github.com/w3c-ccg/community/issues/195