We need to make a decision about what the underpinning file structure and file association model will be, and get it specified to the point we can be confident an implementer can act on it, and hopefully leverage code from the community to fill the gap.

@expede, @bmann, @justinwb, @carsonfarmer, @wyc I think this would make a ton of sense for the first implementers call, and am hoping we can get this spec'd such that we can use an existing module from the community.
Does this refer to the file structure exposed in an API, or the actual underlying data structure of the event log?
From the interoperable standards perspective, perhaps we could get away with just standardizing the format, encryption, and linking between objects, given you could store/log them differently under that (I suppose?)
Yes, I think coming to an agreement on the “high-level” data APIs, seems (almost) doable. For sure there will have to be consensus on “file-level” encryption and linking (which should probably just be IPLD primitives), but for instance, I wouldn’t expect everyone to agree to adopt cryptree (even though it does appear to rock!). For format, I think maybe some IPLD-based representation of “folders or buckets” would work. If we want easy interop with IPFS/Filecoin, then unfortunately, UnixFS seems like the right move.
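To picture the "folders or buckets" idea, one hypothetical dag-cbor shape might look like this (UnixFS itself encodes directories as dag-pb with its own protobuf schema, so this is purely illustrative):

```ts
import { CID } from 'multiformats/cid'

// Hypothetical dag-cbor "folder" node: a map of child names to IPLD links.
// UnixFS uses dag-pb with its own schema instead, so treat this only as a
// sketch of what a bespoke IPLD representation could look like.
interface Folder {
  type: 'folder'
  entries: Record<string, CID>   // child name -> CID of a file or subfolder block
}
```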
Having said all that, our specification should mandate that things can be represented, queried, and accessed in this way, but that implementers are free to store data in different ways. Something like that seems reasonable. Otherwise, it's unlikely we’re going to get much of a consensus at all.
I guess the other thing to keep in mind is that we’re just talking about Identity Hub stuff here… this is going to fit lots of use cases, but adopting something here, doesn’t mean we can’t adopt something different elsewhere in our own stacks…
Just some Friday thoughts. I look forward to the call next week.
@carsonfarmer I think the litmus test for success will be: can Alice have Implementation 1 of this thing running on Device X and Implementation 2 running on Device Y, and have the two sync with each other and function as two replicated instances of the same logical whole, with the same capabilities/behaviors.
@oed can you post some materials here about DAG-JOSE, so folks can check that out? Question: does it introduce standard linking structures/props for objects and their subsequent mutations? (e.g. delta-based CRDT objects modifying a state of a target object)
Check the IPLD spec: dag-jose. I'll go more in depth on how it works and how to use it with DIDs on our call tomorrow. DAG-JOSE doesn't prescribe any specific data structure, except that the payload needs to be a CID. So you can link any data structure you desire!
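To make the "payload needs to be a CID" constraint concrete, here's a minimal sketch assuming the js multiformats package and Node's base64url decoding (the DagJWS shape follows the general JWS JSON serialization):

```ts
import { CID } from 'multiformats/cid'

// General JWS JSON serialization; per dag-jose, the payload must be the
// base64url-encoded bytes of a CID, so the envelope can sign any IPLD block.
interface DagJWS {
  payload: string                                        // base64url(CID bytes)
  signatures: { protected?: string; signature: string }[]
}

// Recover the linked CID from an envelope; throws if the payload is not
// a well-formed CID, which dag-jose requires it to be.
function linkedCid(jws: DagJWS): CID {
  const bytes = new Uint8Array(Buffer.from(jws.payload, 'base64url'))
  return CID.decode(bytes)
}
```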
@oed I suppose IPLD itself encodes the lineage of the objects that link to each other, so maybe we don't need to specify anything further about what is a root, parent, child, etc. in this spec, if that is already implicitly handled for us?
Correct, by embracing IPLD, one gets self-describing “linked data” out of the box more or less…
@carsonfarmer I just wasn't sure if it provided for all the sorts of traversal of a logical object's lineage that we would need, but if all that is baked in, then yes, I think file structure is not really an issue. I have only really ever used IPFS for singular files that have no complex connections to one another, so I will try to read up more before tomorrow.
Well the key point here is, we need to loosely define our data structure such that there are linkages to previous states, if that is the thing you want to capture. For instance, in Threads, every update encodes the core payload and a link to the previous update (among other things) as an IPLD link. This makes traversal to past states possible, and it also facilitates things like incremental updates, syncing history, etc.
But IPLD itself, being a purely functional data structure, doesn't have any concept of history itself. It is just a Merklized data structure at the end of the day.
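A minimal sketch of that shape, assuming js multiformats and @ipld/dag-cbor (field names are hypothetical, not Threads' actual schema):

```ts
import * as dagCbor from '@ipld/dag-cbor'
import { sha256 } from 'multiformats/hashes/sha2'
import { CID } from 'multiformats/cid'

// Hypothetical record: the real Threads schema differs, but the essence is
// "core payload plus an IPLD link to the previous update".
interface LogRecord {
  body: CID          // link to the core payload block
  prev: CID | null   // link to the previous record; null for the genesis record
}

// Encode a record as dag-cbor and derive the CID that the *next* record
// will reference in its `prev` field, forming a traversable history.
async function appendRecord(body: CID, prev: CID | null) {
  const record: LogRecord = { body, prev }
  const block = dagCbor.encode(record)
  const hash = await sha256.digest(block)
  return { cid: CID.create(1, dagCbor.code, hash), block }
}
```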
@carsonfarmer makes sense, and sounds like Threads does what is required in this area in a rather straightforward way (what you described above is basically how we wrote it up for the first Identity Hub prototype many years ago that didn't use IPFS).
I wonder where something like dag-jose falls into the mix? @oed, @carsonfarmer how might something like the way Threads is structured play with DAG-JOSE? Are these isolated enough in function that they are cleanly mergeable concepts? (for example: what might it take to use DAG-JOSE formatted/encrypted objects in Threads?)
@oed seems like we could leverage much of the linking/association conventions and formats of Threads by codifying a standard that uses the props/values/mechanics Threads currently encodes into its files/associations as header values in the dag-jose formatted objects?
Yup. Honestly there are multiple possible routes here. 1) dag-jose is used as the core ipld-codec for a thread (at the top level), 2) dag-jose is used for the record payloads and we stick with dag-cbor at the thread level, 3) some combination of these, etc.
@carsonfarmer could you speak to the tradeoffs between the options you presented? Also, @oed, do you have any opinions about the options Carson noted?
Well, dag-jose is really only meant to encode the signature/encryption envelope by itself. That can then link to the main data structure. To me the bigger question is how we format the linked data structure itself. I think the optimal outcome would be if we could achieve something close to a CRDT on a UnixFS tree structure. But honestly there are a huge number of nuanced tradeoffs here that we need to discuss.
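To illustrate the split, a minimal sketch using js multiformats and @ipld/dag-cbor (the signing step is elided; the point is that the envelope only ever signs a CID):

```ts
import * as dagCbor from '@ipld/dag-cbor'
import { sha256 } from 'multiformats/hashes/sha2'
import { CID } from 'multiformats/cid'

// 1. The application-level data structure is an ordinary dag-cbor block.
const data = { name: 'profile', fields: { displayName: 'Alice' } }
const block = dagCbor.encode(data)
const cid = CID.create(1, dagCbor.code, await sha256.digest(block))

// 2. The dag-jose envelope signs only the CID of that block:
//    payload = base64url(cid.bytes). How the block itself is structured
//    (folders, logs, CRDTs) is entirely up to us.
const envelopePayload = Buffer.from(cid.bytes).toString('base64url')
```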
@oed @carsonfarmer @expede I want to pose a purely hypothetical question as a way to better see the forest for the trees with regard to how we can have interop across all the roughly similar solutions out there:
What would be the component in each of your projects (Thread, Tile, Bucket, WNFS, etc.) you could modify so that the base-level structures are the same and can be sync'd across one another? Another hypothetical: say for the reference implementation we picked one of the Textile primitives, and assume it was modified at the object/structural level to adhere to the shared standards; would there be any issue with that from your perspectives? How can we most quickly get code on a track that ensures the reference implementation is not 1) some one-off thing that can't interop with each of yours, or 2) something that only works with the project whose code we leaned on?
I'm not sure I have a specific answer to your question here per se, but I do like the setup. Let's indeed assume we have settled on the above two "standards". Perhaps a really great first step is to just spec out what a minimally-compliant data structure based on Cryptree and DAG-JOSE looks like. In doing so, we might get a better idea of what, on each of our respective stacks, would need to be changed/added to get there. For Textile, we'd likely have to support Cryptree at our "Buckets" API level, and this would require a Go implementation of Cryptree etc. There's no way we'd approach such work without a specification that we all agreed on first. Obviously even better would be an existing implementation that matches the spec, but we wouldn't require that per se before work might be able to start on our end.
The nice thing about focusing on this aspect of the specification is that it is a very targeted goal, with a really clear outcome: A minimal specification that we all would consider provides sufficient interoperability for working across projects. I'm happy leaning heavily on the existing Cryptree paper and @expede's optimizations, so the actual amount of new content to write would be pretty minimal, in my mind?
For Ceramic we are also interested in a minimal spec as @carsonfarmer suggests. We would like to support this as a Ceramic StreamType, so an agreed upon spec seems like a good first step!
> A minimal specification that we all would consider provides sufficient interoperability for working across projects.
Also agreed on a minimal spec first. Can we scope what we mean by "minimal" here? Just the low-level data structures? Do we want to expose an interface, or actually all use the same low-level data layout? Should write auth be included in the spec?
> We would like to support this as a Ceramic StreamType
These are the exact kinds of questions that are still very open in my mind. My guess is that a stream probably won't support encrypted hierarchical changes at the data level? Fission's implementation goes out of its way to hide the order in which things were done, but we do let peers stream updates over secure, authenticated channels in realtime. Maybe support coarse DAG diffing in the stream? I do think that it's possible to do this at the materialized data layer, which would make this sync agnostic, but again a question of scope.
@expede I think the issue of requirements shakes out like this:
Primary goal: Two 'Identity Hub' compliant nodes can be used by Alice and will sync, store, and interact with each other and calling entities to achieve a shared state.
For me, this seems to imply:

1. Use the same object linking/relationship structures
2. Use the same object security/format wrappers
3. They need to use the same encryption scheme
4. Use the same permissioning/capability scheme, so enforcement is unified across them
5. Must offer a minimum set of APIs that they all support, so devs can reliably interact across them
6. Must sync data in a way they all can participate as masterless replicants
^ I feel as though this list is the brass tacks for achieving the desired outcome, because if you drop any one of them, you end up with one-off implementation silos (e.g. an instance developed or run by MSFT can't sync/interact with an instance developed/run by Fission, Textile, Ceramic, Spruce, etc.)
What do you all think about this list, and the implications of it?
That's a long list — pretty much a top-to-bottom spec. I really like the idea of 5, since it implies that there's a high level interface, but it feels like that's contraindicated by the low-level constraints in 1-4.
> ^ I feel as though this list is the brass tacks for achieving the desired outcome, because if you drop any one of them, you end up with one-off implementation silos (e.g. an instance developed or run by MSFT can't sync/interact with an instance developed/run by Fission, Textile, Ceramic, Spruce, etc.)
I continue to feel misaligned on this as a goal. Why be forced to replicate the entire data store across multiple providers? Is that useful? An alternative is a high-level interface that abstracts over a bunch of this detail, no? We can also support read-interop before we go into write-interop and sync (which seem much more complex and contentious). Read interop already goes a long way in breaking down the walls between data silos, no? It would be possible to do a higher level access API that doesn't even depend on the same encryption scheme (though encryption seems the easiest to align on and make extensible).
> Can we scope what we mean by "minimal" here?
I think it would be nice to start at the low-level data structures.
> These are the exact kinds of questions that are still very open in my mind. My guess is that a stream probably won't support encrypted hierarchical changes at the data level? Fission's implementation goes out of its way to hide the order in which things were done, but we do let peers stream updates over secure, authenticated channels in realtime. Maybe support coarse DAG diffing in the stream? I do think that it's possible to do this at the materialized data layer, which would make this sync agnostic, but again a question of scope.
So a Fission peer maintains just the current set of DAG tips/roots? Then syncs these with other peers when needed? This is definitely something we are looking to support in Ceramic. I suppose the main difference would be that Fission peers communicate in private, while Ceramic is a more public network right now.
@expede I think we may be a bit closer than we think on more than just Read-level interop, if we can make the spec open enough to get some basic, standard requirements in place - for example:
> Use the same object linking/relationship structures
If the spec constrained itself to only talking about logical objects, assuming a simple flat, atomic datastore, I believe it should be possible to overlay/integrate a more complex structure over the top (e.g. some tree structure). If we can keep this ultra light, I think we could still achieve effective interop.
> Use the same object security/format wrappers
Success! We've selected DAG-JOSE and DAG-CBOR
> They need to use the same encryption scheme
Would it be possible to specify a set of curves/algs/scheme options at the atomic object level, and allow that to be pluggable?
> Use the same permissioning/capability scheme, so enforcement is unified across them
I don't see why we can't have permissions/caps formulated in the same way, even if the encryption strategies are pluggable. Perhaps others know why this might be?
> Must offer a minimum set of APIs that they all support, so devs can reliably interact across them
I definitely think we can put a set of top-level HTTP/object based API formulations in place that can work across implementations (a rough sketch of what I mean is at the end of this comment).
> Must sync data in a way they all can participate as masterless replicants
Syncing data is different than dealing with the payloads of data, so I do think (assuming we can land on a basic association structure) syncing the current state of held objects is something we can unify.
All in all, I think the two biggest question marks are around encryption and permissioning/caps. However, if we can simplify the mandatory set of normative requirements, we can probably make those areas less contentious.
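On the API point above, even something as small as this might be enough to start aligning on (a hypothetical sketch; verbs, names, and shapes are illustrative only):

```ts
// Hypothetical minimal surface for an Identity Hub instance.
interface HubApi {
  // Resolve a logical object's current state (decrypted if the caller is authorized).
  read(objectId: string): Promise<Uint8Array>
  // Append a new state to a logical object.
  write(objectId: string, payload: Uint8Array): Promise<{ cid: string }>
  // Enumerate current roots/tips, for sync and discovery.
  roots(): Promise<string[]>
}
```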
@oed your comment made me wonder if we could agree on a basic sync strategy that doesn't imply too much beyond what can likely be done in all implementations: if all implementations can at least think of all their logical objects in an atomic way, even if they want to add more complex associations/layers, wouldn't it be possible to specify a mechanism for sync that ensured all instances knew about the latest logical object roots/tips, then sync'd them (using IPFS primitives where applicable) across one another? For example: If the spec said "Every logical object shall be represented as an IPLD structure, and when a sync is started, two instances diff the set of roots/tips to replicate across any missing nodes"?
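A minimal sketch of that diff-and-replicate idea, assuming a hypothetical BlockStore interface with a links() accessor for a block's IPLD links:

```ts
import { CID } from 'multiformats/cid'

// Hypothetical block store; any IPFS-backed store could be adapted to this shape.
interface BlockStore {
  has(cid: CID): Promise<boolean>
  get(cid: CID): Promise<Uint8Array>
  put(cid: CID, block: Uint8Array): Promise<void>
  links(cid: CID): Promise<CID[]>   // IPLD links contained in the block
}

// Walk from the remote's advertised roots/tips and copy any block we don't
// already hold. Assumes a held block implies its whole subtree is held,
// which the Merkle structure gives us for completed prior syncs.
async function syncFromRoots(remoteRoots: CID[], remote: BlockStore, local: BlockStore) {
  const queue = [...remoteRoots]
  while (queue.length > 0) {
    const cid = queue.pop()!
    if (await local.has(cid)) continue
    await local.put(cid, await remote.get(cid))
    queue.push(...await remote.links(cid))
  }
}
```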
Please do correct me if I'm misunderstanding. We're looking for a minimal scope to get started. What I'm reading above sounds like we currently have the following in scope for the first pass:
These are all absolutely possible to do, but none of these are trivial items. Is it not advisable to start with something smaller, get some wins, and move forward? Standards take time and effort to gain agreement and work through the details and edge cases. Is there a way to eat the elephant one part at a time?
Not in scope today:
More detailed questions and clarifications below.
> Would it be possible to specify a set of curves/algs/scheme options at the atomic object level, and allow that to be pluggable?
I think so, yes 👍 IMO this read-access spec would be an easy win to start on
> Fission peers communicate in private
We communicate over the public IPFS network, but with encrypted data. Changes are signaled to peers with DNSLink and pubsub (both over public infrastructure).
> If the spec constrained itself to only talking about logical objects, assuming a simple flat, atomic datastore, I believe it should be possible to overlay/integrate a more complex structure over the top (e.g. some tree structure). If we can keep this ultra light, I think we could still achieve effective interop.
Can you describe this in more detail? I'm unclear how one would build up the additional structure reliably. Do you mean an extensible event stream?
> So a Fission peer maintains just the current set of DAG tips/roots? Then syncs these with other peers when needed?
Roughly, yes. Each WNFS holds its current Merkle root, and those of anyone else it's interested in. It's public IPFS underneath, so the IPFS node checks the WNFS root via (e.g.) DNSLink and then fetches any missing blocks to have two, potentially diverged trees. The fetcher performs a coarse grained (hierarchical, file-level) resolution, which analyzes if it's ahead or behind or actually diverged and from where. Since IPFS doesn't do encryption out of the box, we've added a bunch of extra mechanisms to keep data and metadata hidden from providers, and let this merge happen incrementally by users that have access to that portion of the subgraph.
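Not our actual code, but the coarse ahead/behind/diverged classification could be sketched roughly like this, assuming a hypothetical history accessor that yields a root's prior roots newest-first:

```ts
import { CID } from 'multiformats/cid'

type Relation = 'equal' | 'ahead' | 'behind' | 'diverged'

// Coarse classification of two replicas' roots.
async function relate(
  local: CID,
  remote: CID,
  history: (root: CID) => AsyncIterable<CID>
): Promise<Relation> {
  if (local.equals(remote)) return 'equal'
  for await (const r of history(local)) {
    if (r.equals(remote)) return 'ahead'   // remote's root is in our past
  }
  for await (const r of history(remote)) {
    if (r.equals(local)) return 'behind'   // our root is in remote's past
  }
  return 'diverged'                        // neither contains the other: merge needed
}
```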
"Every logical object shall be represented as an IPDL structure, and when a sync is started, two instances diff the set of roots/tips to replicate across any missing nodes"?
This is roughly what WNFS does 👍 For the data to remain coherent across replicas while allowing writes on any of them, we would need to agree on a merge strategy, which means either working from low-level (IPLD) primitives or through a high-level semantic API. That doesn't feel like a minimal scope to me, but maybe I'm wrong and it's already a solved problem on both public and encrypted data?
Are others retaining both histories in merges? What are your merge strategies like @oed and @carsonfarmer?
> Syncing data is different than dealing with the payloads of data, so I do think (assuming we can land on a basic association structure) syncing the current state of held objects is something we can unify.
Do you mean replace the local root with the root from the other data store? Does this one also pull in agreeing on a low-level CRDT and auth as well?
> I think it would be nice to start at the low-level data structures.
@oed Interesting; why low-level structures in particular? Won't that constrain us from having higher level abstractions / differences in implementation? Perhaps we're using different definitions of low-level? I'm taking it to mean the actual IPLD layout of the node, which pointers it has, file headers, linking structure (child/parent, historical/versioned), and other typical file system implementation details (i.e. how the inodes are laid out in storage).
> I definitely think we can put a set of top-level HTTP/object based API formulations in place that can work across implementations.
👍 Awesome
> I don't see why we can't have permissions/caps formulated in the same way, even if the encryption strategies are pluggable. Perhaps others know why this might be?
Like the rest, this is also totally doable! I'm just trying to get a scope for what we mean by "minimal scope", and that is a whole other set of specifications from sync.
> Are others retaining both histories in merges? What are your merge strategies like @oed and @carsonfarmer?
Currently we use an "earliest anchor rule" which means that the event that was anchored into a blockchain first will be selected. This is used in order to achieve secure key revocation for our DID method. We plan on supporting multiple merge strategies in the future.
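In sketch form (field names illustrative, not our actual anchor record format):

```ts
// "Earliest anchor rule": among conflicting tips, keep the one whose anchor
// proof landed on-chain first.
interface AnchoredTip {
  cid: string
  anchor: { blockNumber: number; txIndex: number }
}

function selectCanonicalTip(tips: AnchoredTip[]): AnchoredTip {
  return tips.reduce((a, b) =>
    a.anchor.blockNumber < b.anchor.blockNumber ||
    (a.anchor.blockNumber === b.anchor.blockNumber && a.anchor.txIndex <= b.anchor.txIndex)
      ? a
      : b
  )
}
```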
> why low-level structures in particular?
If we want to allow syncing between idHubs (on the libp2p layer) we need to have the same IPLD layout so that implementations can understand each other's data. Otherwise, if we just speak some standard HTTP API, it doesn't even make sense to standardize around the use of IPFS at all IMO.
I agree with @oed that the structure, security wrapper, and descendant node linkages for logical objects in the system must be the same, because if those aren't the same, we wouldn't be able to reliably sync data for backup to other instances with even a basic replication protocol (e.g. diff a linear log of pinset history between nodes).
A level of detail below that, if we used a strategies model for encryption, client payload data merge, etc., it would allow interop on how to deal with those facets of object/data handling. Even if we don't all use a common set of strategies immediately as we're doing the spec, that type of plumbing will allow for convergence of implementations after the spec and recommended strategies are codified.
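For example, the spec could fix strategy interfaces like these (a hypothetical sketch) and let implementations register the concrete strategies they support:

```ts
// Hypothetical strategy interfaces: the spec would fix these shapes, while
// implementations register whichever concrete strategies they support.
interface EncryptionStrategy {
  id: string   // e.g. a registered curve/alg identifier
  encrypt(plaintext: Uint8Array, key: Uint8Array): Promise<Uint8Array>
  decrypt(ciphertext: Uint8Array, key: Uint8Array): Promise<Uint8Array>
}

interface MergeStrategy<S> {
  id: string
  merge(local: S, remote: S): S   // must be deterministic on every replica
}

const encryptionStrategies = new Map<string, EncryptionStrategy>()
const registerEncryption = (s: EncryptionStrategy) => encryptionStrategies.set(s.id, s)
```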
> if we just speak some standard HTTP API, it doesn't even make sense to standardize around the use of IPFS at all IMO.
Right, but using IPFS isn't the goal; data interop across systems is. I mean, sure, we can work through what the specific data structures look like, but with the understanding that we're going to need to heavily converge our systems.
@csuwildcat okay, sure, we can tackle the entire stack at once. It's just going to take a while, which is fine, but we need to be really clear on the process tradeoffs that implies.
The structure of the messages is now stabilized in the spec. Closing for now; please open any issues against the existing spec structure.