atomicdata-dev / atomic-data-docs

Atomic Data is a specification to make it easier to exchange data.
https://docs.atomicdata.dev

Decentralized persistence and resolving #112

Open joepio opened 2 years ago

joepio commented 2 years ago

As a protocol, Atomic Data is mostly designed with the assumption that HTTP URLs do not change, and will continue to be hosted for as long as needed. In practice, this does not always happen. This basically means that every time you use an externally defined thing (such as a Class or Property), you introduce a dependency. The source may go offline at any time.

We currently deal with this issue by simply caching things server side. In effect, all Properties and Classes that a server encounters are saved locally. This works for this server, but what happens if someone else wants to use this data? If they try to get a Property, for example, that is no longer hosted, they have no alternative means of resolving the URL.
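A rough sketch of that cache-on-fetch behaviour (the `Store` / `Resource` names here are illustrative stand-ins, not the actual atomic_lib API):

```rust
use std::collections::HashMap;

// Illustrative stand-ins; the real atomic_lib Store and Resource look different.
type Subject = String;

#[derive(Clone)]
struct Resource {
    subject: Subject,
    // property / value pairs omitted for brevity
}

struct Store {
    cache: HashMap<Subject, Resource>,
}

impl Store {
    /// Return a locally cached copy if we have one; otherwise fetch the
    /// external subject over HTTP and keep the copy around for later use.
    fn get_or_cache(&mut self, subject: &str) -> Result<Resource, String> {
        if let Some(found) = self.cache.get(subject) {
            return Ok(found.clone());
        }
        let fetched = fetch_remote(subject)?;
        self.cache.insert(subject.to_string(), fetched.clone());
        Ok(fetched)
    }
}

/// Placeholder for a plain HTTP GET asking for JSON-AD. If the host is gone
/// and nobody else has cached the resource, there is no way to recover it.
fn fetch_remote(subject: &str) -> Result<Resource, String> {
    Err(format!("could not resolve {subject}"))
}
```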

I'm looking for a system / protocol that gives users the option to find resources that have gone offline at their original source.

Some considerations:

Let's discuss various approaches to this problem here:

IPFS

A really interesting technology that allows for content-based addressing. See #42

Great for static stuff, but not that great for things that change over time.

The Rust implementation does not offer DHT support as of now, and development appears to have stagnated.

One way that seems particularly interesting to me is to add the IPFS identifier to the HTTP URL. Basically, we get a URL like this: https://atomicdata.dev/someresource?ipfsid=QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdD. That way, the subject contains information about who is in charge / where you should fetch the data (atomicdata.dev), which version was used (the IPFS identifier is a hash of a specific version), and where you can retrieve that data if the HTTP URL does not resolve. Pretty cool stuff, right? See #64 about hybrid identifiers.
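A rough sketch of how a client could use such a hybrid identifier, trying HTTP first and falling back to the embedded hash (the `fetch_*` helpers are hypothetical; the `ipfsid` parameter name comes from the example above):

```rust
use url::Url; // the `url` crate

/// Try the HTTP subject first; if the host is gone, fall back to the IPFS
/// hash that was embedded in the identifier.
fn resolve_hybrid(subject: &str) -> Result<String, String> {
    let parsed = Url::parse(subject).map_err(|e| e.to_string())?;
    let ipfs_id = parsed
        .query_pairs()
        .find(|(key, _)| key == "ipfsid")
        .map(|(_, value)| value.to_string());

    match fetch_http(subject) {
        Ok(body) => Ok(body),
        Err(http_err) => match ipfs_id {
            // The CID pins one specific version, so the fallback returns the
            // version the referring document was written against.
            Some(cid) => fetch_ipfs(&cid),
            None => Err(http_err),
        },
    }
}

// Hypothetical helpers: a plain HTTP GET and an IPFS (gateway or embedded node) fetch.
fn fetch_http(_subject: &str) -> Result<String, String> { unimplemented!() }
fn fetch_ipfs(_cid: &str) -> Result<String, String> { unimplemented!() }
```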

Hypercore

HyperCore is fundamentally a protocol for a replicated append-only log, although it also supports higher-level K/V and filesystem-based data structures. It is a bit like BitTorrent, but more dynamic. Logs have a public key as an address, to which (everyone) can append.

It has a Rust crate, although it is in beta and doesn't seem to be actively developed anymore. Not a problem per se, if its current state is good enough and the code is maintainable. There is also [this one](https://github.com/datrs/hypercore-protocol-rs) from the Dat project.

We could store Atomic Commits using Hypercore. One log per resource, which represents all the changes to that specific resource. The secret key for the Resource is maintained by the one creating the Commits. We could share the public key in the Resource. When this hypercorePubKey is present, the committer also sends the new signed commit to the Hypercore feed.

I think it's probably best if the server maintains the secret key for the feed (by default), because that way it could also append server-side changes made by other users, through authorization. In other words, you could invite a different user to append to your log if certain conditions are met. For example, you could invite others to post messages to your chat room this way - without having to share the secret key to your log.

What would we gain from this? Well, we'd have an extra way to retrieve Commits, even if the server goes offline. At least, in theory, if others have replicated the data.
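A sketch of how mirroring Commits to a feed could look, with a hypothetical `AppendOnlyLog` trait standing in for whichever Hypercore crate gets used (the real crates expose a similar append operation, but their exact APIs differ per version):

```rust
/// Stand-in for a Hypercore feed: an append-only log addressed by a public key.
trait AppendOnlyLog {
    fn public_key(&self) -> &[u8];
    fn append(&mut self, entry: &[u8]) -> Result<u64, String>;
}

/// Simplified Atomic Commit: a signed set of changes to one resource.
struct Commit {
    subject: String,
    signature: String,
    serialized: Vec<u8>, // canonical JSON-AD bytes of the commit
}

/// After a Commit is applied, mirror it to the resource's feed (if the
/// resource advertises a hypercorePubKey), so replicas can still serve the
/// commit history when the HTTP subject stops resolving.
fn mirror_commit(log: Option<&mut dyn AppendOnlyLog>, commit: &Commit) -> Result<(), String> {
    if let Some(feed) = log {
        let seq = feed.append(&commit.serialized)?;
        println!("appended commit for {} at sequence {}", commit.subject, seq);
    }
    Ok(())
}
```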

Custom Merkle tree / DHT implementation

Perhaps it makes more sense to build something custom for Atomic, something lightweight and designed in conjunction with other parts of Atomic.

It needs to:

Nah, too much effort.

Asking atomic servers to resolve external HTTP resources

We currently have the /path endpoint, which accepts any Atomic URL / subject. A client that wants to use some property that appears to be offline could then ask any atomic server for this http://someproperty resource.
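Roughly, from a client's perspective (a sketch; the `subject` query parameter name and the Accept value are assumptions for illustration, not the documented API):

```rust
// reqwest = { version = "0.11", features = ["blocking"] }
use reqwest::blocking::Client;

/// Ask a fallback atomic server to resolve an external subject on our behalf.
/// The `subject` parameter name is an assumption; the /path endpoint is the
/// one mentioned above.
fn resolve_via(fallback_server: &str, subject: &str) -> Result<String, reqwest::Error> {
    Client::new()
        .get(format!("{fallback_server}/path"))
        .query(&[("subject", subject)])
        .header("Accept", "application/ad+json") // ask for the JSON-AD serialization
        .send()?
        .text()
}
```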

This approach has a couple of limitations:

Also, if servers were to serve external content, it may be worth sharing some cache-related metadata (thanks @jonassmedegaard), such as the caching policy and the date fetched / cached.
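For instance, the served copy could carry a few fields along these lines (names are illustrative, not part of any spec):

```rust
/// Illustrative metadata a server could attach when serving a cached copy of
/// an external resource.
struct CachedExternalResource {
    subject: String,        // the original (external) HTTP subject
    fetched_at: String,     // when the copy was retrieved, e.g. RFC 3339
    cache_policy: String,   // e.g. the upstream Cache-Control value, if any
    still_resolvable: bool, // whether the origin responded at the last check
    body: String,           // the cached JSON-AD document itself
}
```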

We could maybe solve this by introducing some form of discoverability. Not sure how that should work, though.

jonassmedegaard commented 2 years ago

What makes sense to me is to distinguish multiple kinds of remote data handling:

AlexMikhalev commented 2 years ago

@jonassmedegaard I like the idea; we can probably model and implement it as a typed DAG.

joepio commented 2 years ago

@jonassmedegaard good ideas, these different use cases deserve clear distinctions. The process of forking (cloning + editing) external resources should be specified as well. In the last three use cases that you mention, the subject of the resource should probably change. A clone / mirror or fork (mirror + changes) will be a separate resource from the original one, although they definitely need a well-specified relationship (i.e. a clear predicate).

In the opening post, I was mostly thinking about the first use case: referenced data, where the subject remains constant. With a constant subject, decentralized resolvability becomes a challenge.

jonassmedegaard commented 2 years ago

I disagree: I think only a cloned resource should be renamed and tracked as a separate issue - a resource that is mirrored is still the same, including subject.

I.e. what I mean by a mirrored resource is owl:sameAs.

jonassmedegaard commented 2 years ago

I imagine the Atomic Server would only cache at first, and if at a later refresh of the cache the resource had gone, then flag it as needing action. One action could be to try again, another could be to drop the resource (flagging all inverse dependencies of it as needing action - a different action, involving cutting loose dead parts), and a third action could be to promote the cached copy to either a mirror (which would be treated as read-only) or a fork (which would be read-write by those with appropriate access rights).

I guess the default proposed action might be different based on how the resource had disappeared - e.g. an unreachable host or a 5xx response might lead to proposing the action of turning it into a copy, whereas the host serving different content might lead to proposing turning the cached copy into a forked resource.
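A sketch of those states and default proposals as plain enums (the names and the explicit 404 case are illustrative additions, not from the comment above):

```rust
/// Ways a cached external resource can turn out to be "gone" at refresh time.
enum RefreshFailure {
    Unreachable,    // host down, timeout, or 5xx: the origin may come back
    ContentChanged, // the origin now serves something different
    NotFound,       // explicit 404 / 410 (an added case, not from the comment)
}

/// Actions proposed to the owner when a refresh fails.
enum ProposedAction {
    RetryLater,
    Drop,            // and flag inverse dependencies as needing action
    PromoteToMirror, // keep the cached copy, read-only
    PromoteToFork,   // keep the cached copy, read-write for authorized users
}

/// Default proposal per failure mode, roughly following the comment above.
fn default_action(failure: &RefreshFailure) -> ProposedAction {
    match failure {
        RefreshFailure::Unreachable => ProposedAction::PromoteToMirror,
        RefreshFailure::ContentChanged => ProposedAction::PromoteToFork,
        RefreshFailure::NotFound => ProposedAction::RetryLater, // or Drop, after repeated failures
    }
}
```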

joepio commented 2 years ago

I disagree: I think only a cloned resource should be renamed and tracked as a separate issue - a resource that is mirrored is still the same, including subject.

I.e. what I mean by a mirrored resource is owl:sameAs.

Every place I've seen owl:sameAs used is where the subject is different from the object. Otherwise, we'd get example:a owl:sameAs example:a. The sameAs relation denotes semantic equivalence, not mirroring of the actual triples.

But I think we probably mean the same thing by mirroring: creating a local clone of some external resource, using a different subject (or else the mirrored resource could not be resolved), and considering the values immutable. What do you think?

jonassmedegaard commented 2 years ago

Right.

I think I got confused about the very essential meaning of this issue: If you truly are talking about "same subject" then there cannot be any decentralization, only caching of One True Source - because that one single source is the subject.

jonassmedegaard commented 2 years ago

I.e. you cannot "embed" an external resource without its RDF subject changing, and you cannot offer decentral resolving because there is only one authoritative source (every indirect access can only possibly be cached data).

So essentially this very issue cannot be about "same subject" (and therefore I simply ignored that sentence in your previous post here).

...or what am I missing?

jonassmedegaard commented 2 years ago

I mean, either we are talking about resources that each can only have one true identifier, or we are talking about resources that each can have one or more identifiers (i.e. semantically equivalent identifiers).

Which is it?

joepio commented 2 years ago

I.e. you cannot "embed" an external resource without its RDF subject changing, and you cannot offer decentral resolving because there is only one authoritative source (every indirect access can only possibly be cached data).

Decentralized resolving with only one authoritative source is not necessarily impossible:

I mean, either we are talking about resources that each can only have one true identifier, or we are talking about resources that each can have one or more identifiers (i.e. semantically equivalent identifiers). Which is it?

I'm talking about a single, decentralized, resolvable identifier. Semantically equivalent identifiers are interesting for other reasons, but not for this issue.

Still, I definitely believe that you make a valid distinction in your first comment: there are multiple reasons for using remote data, and various types of relationships between source and user.

jonassmedegaard commented 2 years ago

Ok, so when you say "embed" here, you really only mean "cache". With that constraint it makes sense to me (otherwise not).

joepio commented 2 years ago

Ok, so when you say "embed" here, you really only mean "cache". With that constraint it makes sense to me (otherwise not).

When I mention embed, I'm talking about embedding the actual application / dependency that deals with decentralised resolving. For example, an embeddable IPFS implementation, or a different type of library that can be embedded in the Atomic-Server binary. I want to prevent introducing a runtime dependency. I'll edit the OP to make this a bit clearer.

jonassmedegaard commented 2 years ago

Makes more sense now. Thanks!

Essentially this is related to the "...but what if the web is lost" problem of networked resources. Centralized designs "solve" this by making the central point stronger. Some decentralized systems "solve" this by having each client store a full copy of the whole web (e.g. blockchain designs). Some decentralized systems acknowledge that this is not sensibly solved fully, only loosely addressed by viewing the web as a moving target - an organic mesh of pieces that each may disappear.

This is the reason I laid out the ways to secure knowledge about external data points - recognizing that they may disappear.

You can replace http identifiers with all-is-on-a-blockchain identifiers, but that does not change the fact that data points may get lost, it just changes how it happens: with blockchains they get lost when the web grows too large to truly be fully mirrored, and techniques to "omit the less important bits" then occasionally optimize away the bits that you need.

So sure, you can choose to use IPFS identifiers instead of http identifiers, shifting your choice of underlying "weaving tech" for the web you want your system to rely on. And then embed the code to handle that identifier type.

I would be sad if you chose to use only blockchain-based IDs for Atomic Data, because that would mean largely losing the ability to weave a web of both Atomic Data and Solid nodes.

joepio commented 2 years ago

I would be sad if you chose to use only blockchain-based IDs for Atomic Data, because that would mean largely losing the ability to weave a web of both Atomic Data and Solid nodes.

I fully agree that completely moving to blockchain IDs would be a bad approach. Most (if not all) blockchain solutions are far too slow, anyways.

You can replace http identifiers with all-is-on-a-blockchain identifiers, but that does not change the fact that data points may get lost

True, data can always get lost. But there are some characteristics of Atomic Data that could help make it less likely that data becomes lost. If every server that has a dependency on some external resource also caches this resource and advertises its caching to others, we get a degree of redundancy that makes it far less likely that critical information gets lost. Finding a mechanism that enables this, though, seems pretty complicated.
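One loosely sketched shape such an advertisement could take (purely hypothetical; none of these names exist in the Atomic Data spec):

```rust
/// Hypothetical "cache advertisement": a resource a server could publish so
/// that peers know which external subjects it can serve from cache.
struct CacheAdvertisement {
    advertiser: String,           // e.g. the advertising server's base URL
    cached_subjects: Vec<String>, // external subjects this server keeps copies of
    last_verified: String,        // when the copies were last checked upstream
}

/// A client that fails to resolve `subject` directly could walk through the
/// advertisements it knows about and ask those servers for a cached copy.
fn fallback_candidates<'a>(subject: &str, ads: &'a [CacheAdvertisement]) -> Vec<&'a str> {
    ads.iter()
        .filter(|ad| ad.cached_subjects.iter().any(|s| s == subject))
        .map(|ad| ad.advertiser.as_str())
        .collect()
}
```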

jonassmedegaard commented 2 years ago

When limited to caching, protocol-specific rules for caching must be obeyed. I.e. CacheControl header for http protocol.

It confuses me that you mention IPFS and Hypercore if what you want is to (also) cache http identifiers.

jonassmedegaard commented 2 years ago

I mean, what you can do to aid other Atomic Data servers in long-term caching your data is to add a CacheControl header with a long expiry time (and then treat that identifier as immutable for that same amount of time, obviously!).

And what you can do to cache data from external Atomic Data servers is to store locally a cached copy but only for as long as that external server signaled in their CacheControl header that you are permitted to do so.

Other protocols may have other efficient cache management, but those features are irrelevant for caching of http-based identifiers.
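For the HTTP side this boils down to a response header; a minimal actix-web sketch (atomic-server does use actix-web, but this handler is illustrative, not its actual code):

```rust
use actix_web::{get, HttpResponse, Responder};

/// Serve a resource with a long expiry, signalling to other servers that they
/// may keep a cached copy for a year - which also means promising not to
/// change the resource under this identifier during that time.
#[get("/immutable-resource")]
async fn immutable_resource() -> impl Responder {
    HttpResponse::Ok()
        .insert_header(("Cache-Control", "public, max-age=31536000, immutable"))
        .content_type("application/ad+json")
        .body("{}") // the serialized JSON-AD resource would go here
}
```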

joepio commented 2 years ago

When limited to caching, protocol-specific rules for caching must be obeyed. I.e. CacheControl header for http protocol.

It confuses me that you mention IPFS and Hypercore if what you want is to (also) cache http identifiers.

CacheControl is not for dealing with resources that go offline (i.e. 404 / server timeout), which is the scope of this issue. CacheControl is just for increasing performance and preventing re-fetching of big documents. I haven't implemented that in Atomic Server, as the (often very small) resources themselves take about 0.2 milliseconds to fetch from disk.

Just to be clear: maybe CacheControl has its merit here, too, but it definitely does not solve the issue I'm trying to solve here: resources that go offline, and having means to find them if the HTTP URL no longer works.

jonassmedegaard commented 2 years ago

The hitbox crate and this issue tracking its integration with the actix-web crate might be relevant.

jonassmedegaard commented 2 years ago

...and other more generic cache handling issues like this one.

My point being that this sounds specific to cache handling - which for http is tied to the rules for caching defined as part of the http protocol.

jonassmedegaard commented 2 years ago

I was mostly thinking about the first use case: referenced data, where the subject remains constant.

[HTTP CacheControl] definitely does not solve the issue I'm trying to solve here: resources that go offline, and having means to find them if the HTTP URL no longer works.

When a resource disappears from a web, then either you violate CacheControl rules by continuing to serve a cached copy beyond its expiry time, claiming that what you serve represents that URI, or you admit that you are serving a copy of a resource that at a certain point in time had a certain identity.

It might make sense to track mutable data as a separate issue, but I dare say that this is exactly about the second form in my list: a read-only copy of a resource.

DougAnderson444 commented 1 year ago

FYI Rust-Libp2p (networking stack behind IPFS) supports DHT:

https://github.com/libp2p/rust-libp2p/tree/master/protocols/kad
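A sketch of how a Kademlia-style DHT could serve the fallback-resolving idea in this issue, using a hypothetical `Dht` trait rather than the actual rust-libp2p API (which changes between versions):

```rust
use std::collections::HashSet;

/// Hypothetical minimal view of a Kademlia-style DHT; the real rust-libp2p
/// kad behaviour offers comparable provider operations with a different API.
trait Dht {
    /// Announce that this node can serve the record behind `key`.
    fn start_providing(&mut self, key: &[u8]);
    /// Ask the DHT which peers claim to provide `key`.
    fn get_providers(&mut self, key: &[u8]) -> HashSet<String>;
}

/// Use the HTTP subject as the DHT key, so a client holding a dead URL can
/// still discover peers that keep a copy of the resource.
fn announce_cached_subject(dht: &mut dyn Dht, subject: &str) {
    dht.start_providing(subject.as_bytes()); // a real setup would hash the subject
}

fn find_mirrors(dht: &mut dyn Dht, subject: &str) -> HashSet<String> {
    dht.get_providers(subject.as_bytes())
}
```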

I've been familiar with both Hypercore (formerly known as the Dat Protocol, now called Holepunch) and IPFS (IPNS, IPLD, Libp2p) since 2018/2019, and they have both evolved a lot since then. The IPFS ecosystem is well funded and well supported, seemingly more so than Hypercore/Holepunch, so IPFS adoption and name recognition are more prevalent; plus IPFS seems to have better browser support overall (not every user wants to download something to get started).

The very slow IPNS is due to the DHT, but IPNS can be sped up by using pubsub, and there is also a new initiative called the "Name Name Service" (NNS) which may replace IPNS in the future. Personally I am working on a zk Delegated Anonymous Credentials name system, which could offer a robust naming solution across mesh nodes.

Personally, after years of research and development in this area, I am leaning towards nodes that can be run at home with no domain name (TLS) requirement -- which means WebRTC Data Channels over Rust-compiled nodes, with the data being persisted and resolved across those nodes.