Discussion: options for hypercore feed-level metadata

bnewbold commented 6 years ago

Motivation: have a way to annotate the "type" of feed contents. For example, determine if you're looking at a hyperdb key/value feed, a hyperdrive, or some other thing. A requirement is that code/libraries be able to replicate the feed and discover the content type (and schema version) without necessarily understanding the schema itself. A related motivation is to discover related ("content") feeds in a protocol-agnostic manner, but this isn't a requirement.

Question: should this blob be strictly immutable? Being able to change some metadata might be nice (eg, paired feeds), but keeping it immutable is simple for, eg, hosting platforms and archives.

Option 1: protobuf message as special first entry in feed. This is basically what hyperdrive does currently to point from metadata to content feed, only we would want to use (extensible) protobuf instead of bare bytes. Could potentially select a small fixed number of fields for this protobuf schema (eg, "repeated bytes relatedFeeds", "optional string protobufSchema", "optional string contentType"; strings could be mimetype-like), which applications could extend.

Option 2: add a metadata/header blob out-of-band to hypercore feeds. @mafintosh mentioned a scheme where an immutable blob is transmitted during feed handshakes, and the hash of that blob is used as a key for internal hypercore hashing. Would be stored as a new stub file in SLEEP directories (like feed key is currently).

There are probably more options if we get creative!

mafintosh commented 6 years ago

Been thinking a bit about the handshake message. Instead of making it the first message, it could be powerful if you could update that message. Especially if we have a "relatedFeeds" scheme, since you most likely wanna update that over time (hyperdb does this a bunch!)

What about something like this

message Header {
  message Feed {
    required bytes key = 1;
  }
  required string protocolType = 1;
  optional uint64 version = 2; // defaults to version 0
  repeated Feed relatedFeeds = 3; // use a Feed message so users can extend it with other metadata
}

And then a convention that every message points back to the last header message

message Entry {
  optional uint64 headerSeq = 4;
}

Or some variation of this
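
The back-pointer convention above can be sketched roughly as follows. This is a toy Python model, not hypercore code: the feed is a plain list, entries are dicts, and `append_header`, `append_entry`, `latest_header_seq`, and `latest_header` are hypothetical names chosen for illustration.

```python
from typing import Optional

# Toy model (assumption): a feed is an append-only list of dicts.
# Header entries have kind "header"; every other entry stores headerSeq,
# the index of the most recent header at the time it was appended.


def latest_header_seq(feed: list) -> Optional[int]:
    """Find the latest header by inspecting only the last entry."""
    if not feed:
        return None
    last = feed[-1]
    if last["kind"] == "header":
        return len(feed) - 1
    return last["headerSeq"]


def append_header(feed: list, protocol_type: str, version: int = 0,
                  related_feeds=()) -> int:
    """Append a new header entry and return its sequence number."""
    feed.append({
        "kind": "header",
        "protocolType": protocol_type,
        "version": version,
        "relatedFeeds": list(related_feeds),
    })
    return len(feed) - 1


def append_entry(feed: list, value) -> int:
    """Append a data entry that points back to the latest header."""
    feed.append({
        "kind": "entry",
        "value": value,
        "headerSeq": latest_header_seq(feed),
    })
    return len(feed) - 1


def latest_header(feed: list) -> Optional[dict]:
    """Return the latest header entry, or None if there is none yet."""
    seq = latest_header_seq(feed)
    return None if seq is None else feed[seq]
```

In this model a reader that has only the newest entry can reach the current header in one extra lookup, without scanning the feed.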

mafintosh commented 6 years ago

@bnewbold expanding on the above... what if you could attach an immutable blob and a mutable sequence pointer, or simply just a mutable sequence pointer. Then the header would be stored in the feed but you'd keep the "latest header" pointer outside

mafintosh commented 6 years ago

Another idea.

Being able to attach a mutable blob that is history-less. Meaning that in the handshake (or somehow) the peers exchange (blob, blobSeq, signature).

blobSeq increments every time the owner updates the blob and is used to pick the newest one if the peers disagree on which one is correct. The owner also signs it.
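
A rough sketch of how a peer might pick between competing (blob, blobSeq, signature) triples. Several things here are assumptions, not part of the proposal: the names `SignedBlob`, `signed_payload`, and `pick_newest` are hypothetical, the signature is assumed to cover blobSeq as well as the blob, and `verify` is a caller-supplied function standing in for a real signature check against the feed owner's public key.

```python
from typing import Callable, Iterable, NamedTuple, Optional


class SignedBlob(NamedTuple):
    blob: bytes
    blob_seq: int
    signature: bytes


def signed_payload(blob: bytes, blob_seq: int) -> bytes:
    # Assumption: the owner signs blobSeq together with the blob, so an old
    # signature cannot be replayed under a higher sequence number.
    return blob_seq.to_bytes(8, "big") + blob


def pick_newest(candidates: Iterable[SignedBlob],
                verify: Callable[[bytes, bytes], bool]) -> Optional[SignedBlob]:
    """Return the validly signed candidate with the highest blob_seq."""
    best = None
    for candidate in candidates:
        payload = signed_payload(candidate.blob, candidate.blob_seq)
        if not verify(payload, candidate.signature):
            continue  # ignore anything the owner did not sign
        if best is None or candidate.blob_seq > best.blob_seq:
            best = candidate
    return best
```

Note that this check only chooses among the triples a peer has actually been offered; it cannot tell whether a newer one exists somewhere else.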

pfrazee commented 6 years ago

> Being able to attach a mutable blob that is history less. Meaning that in the handshake or somehow each peer exchange (blob, blobSeq, signature).

Sounds like a good solution to me

mafintosh commented 6 years ago

Been thinking about the security aspects of the mutable header. It becomes a bit tricky fast, imo. I want to avoid situations where peers can withhold the latest blob, which means we need to add it to the merkle tree, which in turn makes the no-history aspect of it hard (open for ideas here!).

Going back to pragmatism, what is it we want to get out of this? Originally my thought about the immutable header was that you'd include a string like this

hyperdb/v1

Or

content-feed

Ie. immutable descriptions of the data that let you pick the right strategy to parse the data.

The main thing gained from the mutable one would be if we could specify which feeds to crawl (makes archivers easier over time), assuming we spec out a required schema for the handshake on top. This of course could be a massive benefit as well. Unsure how to proceed, again open for input.

pfrazee commented 6 years ago

Could an immutable blob which identifies the data structure then use custom headers to identify additional feeds? Then that custom header would be part of the data structure and could be made mutable

mafintosh commented 6 years ago

@pfrazee yea that's what i've been thinking too. use the immutable string to pick the right strategy to crawl the feed (default to content-feed, which means no crawling).

pros

- easy to impl
- backwards compat (old cores would just have the header '')
- easy to review sec wise

cons

- means archivers need to know about datastructures to crawl them
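
A minimal sketch of that dispatch, under stated assumptions: the feed is modelled as a dict, the strategy functions and the `STRATEGIES` table are hypothetical, and treating an unrecognised header like a content feed (no crawling) is one possible policy rather than something decided in this thread.

```python
from typing import Callable, Dict, List

Feed = dict  # hypothetical stand-in for a replicated feed


def crawl_content_feed(feed: Feed) -> List[str]:
    """A plain content feed references no other feeds."""
    return []


def crawl_hyperdb_v1(feed: Feed) -> List[str]:
    # Placeholder: a real strategy would parse hyperdb entries to find the
    # feeds they reference. Here the keys are simply read from the toy dict.
    return list(feed.get("related", []))


STRATEGIES: Dict[str, Callable[[Feed], List[str]]] = {
    "content-feed": crawl_content_feed,
    "hyperdb/v1": crawl_hyperdb_v1,
}


def pick_strategy(header: str) -> Callable[[Feed], List[str]]:
    """Map the immutable header string to a crawl strategy.

    Old cores have the header '', which is treated as content-feed.
    Unknown headers also fall back to no crawling (an assumption).
    """
    return STRATEGIES.get(header or "content-feed", crawl_content_feed)
```

The fallback branch is where the listed con shows up: an archiver that does not know a data structure can still replicate the feed itself, but has no way to discover the feeds it points to.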

bnewbold commented 6 years ago

As per the original message, I think there are sort of two things going on here.

The first is for all clients/readers/infra/etc to be able to quickly get from a bare dat:// URI to knowing what "type" of content the feed has. The analogy to me is the Content-Type HTTP header, which comes before any content and can be fetched with a quick HEAD (don't need a full GET). In dat-land, the dat CLI should be able to see "this isn't a legacy hyperdrive or a hyperdb-style hyperdrive, so i'm just going to bail", or select the appropriate code path to continue with. My assumption is that this is immutable (tied to the feed as a whole at time of creation), but maybe i'm wrong.

The second is the ability to associate generic metadata with a feed, sort of a key/value sidecar to the feed contents proper, which might include related feeds or anything else. We sort of do this with hyperdrive-like feeds via dat.json, but it might be nice to have this for any feed.

I think the first is more urgently needed for hyperdb+hyperdrive roll out. I propose we focus on a solution to the first part, but not include any "related feed" functionality in it, because that is more "mutable". I think a mutable solution to the second bit would probably be good... but I also think more thinking is needed.

In either/any case, off the top of my head I think we should keep all such metadata "in band" in that the same hashing/merkle structure should cover the metadata as well as feed content, so we don't need to add additional verification complexity.

bnewbold commented 6 years ago

Just a ping that I think we want to make progress on this in the next week or so. What would be the best next step? A specific implementation proposal?

bnewbold commented 6 years ago

This announcement about git wire protocol v2 has some details about how they shoe-horned in a protocol version flag: https://opensource.googleblog.com/2018/05/introducing-git-protocol-version-2.html

(this message is really a poke at @mafintosh to write up what we discussed last week)

bnewbold commented 6 years ago

In hyperdb v3.0.0, @mafintosh added a minimal protocol header as the first hypercore entry, with protobuf schema (https://github.com/mafintosh/hyperdb/pull/121):

message Header {
  required string protocol = 1;
}

and hyperdb sets the protocol string to hyperdb for now. It's not clear to me yet what hyperdrive-on-hyperdb will do.
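
To illustrate how small this header is on the wire, here is a sketch that encodes and decodes just the `protocol` field by hand using the standard protobuf wire format. It assumes the first entry is the raw serialized `Header` message with no extra framing, which this thread does not confirm; the function names are hypothetical.

```python
from typing import Optional, Tuple


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a varint at pos; return (value, next position)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_header(protocol: str) -> bytes:
    data = protocol.encode("utf-8")
    # Tag 0x0a = field number 1, wire type 2 (length-delimited).
    return b"\x0a" + encode_varint(len(data)) + data


def decode_header_protocol(buf: bytes) -> Optional[str]:
    """Return field 1 as a string, skipping any fields we do not know."""
    pos, protocol = 0, None
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        field, wire_type = tag >> 3, tag & 7
        if wire_type == 0:            # varint
            _, pos = decode_varint(buf, pos)
        elif wire_type == 1:          # fixed 64-bit
            pos += 8
        elif wire_type == 2:          # length-delimited
            length, pos = decode_varint(buf, pos)
            if pos + length > len(buf):
                raise ValueError("truncated field")
            if field == 1:
                protocol = buf[pos:pos + length].decode("utf-8")
            pos += length
        elif wire_type == 5:          # fixed 32-bit
            pos += 4
        else:
            raise ValueError("unsupported wire type")
    return protocol
```

Because unknown fields are skipped rather than rejected, a reader written this way would keep working if the header message later gained more fields.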

Frando commented 5 years ago

I actually think that we should adhere to the layered nature of the hyper* tools, which would mean that it does not make sense to state that a hypercore is a hyperdrive, but only that a hypercore is a hyperdb, and then set on the hyperdb level that it's a hyperdrive.

So at the hypercore level: Header "hyperdbv1", or "hyperdbv1-content", because a hyperdb may also have a content feed which is a hypercore, but with a different data structure from a hyperdb.

And then, at the hyperdb level, I propose that we have a single special reserved key that stores some JSON to set more properties. So e.g. /:meta or similar. There, it would say {type: 'hyperdrive', version: 'v1' }. That meta key could also be the place to store mount information (should we decide to implement it at the hyperdb level) or different value encodings per prefix (should we decide to support sub-hyperdbs natively, to e.g. have a part of a hyperdb be a hyperdrive and another part a JSON key/value store).
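
The two-level lookup this implies could look roughly like the sketch below. It is illustrative only: the hyperdb is modelled as a plain dict of key to JSON string, `describe_feed` is a hypothetical name, and the meta value is written as strict JSON.

```python
import json


def describe_feed(core_header: str, db: dict) -> dict:
    """Resolve the container type first, then the application-level type.

    core_header is the hypercore-level string (e.g. "hyperdbv1").
    db is a toy stand-in for a hyperdb: a dict mapping keys to JSON strings.
    """
    info = {"container": core_header, "type": None, "version": None}

    # Only a hyperdb (not its content feed) carries the reserved meta key.
    is_hyperdb = (core_header.startswith("hyperdb")
                  and not core_header.endswith("-content"))
    if is_hyperdb:
        raw = db.get("/:meta")
        if raw is not None:
            meta = json.loads(raw)
            info["type"] = meta.get("type")
            info["version"] = meta.get("version")
    return info
```

A tool that understands hyperdb but not hyperdrive could still stop after the first step and inspect the feed as a generic key/value store.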

bnewbold commented 5 years ago

Hi @Frando! Thanks for the feedback.

We've gone back and forth on this a few times; i'm not sure all the history is in this issue thread. There are advantages to what i'd call the "recursive" approach you mention (typing at each layer of the stack): tooling can fall back to partial support of higher-level protocols (eg, inspect hyperdb even if hyperdrive isn't supported), etc. Some of the trade-offs that pushed me over into the single-top-level-string camp are:

- more complex recursive code is needed to determine the application-layer type (which many tools would want to display to the user, eg hashbase). Importantly, checking the type becomes a less-deterministic operation (with the single string case, it's just a single element to be synchronized; with hyperdb a recursive lookup needs to be done to discover the correct key/value pair)
- immutability of content type as a feature, not a bug
- backwards compatibility (huge!)
- don't want to burden every container format with needing to include "next" level type metadata. AKA, the application-agnostic /:meta key isn't very elegant to me, and potentially constrains use cases that would want to make every key/value semantically meaningful. Would this one value always be JSON or protobuf, regardless of the other value encodings? All the same debates we've had with this header decision, with each content data structure. Not insurmountable, but if we can keep it simpler that seems better.

In the end, this boat has basically sailed, in that DEP-0007 got published. We can leave this thread open a little longer if you have more comments, and then close.
