RangerMauve / dat-store

A CLI tool for backing up hyperdrive datasets
GNU Affero General Public License v3.0

Add content-type specifier for dat store #54

Open · martinheidegger opened this issue 3 years ago

martinheidegger commented 3 years ago

Looking at #46 and thinking of multi-hyperbee or cabal, it may be a good idea to specify the type of a given hyper URL (if it isn't specified outright in the URL):

$ dat-store add --type=multi-hyperbee hyper://abc..def
$ dat-store add --type=core hyper://abc..def
$ dat-store add hyper://url/abc..def

Could this be something worth working on?

cc @urbien @serapath @cblgh?

cblgh commented 3 years ago

i think it would be a non-trivial undertaking to add cabal support (dat-store would have to at the very least add multifeed?); it would be better to consider the perspective of other projects :)

RangerMauve commented 3 years ago

I was thinking that the type would be detected via the Header messages at core.get(0) of a hypercore. What sort of changes would different types give? Different behavior when we're downloading just the latest data?
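A rough sketch of what that detection could look like (my own illustration, not existing dat-store code; it assumes the promise-based hypercore v10 API, and the JSON header encoding is hypothetical, whereas a real header would more likely be a protobuf-style Header message):

```js
const Hypercore = require('hypercore')

// Hypothetical type detection: read block 0 and inspect a `type` field.
async function detectType (key) {
  const core = new Hypercore('./storage', key)
  await core.ready()
  const first = await core.get(0) // header block by convention
  const header = JSON.parse(first) // assumed encoding; real headers differ
  return header.type // e.g. 'hyperdrive', 'hyperbee', 'multi-hyperbee'
}
```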

Regarding multifeed, it'd be really hard to add because we'd need to have a separate swarm for it and a different way of doing storage. 😅 I'd like to stick to using vanilla hypercore-protocol without extra wrappers if possible.

urbien commented 3 years ago

Header metadata is a good way to deal with the type of a data store. We added support for it in multi-hyperbee. But note that Header metadata is glitchy: if I recall correctly, it changes the order of events when the store is initialized, so we had to work around some abnormal behavior. I talked to Maf about it, but I don't think he was aware of the problem at the time. It may have been fixed since.

serapath commented 3 years ago

> i think it would be a non-trivial undertaking to add cabal support (dat-store would have to at the very least add multifeed?); it would be better to consider the perspective of other projects :)

It might not be trivial, but I think cabal (and I assume peermaps too?) will use it, and I'd find it a bit sad if the ecosystem stayed incompatible and fell apart even further instead of reversing that somehow.

So I read through the cable protocol and opened an issue with additional questions. The list of goals looks like it's already supported by hypercore, but I assume I'm missing lots of important points. What was the main motivation for creating the new cable protocol?


Below I try to summarize my understanding of both protocols; a follow-up comment has some additional thoughts.

...I guess others know most of this much better than me, so I'd be very happy if you could correct me or add additional information so that I can learn :blush: (or suggest formatting improvements)


A. hypercore protocol messages

loosely based on hypercore source code and https://datprotocol.github.io/how-dat-works

  1. wireprotocol = message | wireprotocol
  2. message = len_of_rest + channel_and_type + body
  3. channel_and_type = channel_number + message_type
    • channel_number multiplexes multiple "dats" over one connection
    • message_type is one of the types listed below
  4. body = fieldtag + content
    • fieldtag = fieldnumber + fieldtype (expressed in a varint)
      • fieldnumber = e.g. 1=discoveryKey (32 bytes), 2=nonce (24 bytes), ...
      • fieldtype, e.g.
        • "0=varint": content = unsigned_integer
        • "2=length-prefixed": content = varint_length + <that many bytes>

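An example: a minimal sketch of framing and parsing one such message (my own illustration based on the layout above and the varint npm module, not code from the hypercore source):

```js
const varint = require('varint')

// message = [varint: length of rest][varint: channel << 4 | type][body]
function frameMessage (channel, type, body) {
  const channelAndType = Buffer.from(varint.encode((channel << 4) | type))
  const length = Buffer.from(varint.encode(channelAndType.length + body.length))
  return Buffer.concat([length, channelAndType, body])
}

// Decode one message from the front of a buffer.
function parseMessage (buf) {
  const restLength = varint.decode(buf, 0)
  let offset = varint.decode.bytes
  const end = offset + restLength
  const channelAndType = varint.decode(buf, offset)
  offset += varint.decode.bytes
  return {
    channel: channelAndType >> 4,   // which "dat" on this connection
    type: channelAndType & 0b1111,  // one of the message types listed below
    body: buf.slice(offset, end)    // protobuf-encoded fields
  }
}
```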

So given the feed message type, a "channel" [ch.localId] can be associated with a feed to then request chunk ranges for that feed later, right?
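As a sketch of that flow (using the hypercore-protocol module's public API as I understand it; the key is a placeholder, so treat this as illustrative only):

```js
const Protocol = require('hypercore-protocol')

const key = Buffer.alloc(32) // placeholder: the feed's 32-byte public key
const stream = new Protocol(true) // true = we are the initiator

// Opening a channel for a feed's key binds a local channel number to that
// feed; chunk requests for the feed then travel over that channel.
const channel = stream.open(key, {
  ondata (message) {
    console.log('got chunk', message.index)
  }
})

channel.request({ index: 0 }) // ask for chunk 0 of this feed
```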

listed types (from how-dat-works): 0=Feed, 1=Handshake, 2=Info, 3=Have, 4=Unhave, 5=Want, 6=Unwant, 7=Request, 8=Cancel, 9=Data


B. cable protocol messages

my open question issue: https://github.com/cabal-club/cable/issues/1

  1. wireprotocol = messageA | messageB | wireprotocol
  2. messageA = msg_len + msg_type + random_request_id + body (for request and response msgs)
    1. random_request_id = <...roll some dice...>
    2. msg_type = <one of the request/response types listed below>
    3. body = <fields specific to the msg_type>
  3. messageB = pubkey + signature + hash_link + post_type + timestamp (for post msgs)
    1. pubkey = the author's public key
    2. signature = signature over the rest of the post
    3. hash_link = link to the hash of a previous message
      • Most post types will link to the most recent post in a channel from any user (from their perspective), but self-actions such as naming or moderation will link to the most recent self-action.
    4. post_type = <one of the post types listed below>
    5. timestamp = <timestamp>
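To make the messageB layout concrete, here is a sketch of how a post could be verified (field widths are my assumptions: a 32-byte pubkey and a 64-byte ed25519 signature over the rest, the usual libsodium layout; check the cable spec before relying on this):

```js
const sodium = require('sodium-universal')

// post = pubkey(32) + signature(64) + hash_link + post_type + timestamp + ...
function verifyPost (post) {
  const pubkey = post.slice(0, 32)
  const signature = post.slice(32, 96)
  const signedPayload = post.slice(96) // everything after the signature
  return sodium.crypto_sign_verify_detached(signature, signedPayload, pubkey)
}
```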

listed types

  1. hash response (msg_type=0) sends a list of message hashes for a request_id
  2. data response (msg_type=1) sends messages for a request_id
  3. request by hash (msg_type=2) requests a list of messages by their hashes
  4. cancel request (msg_type=3) cancels any request_id
  5. request channel time range (msg_type=4) gets all message hashes in a time interval (+ max limit)
  6. request channel state (msg_type=5) gets or subscribes to state change messages (0+ PAST, 0+ LIVE)
  7. request channel list (msg_type=6) gets a list of all channels (=topics) from peers
  8. post/text (post_type=0) posts a text message (+ channel & timestamp & hashlink)
  9. post/delete (post_type=1) requests deletion of a previous message
  10. post/topic (post_type=3) sets a channel's topic
  11. post/join (post_type=4) joins a channel
  12. post/leave (post_type=5) leaves a channel
  13. post/info (post_type=2) posts an update to one's own key/value "store"
    • keys with defined special meaning:
      const state = {
        name,    // handle to use as a pseudonym
        blocks,  // json object mapping hex keys to flag objects { reason, timestamp }
        hides,   // json object mapping hex keys to flag objects { reason, timestamp }
        max_age  // string: maximum number of seconds to store posts
      }
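A sketch of how a receiver might fold post/info updates into a per-author view (a hypothetical helper of mine, not part of the spec; it assumes latest-timestamp-wins semantics):

```js
// Key/value state per author pubkey (hex encoded), latest write wins.
const authorInfo = new Map()

function applyInfoPost (pubkeyHex, key, value, timestamp) {
  const state = authorInfo.get(pubkeyHex) || { updatedAt: {} }
  // only apply if this post is newer than what we already have for the key
  if ((state.updatedAt[key] || 0) <= timestamp) {
    state[key] = value
    state.updatedAt[key] = timestamp
  }
  authorInfo.set(pubkeyHex, state)
}
```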
serapath commented 3 years ago

some of my observations and thoughts:

:sweat_smile: I probably miss the point; I don't understand how important the request_id-based lookup is, whether it would be incompatible with anything, or how e.g. i2p ties into all of this (or maybe they are the same). I probably lack a lot of context, but if someone could give me that context so I could learn, I'd be really happy :blush:


Ok, I imagine both would use hyperswarm and a swarm topic (alternatives might work too) to find other peers; once some peers are found for a given topic, the rest can start by exchanging messages using around a dozen different possible message types.
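For instance, peer discovery could look roughly like this (a sketch with the current hyperswarm API; deriving the topic from a community name is my placeholder assumption):

```js
const Hyperswarm = require('hyperswarm')
const crypto = require('crypto')

const swarm = new Hyperswarm()
// topics are 32 bytes; hashing a community name is just one possible scheme
const topic = crypto.createHash('sha256').update('some-community').digest()

swarm.on('connection', (socket) => {
  // hand the socket to hypercore-protocol or a cable session from here on
})
swarm.join(topic)
```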

My secret hope is that all messages sent by a single specific sender with a specific pubkey could still be stored in a hypercore, and that messages would make it possible to derive which hypercore they belong to. I'd at least like to discuss or see how far in that direction this can go; if there are certain reasons preventing it, I'd love to learn about them too.

hypercore:

cable:

hypercore vs. cable

  1. Is or could the pubkey of a peer included with each post maybe be a hypercore address of the sender?
  2. Then all messages of a sender could also have chunk indexes in that sender's hypercore (see the sketch after this list)
    • an implementation could help to look up hypercore indexes based on a message hash_link
    • the timestamp included in each message could be extended with the chunk index (vector clock?)
    • even the signature might be skipped to save space, using the sender's hypercore to merkle-verify instead?
    • the hash_link could also theoretically be replaced with a chunk index + the poster's hypercore address
    • => overall, the signature in each message would be replaced with 2 indexes:
      • one added next to timestamp, as an index into the sender's own hypercore
      • one added next to hash_link (with hash_link replaced by pubkey2, the hypercore address of the referenced message's sender), as an index into that hypercore
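A sketch of the lookup that list describes: addressing a post by (sender's hypercore key, chunk index) instead of by content hash (my own hypothetical illustration, assuming the promise-based hypercore API):

```js
const Hypercore = require('hypercore')

// Hypothetical: fetch a post by the sender's hypercore key plus chunk index.
// Hypercore verifies the block against the signed merkle tree on the way in,
// so a per-post signature could become redundant.
async function getPost (senderKey, index) {
  const core = new Hypercore('./storage', senderKey)
  await core.ready()
  return core.get(index)
}
```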

message size comparison

  1. hashlink or pubkey are both 32 bytes, right?
  2. two additional indexes add a bit of size, but less than the signature, which could be removed (see the back-of-envelope check after this list)
  3. peers around a topic and channel are supposed to have all the messages regardless of how they are stored
    • so an implementation could look them up quickly and send them out to the requester
  4. all self-actions could omit the entire pubkey2 and only use an index, because pubkey2 = pubkey
  5. also - in a request or response with a list of hashes, all hashes from the same sender could be replaced by a single pubkey address + indexes and/or index ranges to save additional space, or not?
  6. another thing is more a question about post/info messages where the value is very large - wouldn't it be better to chunk that up into multiple messages, like hyperdrive might do?
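A back-of-envelope check for point 2 (the 64-byte ed25519 signature size is standard; the varint index sizes are my assumptions):

```js
const signatureBytes = 64     // ed25519 detached signature
const maxVarintIndexBytes = 5 // a varint index stays under 5 bytes up to ~4 billion entries
const addedBytes = 2 * maxVarintIndexBytes

console.log(signatureBytes - addedBytes) // => 54 bytes saved per post, worst case
```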

perf comparison

  1. merkle verification plus a signature check is slower than a simple signature verification, though
  2. but multiple messages from the same sender could be batch merkle-verified to save signature verifications? (rough cost model below)
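A rough cost model for point 2 (the constants are illustrative assumptions, not measurements): one signature check on the merkle tree root gets amortized across the whole batch, leaving mostly hashing.

```js
const SIG_VERIFY_US = 50 // one ed25519 verification, microseconds (assumed)
const HASH_US = 1        // hashing one block for the merkle tree (assumed)

// One signature check on the tree root amortized over the batch,
// plus per-block hashing (tree-path hashes ignored for simplicity).
function perPostCost (batchSize) {
  return SIG_VERIFY_US / batchSize + HASH_US
}

console.log(perPostCost(1))   // ~51 µs per post, verified one at a time
console.log(perPostCost(100)) // ~1.5 µs per post in a batch of 100
```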