dat-ecosystem-archive / datproject-discussions

a repo for discussions and other non-code organizing stuff [ DEPRECATED - More info on active projects and modules at https://dat-ecosystem.org/ ]

Distributed replication status #81

Open martinheidegger opened 6 years ago

martinheidegger commented 6 years ago

From a user perspective of DAT, one problematic thing at the moment is that replication does not feel certain/stable. The green bar shown in DAT-Desktop only signals whether the dat was replicated once and how many peers are listening. It doesn't show which nodes are connected (particularly a backup node) or whether all of the peers replicated all data of all versions. It also doesn't show whether second-level or higher peers (nodes connected only to a peer, not to the original client) have replicated data, which might very well happen if a node is disconnected for a while.

One approach I could think of:

  1. When a client seeds and/or receives a dat for the first time, it creates a new dat that is used to transmit the distribution status.
  2. The client uses a message protocol (#62) to tell its peers the dat-key it uses to store its progress.
  3. On download, it stores the bitfield of the downloaded data in the specified dat.
  4. Once another peer tells the client about that other peer's progress-DAT, the client notes that peer in its own distribution status.
  5. The client also replicates the other peers' status and acts as a seed.
  6. (optional) The client can also add other data (such as labels or identification) to the progress-DAT to identify itself.
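Steps 2 and 3 above could be sketched roughly as follows. This is only an illustration of the proposed status record, not an existing dat API: `makeStatusEntry` and the entry fields are hypothetical names.

```javascript
// Hypothetical sketch of steps 2-3: a peer records which blocks it holds
// as a bitfield and appends a status entry to its own progress-DAT.
// None of these names are real dat APIs; they are assumptions for illustration.

// Encode the set of held block indexes as a plain bitfield (one bit per block).
function encodeBitfield (haveBlocks, totalBlocks) {
  const bytes = Buffer.alloc(Math.ceil(totalBlocks / 8))
  for (const index of haveBlocks) {
    bytes[index >> 3] |= 0x80 >> (index & 7)
  }
  return bytes
}

// The status record a peer would append to its progress-DAT and whose
// key it would announce to peers via the message protocol (#62).
function makeStatusEntry (progressKey, haveBlocks, totalBlocks, version) {
  return {
    type: 'progress-status',
    key: progressKey,   // dat-key of this peer's progress-DAT
    version: version,   // dat version the bitfield refers to
    bitfield: encodeBitfield(haveBlocks, totalBlocks).toString('hex')
  }
}
```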

The problem with this approach is that every client would need to connect to and store every other client's data, which, even assuming sparse downloads, could mean an explosion in dat connections and data downloaded. Also: malicious clients could store malicious (i.e. very big) data in the progress-dat, which could probably be prevented with a size limit?!
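To put a number on that explosion: if each of N peers replicates every other peer's progress-DAT, the number of progress feeds held across the swarm grows quadratically.

```javascript
// Total progress feeds replicated across the swarm in the naive approach:
// each of N peers holds the progress-DAT of each of the other N - 1 peers.
function progressFeedCount (peers) {
  return peers * (peers - 1)
}
```

So 10 peers already mean 90 replicated progress feeds, and 100 peers mean 9900.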

Now, with that in mind: How would you solve that? Would this make sense at a lower level?

okdistribute commented 6 years ago

We accomplished something like this with https://github.com/karissa/hyperhealth . Everything you need is in hypercore already

okdistribute commented 6 years ago

Found the mafintosh ui for this: https://github.com/mafintosh/hypercore-stats-ui

okdistribute commented 6 years ago

and the @joehand one: https://github.com/joehand/dat-stats

martinheidegger commented 6 years ago

@karissa Thank you for your input. I have read through most of it, but to my knowledge none of those actually address the issue I am trying to fix:

> something like this with https://github.com/karissa/hyperhealth

It does accumulate the health of the dat, but only for the connected nodes, not for disconnected, former nodes. I.e. you don't get to see the health of a DAT once you've lost the connection.

> the @joehand one: https://github.com/joehand/dat-stats

Doesn't that just expose your other package that I didn't know about? https://github.com/karissa/hypercore-stats-server

It looks like both of those are also using quite old versions of dat.

> ui for this: https://github.com/mafintosh/hypercore-stats-ui

This is a beautiful UI, indeed I haven't seen it before, and while it is totally awesome and I really need to read its source code, I don't see how (as mentioned above) it would be able to show the actual replication state, rather than the "currently connected" replication state.

joehand commented 6 years ago

@martinheidegger we are trying to work through some of these issues in the CLI too, but we are focused more on the uploading side (trying to answer: does my server have all the data, and can I close this process / shut down this computer?).

To make sure I understand, it seems to me there are a few discrete pieces to solving the download side of things:

Somewhat along these lines, https://github.com/joehand/dat-push is probably the closest work I've done on this. The hardest part here is figuring out what "done" means when pushing. For example, a user may be replicating sparsely: I may have pushed just 5 blocks, but that is all the user wants, so I should say the push is "done". But that is hard to tease out =).
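That notion of "done" under sparse replication can be sketched as: a push is done when every block a remote asked for has been sent, even if that is only a subset of the dat. This is an illustration of the idea, not dat-push's actual logic; `pushDone`, `wanted` and `sent` are assumed names.

```javascript
// Hypothetical "done" check for a sparse push: wanted and sent are sets
// of block indexes. Done means the remote's want-set is fully covered,
// regardless of how many blocks the dat contains in total.
function pushDone (wanted, sent) {
  for (const index of wanted) {
    if (!sent.has(index)) return false
  }
  return true
}
```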

martinheidegger commented 6 years ago

trying to answer, does my server have all the data and can I close this process/close computer

This is a tricky question, because in the CLI you don't know what the user intended to upload. Maybe she only wanted to provide the dat so some other client can get some sparse data? Maybe she wanted two peers to take some of the data of the current version? Maybe she wanted to trigger a backup process? Maybe a separate command would be a good idea:

$ dat push-backup

Then the CLI could seed until one connected client has the whole bitfield of the version replicated locally. Establishing that is different from establishing whether the bitfield was ever replicated somewhere.

One bit can have multiple states:

Both for upload and download it seems like it is necessary to know this for the entire bitfield in order to decide what to upload and what to download.
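One way such per-block states could be tracked: a block may be held locally, confirmed on a currently connected peer, only reported through some peer's progress-DAT, or not known to exist anywhere. These four labels are my assumption of what the states could look like, not an existing dat mechanism.

```javascript
// Hypothetical per-block replication state. local, connected and reported
// are sets of block indexes: blocks we hold, blocks confirmed on currently
// connected peers, and blocks only claimed via progress-DAT reports.
function blockState (index, local, connected, reported) {
  if (local.has(index)) return 'local'
  if (connected.has(index)) return 'connected-peer'
  if (reported.has(index)) return 'reported-only'
  return 'missing'
}
```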

> maybe I can tell by if they have 100% of the blocks)

100% of the blocks of the current version, right? What I am trying to get at is: different cloud-types may have different versions. Our download-peer is looking for the last-known fully-replicated version in the swarm, as one peer could have 50% of the latest version while another has 100% of the version before that; in that case it should download 100% of the former version.
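The selection rule in the last paragraph could be sketched like this: among the versions peers report, pick the newest one that some single peer holds completely, rather than a newer but only partially replicated one. The `{ version, fraction }` report shape is an assumption for illustration.

```javascript
// Hypothetical selection of the last-known fully-replicated version.
// reports is a list of { version, fraction } entries, one per peer,
// where fraction is how much of that version the peer holds (0..1).
function lastFullyReplicatedVersion (reports) {
  let best = -1
  for (const r of reports) {
    if (r.fraction === 1 && r.version > best) best = r.version
  }
  return best // -1 when no peer holds any version completely
}
```

With one peer at 50% of version 5 and another at 100% of version 4, this picks version 4.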