dat-ecosystem-archive / DEPs

Dat Enhancement Proposals. Contains all specs for the Dat protocol, including drafts. [ DEPRECATED - see https://github.com/hypercore-protocol/hypercore-proposals for similar functionality. More info on active projects and modules at https://dat-ecosystem.org/ ]
https://dat-ecosystem.github.io/DEPs

Zero-knowledge remote attestation (for sensitive institutional/archival dumps) #24

Open bnewbold opened 6 years ago

bnewbold commented 6 years ago

This may be most interesting to institutional/organizational/repository/archival folks, though it might also be of interest to anybody operating or relying on a hosted "pinning" service (like hashbase.io).

Yesterday in a conversation I was talking about how a network monitor might want to "trust but verify" the archival status of a Dat/hypercore peer by fetching random chunks of a feed to ensure that the peer actually has them. Some context/motivation for this use case:

- Imagine the USA EPA is hosting a large dataset in Dat. Their peer would advertise that they Have all the chunks of data; a skeptical peer (eg, a repository accreditation agency, journalists, etc) might want to verify that they actually do still have all the data on disk.
- An individual user might pay for really cheap cut-rate backups of all their cat photos; a nervous user might trust the host but want to verify that it does indeed still have all the data and that nothing has become corrupted, particularly in the moments before deleting their own local copies (eg, to make space on their laptop disk).
- A more malicious example: a well-funded actor trying to censor a public dat archive by creating thousands of fake/dummy peers claiming to have the full archive, but then stalling before actually returning results, so that most clients connect to these "sybil" nodes and time out over and over (and thus fail to sync the content).
- As a final corner case, consider a hospital storing sensitive private data and synchronizing it to other hospitals for backup (or when a patient moves). A third party might want to verify that the data is all there (and hasn't suffered, eg, bulk disk corruption, which a hospital might not even notice if it isn't continuously re-hashing its contents), but not want to actually transfer any of the data (because that would require HIPAA compliance in the USA). There are similar circumstances (financial data, personal private data, etc) where an observer might want to verify integrity without ever receiving the content.

These are not hypothetical concerns: accredited data repositories need to have a workflow in place for this sort of third-party verification, the LOCKSS network has an entire protocol for secure peer verification, and the IPFS Filecoin protocol depends on this kind of storage verification.

In all these cases, a naive mechanism to "check" the remote status would be to download random metadata and content chunks, rehash them, and verify the signatures. Because the user chooses the chunks at random, the remote can't fake the results; it would effectively need to retain the full content.
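
To make that naive spot-check concrete, here is a minimal TypeScript sketch. The `RemotePeer` interface and the `expectedHash` callback are hypothetical stand-ins (not real hypercore API), and SHA-256 stands in for whatever hashing the feed actually uses.

```typescript
import { createHash } from "crypto";

// Hypothetical interface: however the monitor asks a remote peer for raw chunk bytes.
interface RemotePeer {
  fetchChunk(index: number): Promise<Buffer>;
}

// Download a handful of randomly chosen chunks and compare their hashes
// against the hashes the monitor already knows from the feed's metadata.
async function spotCheck(
  peer: RemotePeer,
  feedLength: number,
  expectedHash: (index: number) => Buffer, // known-good chunk hash, from local metadata
  samples = 16
): Promise<boolean> {
  for (let i = 0; i < samples; i++) {
    // The monitor picks the indices, so the remote cannot predict which
    // chunks it will be asked for and must keep all of them to pass reliably.
    const index = Math.floor(Math.random() * feedLength);
    const chunk = await peer.fetchChunk(index);
    const digest = createHash("sha256").update(chunk).digest();
    if (!digest.equals(expectedHash(index))) {
      return false;
    }
  }
  return true;
}
```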

The person I was talking to reminded me that it's possible to do even better: the entire chunks don't need to be transferred over the network if some clever crypto is used. In a simple case, if the "monitoring" peer also has a complete copy of the content, it can generate a random number, use it as a "salt" when hashing a random local chunk (or even the entire content feed!), send the random number to the peer being tested so it can do the same operation, and then compare just the resulting hashes. In this case no actual content has to travel over the wire. I think there might be an even more clever way to do remote checks where the "monitoring" peer doesn't have a complete copy of the content locally (only metadata), but I would need to research this.
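
A rough sketch of that salted-hash challenge, with the same caveat that all names here are hypothetical and SHA-256 is only a placeholder: the monitor sends a fresh random salt plus a chunk index, both sides hash salt + chunk over their own local copy, and only the short digests are compared.

```typescript
import { createHash, randomBytes } from "crypto";

// Challenge sent by the monitoring peer; no chunk data crosses the wire.
interface Challenge {
  index: number; // which chunk the remote must prove it holds
  salt: Buffer;  // fresh random salt, so old answers can't be replayed
}

// Run independently by both sides over their own local copy of the chunk.
function proveChunk(salt: Buffer, chunk: Buffer): Buffer {
  return createHash("sha256").update(salt).update(chunk).digest();
}

// Monitor side: issue a challenge ...
function makeChallenge(feedLength: number): Challenge {
  return {
    index: Math.floor(Math.random() * feedLength),
    salt: randomBytes(32),
  };
}

// ... and check the remote's digest against one computed from the local copy.
function verifyResponse(challenge: Challenge, localChunk: Buffer, response: Buffer): boolean {
  return proveChunk(challenge.salt, localChunk).equals(response);
}
```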

This might be a useful extension to the hypercore protocol some day; it would probably take two new message types (one to send a verification request, one for the response). I (Bryan) don't have any plans to work on this in the near future, but wanted to put this down in words.
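
For illustration only, the two messages might look roughly like this; the names and fields below are guesses, not part of the existing hypercore wire protocol.

```typescript
// Hypothetical wire messages for a possession-verification extension.
interface VerifyRequest {
  id: number;    // correlates a request with its response
  index: number; // chunk the remote must prove it holds
  salt: Buffer;  // random salt for the salted-hash scheme sketched above
}

interface VerifyResponse {
  id: number;     // echoes VerifyRequest.id
  digest: Buffer; // hash(salt || chunk), or empty if the chunk is missing
}
```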

pfrazee commented 6 years ago

Solid writeup @bnewbold, thanks for doing it. My one question is, what sort of monster would cheap out on their backups of cat photos!?

emilbayes commented 6 years ago

I did play around with this idea too. Here are some notes: https://github.com/emilbayes/hypercore-proof-of-data

I have some code on my laptop too, but got bogged down by trying to make a hypercore extension

jedahan commented 5 years ago

@emilbayes

A peer may fetch the data as they are requested to solve a challenge. This may be possible to mitigate by setting a low timeout, but with the risk of failing a peer due to high latency

I actually see that as a decent thing: even if a peer has some data, what good is it if it's inaccessible? There is room for peers to decide for themselves what threshold they are happy with, and it also seems like a partial solution to the M/N problem.

okdistribute commented 4 years ago

Finally starting to work on this, is this something the archive would still want @bnewbold?

bnewbold commented 4 years ago

@okdistribute I don't think this is currently on anybody's radar here.