

Securing a Dat with an additional private key (password) #80

Open martinheidegger opened 6 years ago

martinheidegger commented 6 years ago

When implementing a backup/public storage (like hashbase or datbase) for DATs, that storage knows the content of the DAT. In my understanding, right now the only way to make sure the storage does not know what is inside the dat is to additionally encrypt the files, e.g. by packing the data into a password-protected .zip file. The problem with this approach is that it is not at all transparent. The sender needs to know and care about zipping, and so does the recipient. Both parties also need a compatible zip program installed (funny sidenote: Japanese users tend to send out Shift-JIS encoded zip files) and know how to use it. Aside from knowledge and installation issues, it's also a significant amount of overhead if you do that often, and it reduces the comfort of using dat.

I thought about implementing a transparent-ish wrapper on top of hyperdrive that, instead of writing directly to the stream, writes everything into a .dat-encrypt.zip file encrypted with a password, and that automatically decrypts a received DAT which contains only a .dat-encrypt.zip file.

This approach would be sound, but unfortunately DAT, as it is built right now, only lets you upload/download the entire zip in one run. That means any additional file would trigger a complete re-upload and re-download, consuming vast amounts of bandwidth 😟 and sacrificing a big part of the value of having DATs. Maybe that is necessary in order to ensure actual privacy of the content.

This all leaves me with a few questions:

joehand commented 6 years ago

May be of interest: https://github.com/jayrbolton/dat-wot

martinheidegger commented 6 years ago

@joehand Thank you for the hint, but from what I can tell dat-wot only manages who knows about which dat link and makes sure that every user gets to see only the dat links he/she is supposed to see. That is certainly a nice workflow and concept, but it doesn't make it possible for an intermediary to store/cache encrypted data.

creationix commented 6 years ago

How private do you want your data? For example, you could encrypt file contents, but not the file names or directory structure. Each file could be encrypted with a key derived by hashing a master key with its path (you don't want to use the same key for all files).
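Not part of any Dat API, just a minimal sketch of that idea (Node crypto, hypothetical helper names, assuming HMAC-SHA256 for key derivation and AES-256-GCM for the content):

```ts
// Sketch only: contents are encrypted, file names and directory structure are not.
// Every file gets its own key by hashing the master key with its path.
import { createHmac, createCipheriv, createDecipheriv, randomBytes } from 'crypto'

function deriveFileKey (masterKey: Buffer, filePath: string): Buffer {
  // HMAC the path with the master key -> 32-byte per-file key
  return createHmac('sha256', masterKey).update(filePath).digest()
}

function encryptFile (masterKey: Buffer, filePath: string, plaintext: Buffer): Buffer {
  const key = deriveFileKey(masterKey, filePath)
  const iv = randomBytes(12)
  const cipher = createCipheriv('aes-256-gcm', key, iv)
  const body = Buffer.concat([cipher.update(plaintext), cipher.final()])
  // Store iv + auth tag + ciphertext so each file decrypts on its own
  return Buffer.concat([iv, cipher.getAuthTag(), body])
}

function decryptFile (masterKey: Buffer, filePath: string, stored: Buffer): Buffer {
  const key = deriveFileKey(masterKey, filePath)
  const decipher = createDecipheriv('aes-256-gcm', key, stored.subarray(0, 12))
  decipher.setAuthTag(stored.subarray(12, 28))
  return Buffer.concat([decipher.update(stored.subarray(28)), decipher.final()])
}
```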

Another option is to store the data in some container that has multiple files. I've stored files in a git repo (blob and tree objects) which maps to a flat list of hashes. You only need to point to the root hash to read the tree. The real filenames are quite private since the tree objects are also encrypted. Here you could hash the master key with the content hash.
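A hedged sketch of that git-object variant, assuming objects are addressed by their plaintext hash and each object's key is derived from the master key plus that hash (names are made up, not an existing API):

```ts
// Sketch: content-addressed objects, a flat list of hashes like git's object store.
// The object key is derived from the master key and the plaintext's hash,
// so only holders of the master key can decrypt blob and tree objects.
import { createHash, createHmac, createCipheriv, randomBytes } from 'crypto'

function storeObject (masterKey: Buffer, content: Buffer): { name: string, stored: Buffer } {
  const contentHash = createHash('sha256').update(content).digest()
  const key = createHmac('sha256', masterKey).update(contentHash).digest()
  const iv = randomBytes(12)
  const cipher = createCipheriv('aes-256-gcm', key, iv)
  const body = Buffer.concat([cipher.update(content), cipher.final()])
  return {
    name: contentHash.toString('hex'),                      // object name in the flat list
    stored: Buffer.concat([iv, cipher.getAuthTag(), body])  // what the backup host sees
  }
}
```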

Another option is to store the files in a filesystem image (like ext4) and store the image in 4k blocks (named by index). You can use a stream cipher, since the blocks are ordered, if you plan on extracting them all at once, or hash the index with the master key if you want random access, e.g. using FUSE or something to mount the block device.
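For the block-image variant, a rough sketch assuming 4k blocks, per-block key/IV derived from the master key and the block index, and AES-CTR so blocks stay exactly 4k (caveat in the comments):

```ts
// Sketch: a filesystem image stored as fixed 4k blocks (named by index).
// Key and IV for each block come from hashing the master key with the index,
// so any block can be fetched and decrypted independently (random access).
import { createHmac, createCipheriv, createDecipheriv } from 'crypto'

function blockKeyAndIv (masterKey: Buffer, index: number): { key: Buffer, iv: Buffer } {
  const digest = createHmac('sha512', masterKey).update(String(index)).digest() // 64 bytes
  return { key: digest.subarray(0, 32), iv: digest.subarray(32, 48) }
}

// Caveat: with a deterministic key/iv, rewriting the same block index with new
// content would reuse the CTR keystream; real code would rotate or store nonces.
function encryptBlock (masterKey: Buffer, index: number, block: Buffer): Buffer {
  const { key, iv } = blockKeyAndIv(masterKey, index)
  const cipher = createCipheriv('aes-256-ctr', key, iv)   // keeps blocks exactly 4k
  return Buffer.concat([cipher.update(block), cipher.final()])
}

function decryptBlock (masterKey: Buffer, index: number, stored: Buffer): Buffer {
  const { key, iv } = blockKeyAndIv(masterKey, index)
  const decipher = createDecipheriv('aes-256-ctr', key, iv)
  return Buffer.concat([decipher.update(stored), decipher.final()])
}
```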

Also, depending on how custom you want to go, only hyperdrive needs to be used for the dat protocol to allow syncing data around. Hashbase can store any hyperdrive-based dataset; it just won't render as HTTP if it's not hyperdrive on top.

I've got lots of ideas, but need more information about your requirements.

martinheidegger commented 6 years ago

Oh, this is really inspiring!

The solution I am imagining right now, based on what you wrote, would be to have an encrypted file table .dat-encrypted which grows in 4k blocks (to avoid revealing the exact size of files in the dat). Every time files are written, their information is added to .dat-encrypted and the files themselves get encrypted into blocks 0, 1, 01, etc. Old blocks that don't show up in the current table can be deleted, and each block is encrypted. This way streaming, in a sense, could still work.
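To make that concrete, here is one possible reading of the layout, sketched with hypothetical names (nothing here is an existing Dat or hyperdrive API):

```ts
// Hypothetical layout: an encrypted table file (.dat-encrypted) maps paths to
// the fixed-size block files that hold their data; everything the host sees is
// padded to 4k multiples, so exact file sizes are not revealed.
const BLOCK_SIZE = 4096

interface FileEntry {
  path: string       // real name, only visible after decrypting the table
  blocks: number[]   // indices of the block files ("0", "1", ...) holding the data
  size: number       // plaintext length, used to strip the padding on read
}

interface EncryptedTable {
  version: number    // bumped on every change to the dat
  entries: FileEntry[]
}

// Split plaintext into 4k chunks, zero-padding the last one to a full block
function toBlocks (content: Buffer): Buffer[] {
  const blocks: Buffer[] = []
  for (let offset = 0; offset < content.length; offset += BLOCK_SIZE) {
    const chunk = Buffer.alloc(BLOCK_SIZE)
    content.copy(chunk, 0, offset, Math.min(offset + BLOCK_SIZE, content.length))
    blocks.push(chunk)
  }
  return blocks
}
```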

creationix commented 6 years ago

I don't quite understand what you're proposing, but it sounds like it might work.

creationix commented 6 years ago

I'm just going to link to this old experiment of mine to show some of what I meant https://github.com/creationix/test-workspace

martinheidegger commented 6 years ago

Let me try to rephrase: instead of writing into .dat-encrypted, data gets written into block files. Those blocks all have the same size and are encrypted, so unless you know the meta-information you can't figure out which files are where, but you can download parts. The meta-data gets encrypted as well and is written in the key. So: with every new version you need to download the meta-data, but then you can decide which parts to download.
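And the matching read path, as a hedged sketch reusing the EncryptedTable shape from the layout sketch above (fetchBlock and decrypt are placeholders, e.g. backed by hyperdrive and the block cipher sketched earlier):

```ts
// Hypothetical read path: fetch and decrypt the (small) table first, then
// request only the blocks backing the file you want.
async function readFile (
  fetchBlock: (index: number) => Promise<Buffer>,
  decrypt: (stored: Buffer) => Buffer,
  table: EncryptedTable,
  path: string
): Promise<Buffer> {
  const entry = table.entries.find(e => e.path === path)
  if (!entry) throw new Error(`not in table: ${path}`)
  const stored = await Promise.all(entry.blocks.map(fetchBlock))
  const plain = Buffer.concat(stored.map(decrypt))
  return plain.subarray(0, entry.size)   // drop the zero padding
}
```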