decentralized-identity / edv-spec

Encrypted Data Vault Spec
https://identity.foundation/edv-spec
Apache License 2.0
13 stars 5 forks

How does compression of a document play into chunking of files? #55

Open kdenhartog opened 4 years ago

kdenhartog commented 4 years ago

Since encryption effectively turns data into a high-entropy string (here's a good link on the topic), it's generally not recommended to compress encrypted data. So how does compression of plaintext objects play into the strategies around chunking files and choosing their sizes? @dlongley or @csuwildcat I figure you may have an opinion or have thought about this already.
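A quick sketch of that entropy point, using Python's stdlib `zlib`, with `os.urandom` standing in for ciphertext (a good cipher's output is indistinguishable from random bytes):

```python
import os
import zlib

# Repetitive "plaintext" compresses very well.
plaintext = b"the quick brown fox jumps over the lazy dog " * 500
compressed_plain = zlib.compress(plaintext)

# Random bytes stand in for ciphertext: compression gains nothing,
# and the zlib framing can even make the result slightly larger.
ciphertext_like = os.urandom(len(plaintext))
compressed_random = zlib.compress(ciphertext_like)
```

Hence the ordering matters: compress first, while the data still has structure, then encrypt.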

agropper commented 4 years ago

Best practice would be to do the compression before the encryption.

kdenhartog commented 4 years ago

Yes, I understand that part. I'm less certain of how compression would affect parallelization and the benefits gained when chunking files rather than encrypting a file as a whole.

dlongley commented 4 years ago

You must compress before encrypting. If you could compress encrypted data then it wouldn't be very good encryption (it should look random)! That being said, we have an open issue on our initial implementation of an EDV client here: https://github.com/digitalbazaar/edv-client/issues/47 -- where the idea was to enable optionally specifying whether or not data/chunks had been gzipped so that they could be decompressed once decrypted.

dlongley commented 4 years ago

@kdenhartog,

I'm less certain of how compression would affect parallelization and the benefits gained when chunking files rather than encrypting a file as a whole.

Well, generalized compression works over a sliding data window anyway -- so the effect on efficiency depends on the chunk size and the data type. I think gzip uses a 32KB window, and chunks will likely be at least that size, so I wouldn't worry too much about efficiency issues from that perspective. I wouldn't expect parallelization to be adversely affected either. The pipeline for writing chunks would be:

  1. Read up to chunk size cleartext into a chunk to be encrypted.
  2. Optionally gzip chunk. (new)
  3. Encrypt the chunk.
  4. Store the chunk.

That could be parallelized -- both when creating the chunks and when fetching and decrypting/decompressing them.
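The four-step pipeline could be sketched like this (a minimal illustration using Python's stdlib `zlib`; the `encrypt`/`decrypt` callables are placeholders for a real AEAD cipher such as the JWE encryption an EDV client would perform):

```python
import io
import zlib

CHUNK_SIZE = 32 * 1024  # deflate's window is 32 KiB, so chunks at least
                        # this size lose little compression ratio

def write_chunks(stream, encrypt, compress=True):
    """Sketch of the write pipeline: read -> optionally compress -> encrypt.

    `encrypt` is a placeholder for a real cipher; it is a parameter
    here only so the sketch runs without a crypto dependency.
    """
    chunks = []
    while True:
        cleartext = stream.read(CHUNK_SIZE)       # 1. read up to chunk size
        if not cleartext:
            break
        if compress:
            cleartext = zlib.compress(cleartext)  # 2. optionally gzip chunk
        chunks.append(encrypt(cleartext))         # 3. encrypt the chunk
    return chunks                                 # 4. store the chunks

def read_chunks(chunks, decrypt, compressed=True):
    # Each chunk decrypts and decompresses independently, so both this
    # loop and the writer above can be parallelized per chunk.
    parts = []
    for chunk in chunks:
        data = decrypt(chunk)
        parts.append(zlib.decompress(data) if compressed else data)
    return b"".join(parts)

# Usage with an identity "cipher" (illustration only -- not encryption):
data = b"abcdefgh" * 20000
identity = lambda b: b
stored = write_chunks(io.BytesIO(data), encrypt=identity)
restored = read_chunks(stored, decrypt=identity)
```

Because each chunk is compressed with its own independent deflate stream, no chunk depends on its neighbors for decompression.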

kdenhartog commented 4 years ago

Ok, I didn't know that windowing occurred when compressing data. I'll look into it a bit more so I can be more informed on the topic. In the case where I'm compressing a movie file, I assume that because the compression is chunked too, I could receive only parts of the movie, decrypt, then decompress, and stream it without needing the entire file to watch it. Would that be a high-level view of how streaming a movie might work out of this?

dlongley commented 4 years ago

In the case where I'm compressing a movie file, I assume that because the compression is chunked too, I could receive only parts of the movie, decrypt, then decompress, and stream it without needing the entire file to watch it. Would that be a high-level view of how streaming a movie might work out of this?

Yes, that's what I'd expect.
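A sketch of that random-access idea, assuming each fixed-size cleartext chunk is compressed independently (stdlib `zlib` only; a real EDV flow would insert a decrypt step before the decompress):

```python
import zlib

CHUNK_SIZE = 32 * 1024

# Compress each fixed-size cleartext chunk independently; any chunk can
# then be fetched and decompressed on its own, which is what makes
# seeking into the middle of a stream possible.
movie = bytes(range(256)) * 1000  # stand-in for a media file (256,000 bytes)
chunks = [
    zlib.compress(movie[i:i + CHUNK_SIZE])
    for i in range(0, len(movie), CHUNK_SIZE)
]

# "Seek" to the third chunk: decompress it without touching the others.
offset = 2 * CHUNK_SIZE
segment = zlib.decompress(chunks[2])
```

The trade-off is that per-chunk compression ratios are slightly worse than compressing the whole file as one stream, since the dictionary resets at every chunk boundary.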

agropper commented 4 years ago

Movies are not personal data. Making extra copies into various stores might make sense. But that's not what SSI is about.

Much of my career has been in telemedicine and teleradiology, including radiology image compression for commercial streaming. The use-cases we might consider in the SDS context would look like one patient's MRI, a data stream from a policeman's wearable cam, or your self-driving car.


dlongley commented 4 years ago

Movies are not personal data.

-1 Certainly home movies are personal data. Regardless, EDV/SDS is a general storage solution; it is not for "personal data only". Furthermore, there will always be disagreements over whether content of a particular sort is of one nature or another, so that can't reasonably be a core technology level concern. These concerns are important at higher layers.

csuwildcat commented 4 years ago

All data relative to a subject is personal data, and behavioral data like what movies you watch, what groceries you buy, and what hobbies you participate in is even more privacy-invasive and revealing than the few types of institutional data traditionally thought of as 'identity data'. This datastore needs to support all types of data, so that it can, if the user/controller desires, operate as a datastore for both 'traditional identity data' and all the other types that tend to fall under the current categories of personal application and services data.

OR13 commented 4 years ago

Perhaps we can add a note about compression to the sections regarding streams: https://identity.foundation/secure-data-store/#creating-a-stream

I would love to see some code that supported compressing / encrypting video, so far, this is the largest healthcare sample video I can find: https://pixabay.com/videos/hands-soap-virus-coronavirus-34325/

OR13 commented 4 years ago

I think we are ready for a PR which defines the relationship between compression and encryption, maybe with a note about how valuable compression is for large files / video.

kdenhartog commented 4 years ago

I still remain on the hook for this one. Hopefully I'll get to it soon and it doesn't keep getting pushed to the bottom of my ever growing and shrinking list of things to do.

tplooker commented 3 years ago

Discussed on the 24th of June WG call; the suggestion was to resolve the issue by adding a note to the spec documenting that the interaction between the encryption algorithm and file compression for storage is up to the storage provider.