cloudflare / utahfs

UtahFS is an encrypted storage system that provides a user-friendly FUSE drive backed by cloud storage.
BSD 3-Clause "New" or "Revised" License
815 stars, 49 forks

Deduplication #11

Open ncruces opened 4 years ago

ncruces commented 4 years ago

Given the intended security properties, how do you feel about data deduplication?

Would the ORAM implementation help with the information leakage inherent to convergent encryption?

Bren2010 commented 4 years ago

I think that data de-duplication between many users that are unaware of each other is probably too severe an amount of leakage, just on principle.

But also I don't think it's generally possible to combine ORAM with storage-provider deduplication in the way you're thinking, primarily because: ORAM relies on being able to read a block of memory and then re-write the same thing, such that the provider can't tell that the same content was just written back. This is done by re-encrypting with a new random nonce.
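To illustrate the re-encryption point, here is a minimal sketch of writing the same block back under a fresh random nonce. The cipher is a toy keystream built from SHAKE-256, used only so the example runs on the standard library; it is an assumption for illustration, not the cipher UtahFS uses, and `encrypt_block`/`decrypt_block` are hypothetical names.

```python
import hashlib
import os

NONCE_SIZE = 16


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # Toy keystream (assumption): SHAKE-256 over key || nonce.
    # A real system would use a vetted AEAD instead.
    return hashlib.shake_256(key + nonce).digest(length)


def encrypt_block(key: bytes, plaintext: bytes) -> bytes:
    # Every write draws a new random nonce, so the stored bytes differ
    # even when the plaintext is unchanged.
    nonce = os.urandom(NONCE_SIZE)
    stream = keystream(key, nonce, len(plaintext))
    return nonce + bytes(p ^ s for p, s in zip(plaintext, stream))


def decrypt_block(key: bytes, blob: bytes) -> bytes:
    nonce, body = blob[:NONCE_SIZE], blob[NONCE_SIZE:]
    stream = keystream(key, nonce, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))


key = os.urandom(32)
block = b"same content every time"

first_write = encrypt_block(key, block)
# Read the block, then write the same content back.
rewrite = encrypt_block(key, decrypt_block(key, first_write))
```

The provider sees two unrelated-looking blobs of equal length, which is also why it has nothing to match on for deduplication.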

So I think the actual solution to de-duplication (within a single UtahFS archive) wouldn't be cryptographic at all. You would add a content-based addressing layer "on top" of UtahFS. Which would be cool, but it doesn't increase or decrease the value proposition of ORAM.
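A content-based addressing layer of this kind could look roughly like the sketch below. It is only an illustration: `DedupStore` and its methods are hypothetical, the in-memory dictionaries stand in for files inside a mounted UtahFS volume, and fixed-size chunking with SHA-256 is an assumption.

```python
import hashlib


class DedupStore:
    """Hypothetical layer that stores each distinct chunk once, keyed by hash."""

    def __init__(self, block_size: int = 4096):
        self.block_size = block_size
        self.blocks = {}  # digest -> chunk bytes (would live inside the volume)
        self.files = {}   # file name -> ordered list of chunk digests

    def put(self, name: str, data: bytes) -> None:
        digests = []
        for offset in range(0, len(data), self.block_size):
            chunk = data[offset:offset + self.block_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # Only the first copy of a chunk is stored; repeats just reference it.
            self.blocks.setdefault(digest, chunk)
            digests.append(digest)
        self.files[name] = digests

    def get(self, name: str) -> bytes:
        return b"".join(self.blocks[d] for d in self.files[name])

    def stored_bytes(self) -> int:
        return sum(len(chunk) for chunk in self.blocks.values())


store = DedupStore(block_size=4)
store.put("a", b"AAAABBBBAAAA")  # chunk "AAAA" repeats within the file
store.put("b", b"BBBBCCCC")      # chunk "BBBB" repeats across files
```

Because the deduplication happens on plaintext before anything reaches the encrypted volume, the encryption and ORAM layers underneath are unchanged.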

agowa commented 4 years ago

I think the question was about deduplication within UtahFS. That would leak to the underlying storage the fact that no block is duplicated. Even with ORAM, an attacker would know that, because of the deduplication, each block occurs only once on the backend. So an attacker who can observe the backend and also supply plaintext input data, but cannot see the unencrypted content, could gain an advantage by guessing files that might exist and uploading them. If a guessed file already exists within the UtahFS mount, no new data is written to the backend; even if ORAM is enabled, the total allocated size at the backend does not change, and that leaks the fact.

Despite that, I think this feature would be useful, especially for those who are not using ORAM, to save money on the S3 backend.

ncruces commented 4 years ago

If by storage-provider deduplication you mean between many users that are unaware of each other, that was not my intention. But that's my mistake: I led the conversation there by mentioning convergent encryption, which on second thought doesn't make any sense here. Sorry for adding noise.

I meant a single user who stores heavily duplicated data in their own volume. Which, as you've said, could maybe be implemented "on top of" UtahFS?

It seems to me that if it is possible to marry deduplication in that scenario with the current properties of UtahFS, ORAM will be helpful at further hiding the redundancies in the data. No?

Again, sorry for wasting bandwidth if this is just a stupid question. Feel free to close!

agowa commented 4 years ago

ORAM will hide "what you store", although I think that if UtahFS gets a higher adoption rate it will be quite obvious that you use it, as it might be the only thing that uses ORAM... Then, if deduplication is provided as a feature, it is a very good guess that most people are using it to reduce costs on their S3 buckets. That gives an attacker high certainty about this, and you end up with the scenario I mentioned earlier.

But at least for me that would be acceptable for storing backups, since I won't be using ORAM anyway, as that would get really expensive really fast.

ncruces commented 4 years ago

The point is not necessarily to hide that you're using UtahFS (is that even a goal for the project?), it's hiding what kind of data you're storing in your volume. Simply encrypting content and/or metadata is often not enough.

You can encrypt those and still leak whether you're backing up photos or movies, vs. PDFs, logs, git repos, VM snapshots, etc. Access patterns will be very different for each of those, depending on the size of files, whether it is write once read many, write once read never, write many, etc. ORAM is supposed to help here. My question is if ORAM might also help hide access patterns related to deduplication of heavily deduplicated data.

agowa commented 4 years ago

> ORAM is supposed to help here. My question is if ORAM might also help hide access patterns related to deduplication of heavily deduplicated data.

This was exactly what I was trying to say. It will hide the data, but if the feature exists and the attacker knows that you're using UtahFS, it is a very good guess that you use deduplication. By its nature, deduplication could then cause problems with the encryption, because it becomes reasonable to assume that there will never be two identical blocks.

With ORAM disabled, it will have the same consequences as any deterministic encryption scheme.

With ORAM enabled it is slightly better, as it rules out most known-plaintext attacks, with one notable exception: the total size of the bucket will not change if the attacker-provided plaintext is already in the bucket. So if an attacker can provide plaintext and monitor the total bucket size, it will inherently leak the fact that the file exists (assuming an only-growing bucket).
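The size side channel can be sketched with a toy model. Everything here is hypothetical: `DedupVolume` is not UtahFS code, it deduplicates whole blocks by SHA-256, and the only thing the attacker is assumed to see is the number of blocks allocated at the backend.

```python
import hashlib


class DedupVolume:
    """Toy append-only volume that deduplicates identical blocks."""

    def __init__(self):
        self._seen = set()       # digests of blocks already stored
        self.backend_blocks = 0  # the only value the attacker can observe

    def write(self, block: bytes) -> None:
        digest = hashlib.sha256(block).digest()
        if digest not in self._seen:
            self._seen.add(digest)
            self.backend_blocks += 1  # the bucket grows only for new content


def probe(volume: DedupVolume, guess: bytes) -> bool:
    # The attacker submits a guess and checks whether the bucket grew.
    before = volume.backend_blocks
    volume.write(guess)
    return volume.backend_blocks == before  # True means "already stored"


volume = DedupVolume()
volume.write(b"victim-secret-document")

hit = probe(volume, b"victim-secret-document")  # size unchanged
miss = probe(volume, b"something-else")         # size grew
```

Note that the model never looks at ciphertext or access patterns, so hiding those does not close this particular channel.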

This could in turn be mitigated if delete operations also happen (e.g. archive mode is not used), because then there will be unused blocks that UtahFS can reuse, so the bucket does not always grow when a new file is added. But an attacker who can provide plaintext and observe the bucket size at the same time could simply start by filling up all such blocks: write known (random) data (excluding the data/pattern to be checked, of course) to the bucket until it starts to grow, and then inject the files/blocks to be "checked for their existence".
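A toy model of the free-block mitigation and the priming step that defeats it might look like this. `ReusingVolume`, `prime`, and `probe` are hypothetical names, and the model assumes new content consumes a free slot before the bucket grows.

```python
import hashlib
import os


class ReusingVolume:
    """Toy deduplicating volume that reuses freed slots before growing."""

    def __init__(self, free_slots: int):
        self._seen = set()
        self._free = free_slots           # slots left behind by deleted data
        self.backend_blocks = free_slots  # allocated size seen by the attacker

    def write(self, block: bytes) -> None:
        digest = hashlib.sha256(block).digest()
        if digest in self._seen:
            return                        # deduplicated, nothing changes
        self._seen.add(digest)
        if self._free > 0:
            self._free -= 1               # reuse a free slot, size unchanged
        else:
            self.backend_blocks += 1      # no free slot left, bucket grows


def probe(volume: ReusingVolume, guess: bytes) -> bool:
    before = volume.backend_blocks
    volume.write(guess)
    return volume.backend_blocks == before


def prime(volume: ReusingVolume) -> int:
    # Write random filler until the bucket grows, which means the free
    # slots are exhausted. Returns the number of filler writes.
    start, writes = volume.backend_blocks, 0
    while volume.backend_blocks == start:
        volume.write(os.urandom(32))
        writes += 1
    return writes


volume = ReusingVolume(free_slots=3)
volume.write(b"victim-secret-document")   # uses one free slot

noisy = probe(volume, b"absent-1")        # absent, yet size is unchanged
filler_writes = prime(volume)             # exhaust the remaining free slot
hit = probe(volume, b"victim-secret-document")
miss = probe(volume, b"absent-2")
```

Before priming, an absent guess looks the same as a present one; after priming, the probe is as reliable as in the append-only case.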

ncruces commented 4 years ago

Either I'm missing something, or we're digressing.

If it's a single user volume, and deduplication is only applied to that single user's data within that encrypted volume, how would an attacker provide plaintext that is already within the bucket?

I know I'm entirely to blame for the confusion by mentioning convergent encryption. That makes zero sense here. But I'm definitely not interested in helping the storage provider do deduplication between mine and others' data (and leaking information that way).

I'm interested in saving myself money when I store heavily duplicated data, and still hiding the fact that that is (or isn't) the kind of data I'm handling.

agowa commented 4 years ago

I think we're clearly talking past each other. I'll try one last time.

> If it's a single user volume, and deduplication is only applied to that single user's data within that encrypted volume, how would an attacker provide plaintext that is already within the bucket?

Well, just because it is a single-user bucket does not mean that an attacker could not potentially provide plaintext, e.g. if you use it as the storage backend for your web application, or similar.

If you use it solely as a "Dropbox" replacement, then that threat is out of scope for you.

> I'm interested in saving myself money when I store heavily duplicated data, and still hiding the fact that that is (or isn't) the kind of data I'm handling.

As long as you do not have attacker-provided data and also do not use archive mode, I do not see a big problem.

ncruces commented 4 years ago

I see. Finally got it, thanks for taking the time to explain it. I wouldn't use this as a storage back end for user provided data but that's a good point.

In that case, I don't think it's common that a single user is (supposed to be) able to observe the total size of the bucket. So, as you've said, ORAM helps. One important caveat is the case where the storage provider is (or colludes with) the active attacker.

agowa commented 4 years ago

> I wouldn't use this as a storage back end for user provided data but that's a good point.

You won't, but others might.

> In that case, I don't think it's common that a single user is (supposed to be) able to observe the total size of the bucket.

Open Grafana or Prometheus instances are a thing (sometimes intentionally, like at https://freifunk.fail/ ), so your storage provider does not necessarily need to be involved...