cfrg / draft-irtf-cfrg-vdaf

VDAF specification

Clarify that Poplar-like protocols may leak sensitive information if the collector misbehaves #316

Closed: simon-friedberger closed this issue 4 months ago

simon-friedberger commented 9 months ago

The collector decides which parts of the IDPF tree to explore. This not only leaks prefixes of the heavy hitter values but, if the collector decides to explore other branches, possibly the values of individual clients.

This is pointed out in BBCGGI21:

A malicious server’s only strategy to learn extra information in Protocol 5 is to manipulate answers to the prefix-count oracle queries using an “additive attack.” [...] Intuitively: the adversary can essentially control which strings are heavy hitters (and can thus learn how many honest clients hold strings in a small set), [...]

In the current DAP implementation this attack is trivial for the collector to execute, since it even sees the values in the clear. This risk should be clearly pointed out in the standard.

simon-friedberger commented 9 months ago

While this clarifies the threat in the text, it leaves a problematic situation: there is no clear argument for why anybody should use Poplar instead of a much cheaper shuffling-based solution.

cjpatton commented 9 months ago

Thanks for flagging, Simon!

divergentdave commented 9 months ago

I think Poplar1 still has a good case if it is deployed with differential privacy noise added at the aggregators. This would have minimal impact on the utility of the heavy hitters results, while protecting the privacy of rare measurements.

My initial hunch is that we'd want to sample DP noise from either the exponential or Laplace distribution, and add that to the shares of counts for each candidate prefix.
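
A minimal sketch of that idea on one aggregator's side, using the continuous Laplace mechanism (a real deployment would need a discrete distribution over the field and a worked-out sensitivity/epsilon budget; all names here are illustrative, not draft API):

```python
import math
import random

def sample_laplace(scale: float) -> float:
    # Inverse-CDF sampling of the Laplace distribution with the given scale.
    u = random.random() - 0.5
    if u == -0.5:  # avoid log(0) on the boundary
        return sample_laplace(scale)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noise_count_shares(count_shares: list[float], epsilon: float,
                       sensitivity: float = 1.0) -> list[float]:
    # Each aggregator independently adds Laplace noise to its share of the
    # count for each candidate prefix; the noise survives the final sum, so
    # the collector only ever sees noised counts.
    scale = sensitivity / epsilon
    return [share + sample_laplace(scale) for share in count_shares]
```

Since each aggregator adds noise to its own shares, the honest aggregator's noise is present in the combined result even if the other adds none, which is the one-honest-server-out-of-two property mentioned below.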

simon-friedberger commented 9 months ago

@divergentdave But doesn't the same DP countermeasure apply to shuffling?

divergentdave commented 9 months ago

I think that without functional secret sharing, you would have to trust the server that decrypts inputs received from the anonymizing proxy to actually apply the DP noise when counting occurrences, and to keep the raw inputs private. Whereas with functional secret sharing, we can implement the addition of DP noise in a way that only needs one honest server out of two.

cjpatton commented 8 months ago

2024/1/25: We believe we can mitigate this threat by extending VDAF to allow enforcing Algorithm 3 in the Poplar paper: [image of Algorithm 3 from the Poplar paper]
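
Since the image doesn't reproduce here: as the thread discusses, Algorithm 3 is built around a prefix-count oracle. A minimal sketch of that style of threshold-pruned traversal (the oracle `count` and all names are illustrative, not the paper's notation):

```python
from typing import Callable

def heavy_hitters(count: Callable[[str], int],
                  threshold: int, depth: int) -> list[str]:
    # Threshold-pruned traversal of the prefix tree: only prefixes whose
    # (oracle-reported) count reaches the threshold are extended to the
    # next level. `count(p)` idealizes the number of strings with prefix p.
    candidates = [""]
    for _ in range(depth):
        candidates = [p + bit
                      for p in candidates
                      for bit in ("0", "1")
                      if count(p + bit) >= threshold]
    return candidates
```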

simon-friedberger commented 7 months ago

I need to pick nits, sorry. That algorithm assumes a prefix-count query oracle, so it cannot be implemented directly. The proposal was for the collector to tell the aggregators the number of reports in each tree branch, so that the aggregators can at least prevent the collector from exploring more leaves than the number of reports and the size of the anonymity set permit.

Example: With 100k reports and an anonymity set of 1k, the collector can explore 100k/1k = 100 leaves of their choosing instead of the entire tree. This still allows getting the 100 rarest values, which are probably the values of individuals.
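
A sketch of that cap, with hypothetical names (`num_reports` and `anonymity_set_size` are not draft parameters):

```python
def max_explorable_leaves(num_reports: int, anonymity_set_size: int) -> int:
    # The aggregators refuse to open more leaves than the report count
    # and the anonymity set size permit.
    return num_reports // anonymity_set_size

assert max_explorable_leaves(100_000, 1_000) == 100  # the example above
```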

schoppmp commented 7 months ago

I agree that without Differential Privacy, this attack allowing the malicious aggregator to explore N/T paths of its choice still exists (N = number of inputs, T = heavy-hitter threshold). It would still exist even if we made the aggregators commit to their aggregate shares before exchanging them, since the malicious aggregator can use prior knowledge about the weight of paths to consistently modify its shares.

I believe the best thing we can do is what @cjpatton suggested, i.e. implement the algorithm where the total count of the children at each expansion is capped by the count of the parent, make it clear in the security considerations that this still allows the attack you mention, and suggest Differential Privacy as a mitigation for that.

cjpatton commented 7 months ago

Concretely, it sounds like what we need to do is pass the previous aggregate result to Vdaf.is_valid() so that the Aggregators can enforce this. Does that sound right, @schoppmp?
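
Something like the following, perhaps (made-up types and names, not the draft's actual API): is_valid gets the previous aggregation parameter and the previous aggregate result, checks that each new candidate prefix extends a counted parent, and a separate check enforces the parent cap on the resulting counts:

```python
def is_valid(agg_param: tuple[int, list[str]],
             prev_agg_param: tuple[int, list[str]],
             prev_agg_result: dict[str, int]) -> bool:
    # Reject an aggregation parameter unless it advances exactly one level
    # and every candidate prefix extends a parent that was actually counted.
    level, prefixes = agg_param
    prev_level, _ = prev_agg_param
    if level != prev_level + 1:
        return False
    return all(p[:-1] in prev_agg_result for p in prefixes)

def check_parent_cap(result: dict[str, int],
                     prev_result: dict[str, int]) -> bool:
    # The invariant @schoppmp described: the counts of a parent's children
    # may not sum to more than the parent's own count.
    return all(
        sum(result.get(parent + b, 0) for b in "01") <= parent_count
        for parent, parent_count in prev_result.items())
```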

wbl commented 6 months ago

So I was a bit confused by what gets revealed and I think there's a very powerful attack here.

Let's say the strings are 6 bits, and we have 100110 reported 128 times, 101010 reported another 128 times, and 110010 is some string that appears only once and that an attacker would love to determine. Critically, the attacker knows that it starts with 110. We'll pretend nothing is known about the total number of reports / there's enough getting dropped. We'll say that the threshold for discovery is 64, and we're holding two prefixes at once, so each query is 4 strings and the top two survive.

Attacker reports that bit 1 happens 4097 times, and bit 0 happens 0 times. Next round we ask about 10 and 11. Now the attacker starts cheating. 10 is 256 as honest, but 11 is 3841. The next round is 100, 101, 110, and 111. The attacker doesn't boost 100 and 101, which are both 128, and then 110 gets boosted to 1921 while 111 is left at 1920. Now the next four are 1100, 1101, 1110 and 1111, and the attacker boosts so that 1100 is 961 while the others are only 960. Whatever tiebreaking happens, 1100 survives, then 11001, and then 110010.

In fact the attacker can do this blind: just by splitting their addition in half between the two extensions of the prefix every time, they get the information. And they (unless I'm missing something) see the resulting string. So it's not "learn how many clients hold strings in a small set" but "recover strings in the set at will", if you know enough to pick the prefix you want to start with. Application to URLs with tokens in them is left as an exercise.

Unless I'm really missing something I think submitter privacy is still preserved, but this is still really bad. (Edit: this may be what @simon-friedberger meant in the PR, but it wasn't clear to me from the bug how powerful it was.)
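
A toy simulation of the blind-splitting attack, under the assumptions above (one victim string reported once, a known 3-bit prefix, an attacker with 4096 total additive weight; the numbers differ slightly from the walkthrough, and secret sharing and verification are ignored entirely, this only traces the information flow):

```python
SECRET = "110010"      # victim's string, reported once
KNOWN_PREFIX = "110"   # attacker's prior knowledge
BOOST = 4096           # attacker's total additive weight (2^6 * threshold 64)

def honest_count(prefix: str) -> int:
    # Honest reports from the example: two heavy strings plus the victim.
    honest = {"100110": 128, "101010": 128, SECRET: 1}
    return sum(n for s, n in honest.items() if s.startswith(prefix))

recovered = KNOWN_PREFIX
boost = BOOST // 2 ** len(KNOWN_PREFIX)  # weight remaining at this depth
while len(recovered) < len(SECRET):
    boost //= 2
    # Add `boost` to both children; the child on the victim's path comes
    # out exactly one heavier, so the attacker extends toward it.
    counts = {b: honest_count(recovered + b) + boost for b in "01"}
    recovered += max(counts, key=counts.get)

assert recovered == SECRET
print("recovered:", recovered)
```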

schoppmp commented 6 months ago

Hmm, I don't think what you describe generalizes to longer strings, assuming the mitigation I described above is implemented.

If I understand correctly, the goal of the attacker in your example is to discover an unknown string with a known prefix, that appears only once (say, a URL with a security token in it). It does that by putting as much "weight" as possible on the known prefix, by actively changing the output of the corrupted aggregator. Now at the "unknown" part, it starts splitting the weight in half at each level, exploring the path with the greatest weight (since that's where the victim's string is).

In your example, the total weight the attacker starts with is 4097, which would be enough to make the entire tree consist of heavy hitters (2^6 * 64 = 4096). This won't work for longer strings, since the honest aggregator can abort if it sees a total weight greater than the number of client inputs. So to discover a uniform length-L token, the attacker would need at least 2^L clients' worth of total weight to start with.

That said, I agree that even with mitigations in place, Poplar reveals far more than only the heavy hitters (even passive adversaries learn the prefix tree. Actively changing the prefix tree is what enables all the attacks in this thread). I will try to make this very explicit in the PR I'm working on.

schoppmp commented 6 months ago

I pushed a first draft in #332. @simon-friedberger @wbl do the changes to is_valid plus the text in the security considerations adequately describe the issue?

cjpatton commented 6 months ago

I'd like to see more prose describing the invariants we want to guarantee. Also, as suggested there, I wonder if we want to enforce that the sum of the prefix counts is equal to the number of reports.

wbl commented 6 months ago

@schoppmp The attacker needs 2^b to recover b bits, but can wait to start recovering until after going down the tree a bit, and can subtract from other counts. So in one round they recover "very", next round "verysec", next round "verysecret", etc.