WICG / attribution-reporting-api

Attribution Reporting API
https://wicg.github.io/attribution-reporting-api/

Poisoning Inputs to Aggregate Service #40

Closed ajknox closed 1 year ago

ajknox commented 4 years ago

Thanks for the detailed explainer, Charlie.

I think there may be a gap in the methods you describe to combat false inputs ruining the aggregate calculation. The Authenticated Inputs section nicely describes how you can use a trust token to authenticate, in a similar manner to previous discussions about event-level report APIs. There is an additional problem when the browser is creating secret shares instead of a plaintext report: the value the browser sneaks into the secret shares can't be easily verified, might be implausible, and can have an outsized impact on the final report.
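As a minimal illustration (a Python sketch of two-party additive sharing, not the API's actual encoding), each individual share is uniformly random, so a helper looking at its share alone has no way to judge plausibility:

```python
import secrets

MOD = 2 ** 16  # the 16-bit domain from the explainer's example

def share(value: int) -> tuple[int, int]:
    """Split a value into two additive shares mod 2^16."""
    s1 = secrets.randbelow(MOD)  # uniform, independent of the value
    return s1, (value - s1) % MOD

honest = share(5)
bogus = share(65000)
# Both reconstruct correctly, and each individual share is uniform on
# [0, 2^16): a helper cannot tell a plausible value from an implausible one.
assert (honest[0] + honest[1]) % MOD == 5
assert (bogus[0] + bogus[1]) % MOD == 65000
```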

Example Attack

Advertiser.shop is using the aggregate measurement API with many publishers to track conversions attributed to certain ad campaigns. Cheater.pub wants to make it look like they are driving more conversions than they actually are. They use a few genuine accounts to look at ads on cheater.pub and then make small purchases on advertiser.shop - since these transactions are all real and have real identities attached to them, they will pass the token procedure and submit reports with genuine tokens. However, cheater.pub slightly modifies the browsers they use for this exercise so that they record a wildly inflated value in the RawAggregateReport, near the maximum for the domain. Using the 16-bit domain example, they might be able to turn a $5 real purchase into a $655 purchase report. Even if the total revenue numbers don't add up, it will be very difficult to discover which reports led to the discrepancy, let alone which transactions on the website were used to get the tokens or which publisher got the bad credit.
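To make the attack concrete, here's a toy Python simulation (hypothetical values in cents, not the real report format): a single tampered report shifts the aggregate total while remaining indistinguishable from the honest ones:

```python
import secrets

MOD = 2 ** 16  # 16-bit domain

def share(value: int) -> tuple[int, int]:
    """Additively secret-share a value (in cents) between two helpers."""
    s1 = secrets.randbelow(MOD)
    return s1, (value - s1) % MOD

# Nine genuine $5.00 purchases, plus one report from a tampered browser
# claiming $600.00 - its shares look just as random as the honest ones.
reports = [share(500) for _ in range(9)] + [share(60000)]

# Each helper sums its column of shares; combining reveals only the total.
total = (sum(a for a, _ in reports) + sum(b for _, b in reports)) % MOD
assert total == 9 * 500 + 60000  # 64500 cents: one report added $595.00
```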

Mitigation

You mention that this work is inspired by Prio, which tries to mitigate this problem with the novel SNIPs construction to do efficient ZK proofs that the sum of the secret shares falls within a range. I don't believe that exact approach is feasible here because it relies on some minimal interactivity, and the EncryptedAggregateReport is held in escrow for a while by the reporting origin before being delivered to the helper servers. I believe the limited domain size (16 bits in the example) is intended to address this issue, but it is a bit tricky.

If you want the aggregate report to be able to capture sums larger than the range available to individual RawAggregateReports, you will need a way to translate those secret shares into a larger domain. I'm not aware of a general way to do this (though I'd love to hear if there is one!), but I can think of some specific tricks for specific cases that don't have great behavior. Example: treat the values in the EncryptedAggregateReport as "signed ints" and translate them into your new domain - this "works" but inadvertently allows negative RawAggregateReport values and allows for overstatement larger than the domain if there are more than 2 helpers!
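Here's a toy sketch of that "signed ints" trick and where it goes wrong (illustrative Python; the particular share values are made up):

```python
MOD = 2 ** 16

def signed(s: int) -> int:
    """Interpret a share in [0, 2^16) as a signed 16-bit int."""
    return s - MOD if s >= MOD // 2 else s

def lift(shares: list[int]) -> int:
    """'Signed int' translation into a larger domain: sum the signed shares."""
    return sum(signed(s) for s in shares)

# An honest report of 300, shared as (1000, -700 mod 2^16), lifts correctly:
assert lift([1000, (300 - 1000) % MOD]) == 300

# But a malicious client can encode 2^16 - 5, which lifts to a *negative* value:
assert lift([1000, (MOD - 5 - 1000) % MOD]) == -5

# And with three helpers, shares (32767, 32767, 32767) encode 32765 mod 2^16
# yet lift to 98301 - overstatement beyond the 16-bit domain:
assert lift([32767, 32767, 32767]) == 98301
assert sum([32767] * 3) % MOD == 32765
```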

Questions

Do you have additional thoughts or information about preventing this type of report poisoning?

Do you plan to translate the values into a larger domain for reporting?

Do you envision helper servers enforcing domain bit length, or does the encryption method enable the reporting origin to enforce length based on the EncryptedAggregateReport payload?

csharrison commented 4 years ago

Hey Andrew, thank you for the detailed issue! Yes indeed this is a shortcoming in the authentication scheme we proposed (in retrospect I should have mentioned it somewhere in the explainer but it was already getting pretty long-winded). The main reason we didn’t propose range proofs on the shares was due to their complexity. However, I’d definitely be interested in looking closer at how we can iterate on this proposal and improve things.

Do you have additional thoughts or information about preventing this type of report poisoning?

Beyond techniques like SNIPs, the two helpers in our construction can run secure two party computation protocols to check whether the shared values are really in the required range and discard records that do not conform to this. Of course, such protocols are more expensive than the current design.

One approach to alleviate this cost is to do this check only on a subset of the records chosen at random, which will give us some probabilistic guarantees - we can view this as an auditing mechanism. There may also be techniques to embed information within the authentication tokens (hidden from the client) to flag certain records for the helpers to audit.

Alternatively we can try to come up with tailored two party computation protocols that relax in some way the privacy or the soundness guarantees. For example, there may be schemes where accepting additional information leakage (perhaps formalized with DP bounds) between the helpers can achieve the goal with better performance.

Do you plan to translate the values into a larger domain for reporting?

I think this was underspecified in the explainer, where we mentioned 16 bit inputs. This limit on the input domain was primarily for limiting the sensitivity of the input on the client so that we could apply DP bounds (that is, noise added to value sums will scale proportional to 2^16).

For the MPC protocol we would want to choose a field much bigger than 16 bits, since it needs to be large enough to express the full sums (as you mentioned). Unfortunately this means that attackers in this scenario can report values greater than 2^16 since they can bypass any limitations imposed by the client. We don’t have any proposed mechanism to get around this, but I’m interested in hearing ideas. I’ll think more about the ‘signed int’ idea :)

Do you envision helper servers enforcing domain bit length, or does the encryption method enable the reporting origin to enforce length based on the EncryptedAggregateReport payload?

The client will enforce the 16 bit limit (i.e. the limit needed for capping sensitivity) and the helper servers will enforce that its inputs fall within the domain it is using for aggregation. I don’t think we have a scheme where the reporting origin can enforce any of this.

ajknox commented 4 years ago

Range proofs are complicated, but my intuition is they will be a lot less expensive than range checks with 2PC. Limiting to a range is far from fool-proof, but coupled with token/blind signature auth you can make an attack more burdensome or expensive.

I don't think checking a subset at random will be worth it - since a single row may be able to poison the entire query you probably need to range check everything or nothing. E.g. if you only check half the rows, attacks will still succeed 50% of the time - determined attackers can just attack repeatedly and some will go through. If you are going to bother at all you probably need to check 100%.
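The arithmetic backing this up, under a hypothetical audit model where each poisoned report is checked independently with some fixed probability:

```python
def poison_success_prob(check_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` poisoned reports evades an
    audit that independently checks each report with probability check_rate."""
    return 1 - check_rate ** attempts

# Checking half the rows: one attempt slips through 50% of the time,
# and a determined attacker making ten attempts almost certainly lands one.
assert poison_success_prob(0.5, 1) == 0.5
assert poison_success_prob(0.5, 10) > 0.999
```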

You could imagine "post hoc audits" if poisoning is easily detected in the calculation outcomes, but even this sounds like an avenue for malicious clients or maybe even publishers to trigger audits for denial-of-service.

I follow the point about 16 bits being more related to DP than as a range proof mechanism. Thinking about it more it's kind of the same thing - forcing a tight range both provides protection against poisoning and helps keep sensitivity in line. You may be able to offer this as a report parameter - the reporting origin could ask for a smaller domain/range to get better sensitivity or ask for a larger domain/range to avoid bucketing.

Don't spend much time on "signed ints", I'm pretty sure it creates more problems than it solves!

csharrison commented 4 years ago

For sensitivity issues, I was imagining that a client would optimize sensitivity by scaling their inputs to "fit snugly" in the 16-bit domain, since it seemed like enough bits to express a reasonably large range of values. Clients with inputs in a smaller range, e.g. [0, 1023], would just multiply the input by 64 on the client and then scale down aggregates later on, and vice versa for clients with inputs larger than 2^16.
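Concretely, the client-side scaling I have in mind (a sketch with made-up numbers, assuming inputs known to lie in [0, 1023]):

```python
MOD = 2 ** 16

# Client side: inputs in [0, 1023] are multiplied by 64 so they use the
# fixed 16-bit domain fully (1023 * 64 = 65472 < 2^16).
def encode(value: int) -> int:
    assert 0 <= value <= 1023
    return value * 64

# Reporting side: aggregated sums are scaled back down by the same factor.
def decode(aggregate_sum: int) -> float:
    return aggregate_sum / 64

reports = [100, 250, 1023]
aggregate = sum(encode(v) for v in reports)  # what the helpers would sum
assert decode(aggregate) == sum(reports)     # scaling round-trips exactly
```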

Conceptually, I imagined the input as a real number in the range [0, 1]. The 16 bit integer is just an approximation of that concept which works well with MPC schemes.

This avoided us needing to offer another configuration knob, and the helpers could just deal with all inputs having a fixed sensitivity.

Your points about the range checks make sense to me. I have a quick question: would you be OK with range proofs where the range is not dynamic based on conversion data? E.g. you could verify the value falls within the 16-bit range, but nothing more specific? I worry adding functionality beyond that may have adverse privacy properties.

henrycg commented 4 years ago

[Whoops... I posted that last comment before I finished writing it. Here is the full thing.]

Hi Andrew and Charlie, I have been working with the folks at Mozilla on their pilot private-aggregation system and I'm also one of the authors on the Prio paper.

Data-poisoning attacks are a concern in Mozilla's deployment as well, so we have spent some time thinking through the different potential defenses against these attacks. It's definitely possible to use traditional zero-knowledge range proofs or two-party computation to sanitize the secret-shared data before operating on it. However, both of these techniques generally require many heavy public-key cryptographic operations either on the client side, the server side, or both.

Prio's approach ("zero-knowledge proofs on secret-shared data") ends up being much lighter-weight, since it doesn't require any extra public-key cryptographic operations on either the client or the server. As far as I know, in most settings in which there are multiple parties holding secret shares of the data—as here—Prio-style techniques for preventing data-poisoning attacks strictly dominate the other approaches, in terms of communication and computational cost.

We also considered randomized checking to detect these poisoning attacks: you could imagine partitioning the data into k buckets, computing a statistic over each bucket, and then outputting the median of the k statistics. Unfortunately, as Andrew points out, an attacker who can control a few clients (around k) can still completely poison the output. In addition, these approaches tend to reveal extra information (k statistics instead of one), so there are negative implications for privacy as well.
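A deterministic toy version of that bucketing scheme (round-robin assignment for reproducibility; a real check would randomize the partition) shows how few malicious clients it takes:

```python
import statistics

def bucketed_median(reports: list[int], k: int) -> float:
    """Partition reports into k buckets, sum each bucket, and output
    the median of the k bucket sums."""
    sums = [0] * k
    for i, r in enumerate(reports):
        sums[i % k] += r  # round-robin for a deterministic sketch
    return statistics.median(sums)

k = 5
honest = [5] * 1000           # 1000 honest reports of value 5
poison = [60000] * (3 * k)    # a handful of malicious clients

assert bucketed_median(honest, k) == 1000
# Once every bucket contains poisoned values, even the median is ruined:
assert bucketed_median(honest + poison, k) == 181000
```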

csharrison commented 1 year ago

Closing this old issue as we no longer support an MPC-based backend so these kinds of poisoning attacks are less dangerous. We'll re-open if we ever introduce an MPC backend.