
Aggregate Reporting API

Potential deanonymization attack #18

Open appascoe opened 3 years ago

appascoe commented 3 years ago

This is per the discussion in https://github.com/AdRoll/privacy/issues/1, but is (hopefully) spelled out more clearly here.

A malicious actor opts to abuse the Aggregate Reporting API for deanonymization purposes. They start by selecting a subset of users to track, for example by flat bidding on each user, with every user assigned a unique flat bid. These bids could be drawn from a superincreasing knapsack of the form {1, 2, 4, 8, 16, ..., 2^(N-1)}, which permits tracking up to N distinct users, since the bids are effectively user IDs. Each user is frequency-capped to a single impression.
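For concreteness, a minimal sketch of that ID assignment (the user list and the userIdOf map are hypothetical attacker-side bookkeeping, not part of the API):

// Each tracked user gets a distinct power of two, so any sum of IDs
// uniquely identifies the subset of users that contributed to it.
const trackedUsers = ['user-a', 'user-b', 'user-c'];  // hypothetical identifiers
const userIdOf = new Map(
  trackedUsers.map((user, i) => [user, 2 ** i])       // user-a -> 1, user-b -> 2, user-c -> 4
);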

On impression delivery, the malicious actor calls:

// userID is this user's unique power-of-two flat bid
const entryHandle = window.writeOnlyReport.get(hash('publisher.example'));
entryHandle.set('spend', userID);

'spend' here is just an example; any summable quantity will work. With no noise added, the malicious actor can call the API and get back sum(spend) across all users. Because each user is effectively represented as a bit in an N-bit binary number, this sum perfectly identifies which users have visited publisher.example. (Note that a binary representation isn't necessary for the attack; any superincreasing knapsack will work.)
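Assuming an exact (noiseless) sum is returned, the read-out step might look like this sketch (decodeVisitors is a hypothetical helper, continuing the ID assignment above):

// Decode an exact aggregate sum back into the set of visiting users.
// (Bitwise & is fine for small N; larger N would need BigInt arithmetic.)
function decodeVisitors(sum, userIdOf) {
  const visitors = [];
  for (const [user, id] of userIdOf) {
    if (sum & id) visitors.push(user);  // bit set => this user's ID is in the sum
  }
  return visitors;
}
// e.g. with IDs user-a -> 1, user-b -> 2, user-c -> 4, a sum of 5 decodes to ['user-a', 'user-c']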

In theory, differential privacy adds noise to prevent any revelation that a particular user was included in the dataset. However, the malicious actor attempts to circumvent this by writing the same value under M distinct keys. On impression delivery, they do:

const entryHandle1 = window.writeOnlyReport.get(hash('publisher.example1'));
entryHandle1.set('spend', userID);
const entryHandle2 = window.writeOnlyReport.get(hash('publisher.example2'));
entryHandle2.set('spend', userID);
const entryHandle3 = window.writeOnlyReport.get(hash('publisher.example3'));
entryHandle3.set('spend', userID);
...
const entryHandleM = window.writeOnlyReport.get(hash('publisher.exampleM'));
entryHandleM.set('spend', userID);

The attacker then queries the API for each of these M reports and receives s1 = sum(spend1), s2 = sum(spend2), ..., sM = sum(spendM). They then compute int((s1 + s2 + ... + sM) / M), effectively canceling out the mean-zero noise that was added for differential privacy. They convert the resultant integer to binary and read off which users have visited publisher.example.
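A sketch of that averaging step, assuming the per-report noise is independent with mean zero (so the sample mean concentrates around the true total as M grows):

// Estimate the true sum from the M noisy per-report sums s1, ..., sM.
function estimateTrueSum(noisySums) {
  const mean = noisySums.reduce((acc, s) => acc + s, 0) / noisySums.length;
  return Math.round(mean);  // with enough reports, rounding recovers the exact integer
}
// const visitors = decodeVisitors(estimateTrueSum([s1, s2, /* ..., */ sM]), userIdOf);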

Note that it is not sufficient to limit the number of queries the malicious actor can make: they could be using multiple identities with different access policies to the reports, or be colluding with other malicious actors. An increase in the noise added for differential privacy can be countered by increasing M. Also, M doesn't need to be so large as to completely recover the least-significant bit; even reliably revealing only the most-significant bit deterministically exposes one user's presence, which corresponds to an epsilon of infinity.
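To get a sense of how large M must be for a given noise level, a rough back-of-the-envelope sizing (assuming independent, zero-mean noise with standard deviation sigma per report; the specific numbers are illustrative):

// The averaged estimate has standard error sigma / sqrt(M), so resolving the
// sum to the nearest integer needs that error to sit well below 0.5.
const sigma = 100;                              // hypothetical per-report noise scale
const requiredM = Math.ceil((2 * sigma) ** 2);  // ~40,000 reports for this sigma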

Instead, it seems the browser needs some mechanism to detect that an attack like this is occurring. That could be difficult, since the different hashes obfuscate the fact that a record for publisher.example is being repeated M times, and writing multiple keys per impression has legitimate uses, for example tracking spend for each of creative, interest group, campaign, advertiser, publisher, etc., which would likewise appear as opaque hashes on their own.