
Batching operation from Noisy Ranking #45

Open michaelkleber opened 2 years ago

michaelkleber commented 2 years ago

I'm trying to understand the "batching" and "caching" operations from the Noisy Ranking proposal, and in particular how they help with the re-identification risk that we've discussed before in #11.

I see that each batched request contains a list of browserEmbeddings represented as dense vectors (which have undergone noising). If I understand correctly, each of those embedding vectors is based on both the page the user is visiting ("Publisher-user input features") and the user's stored ads interests ("User-ads input features"), so of course the noising/batching/caching is key to the system not leaking browsing history.
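To check my reading, here's a minimal sketch of how I picture the embedding construction. Everything here (`encode`, `browser_embedding`, the feature strings, the noise scale) is my own stand-in, not anything from the proposal:

```python
import numpy as np
from zlib import crc32

DIM = 64
rng = np.random.default_rng(0)

def encode(features, dim=DIM):
    """Toy stand-in for a learned encoder: hash each feature string
    into a fixed random direction and sum the directions."""
    v = np.zeros(dim)
    for f in features:
        feature_rng = np.random.default_rng(crc32(f.encode()))
        v += feature_rng.standard_normal(dim)
    return v

def browser_embedding(page_features, user_ads_features, sigma=0.25):
    """Combine publisher-user and user-ads features into one dense
    vector, clip it to unit norm, then add per-vector Gaussian noise."""
    v = encode(page_features) + encode(user_ads_features)
    v = v / max(1.0, np.linalg.norm(v))              # norm constraint
    return v + rng.normal(0.0, sigma, size=v.shape)  # noising

noisy = browser_embedding(["pub:news.example", "topic:sports"],
                          ["interest:running", "interest:travel"])
```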

Are the vectors in a particular batch supposed to share anything other than the choice of DSP and embedding model? In particular, is there any batching by publisher going on, or might they all come from different pubs? And if the pubs might be all different, then do you expect your DP noising to be large enough that a buyer would not even be able to figure out which embedding vector came from one particular "colluding publisher"?

I ask because if the DSP can figure out which embedding vector came from a particular publisher, then that seems to make it easy to undo the batching. But on the other hand, if a DSP cannot figure out the publisher of a particular vector, then I worry that the design will be very difficult to integrate into buyers' existing workflows, which of course has been a key PARAKEET design goal.
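To make the concern concrete, here is a toy version of the "colluding publisher" test I have in mind (again, all names are mine): a publisher shares an approximate page-side embedding with the DSP, and the DSP tries to peel that publisher's vectors out of the batch by similarity.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def vectors_matching_pub(batch, pub_embedding, threshold=0.8):
    """Toy re-identification test: flag batch vectors whose similarity
    to a known (colluding) publisher's page-side embedding is high."""
    return [v for v in batch if cosine(v, pub_embedding) > threshold]
```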

I have a similar question about the caching operation behind the throttling. Would a particular user's cache entry from their visit to pub1 be reused for a visit to pub2? If not, then the DSP could see many requests from the same user in which the user-specific signals stay the same (up to noise) but the publisher-specific signals vary, and we're back to a re-identification attack.
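In sketch form, the differencing attack I'm worried about (my notation, not the proposal's): if each vector is roughly `user_part + pub_part_i + noise_i`, then collecting many of them for one user lets the DSP average away everything but the user part.

```python
import numpy as np

def estimate_user_component(vectors_for_one_user):
    """If the publisher parts are close to zero-mean across many
    publishers, the mean of n vectors concentrates around user_part:
    the noise standard deviation shrinks like 1/sqrt(n)."""
    return np.mean(vectors_for_one_user, axis=0)
```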

Apologies if these questions reveal that I'm misunderstanding a part of the proposal, which I readily admit I have not yet fully absorbed.

jpfeiffe commented 2 years ago

I think you're understanding the core of the proposal well, and exposing some of the details that need to be designed.

The batching is at the per-DSP level, not per publisher. There are a small number of signals we believe are probably fair to expose (e.g., language, region/datacenter) for cost-to-serve mitigation, but we agree publisher would be extremely identifying. A well-trained embedding, with the norm constraints, batching, and per-vector noising, should ideally narrow the search down to a group of similar publishers, but with the noise it will likely not identify an individual publisher. Further, the DP noise with random permutations will ensure a minimal delta between a user being in the dataset and not. As a consequence, if the DSP wants to ensure coverage across publishers, they need to make sure a stratified sample is sent back.
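On that last point, a rough sketch of what we mean by a stratified sample; `cluster_of` here is a hypothetical coarse cluster assignment the DSP computes from a noisy vector, not a finalized API:

```python
import numpy as np
from collections import defaultdict

def stratified_response(batch, cluster_of, k_per_cluster=2):
    """Since noise keeps the DSP from pinpointing publishers, it can
    only bucket vectors into coarse similarity clusters; sampling from
    every cluster keeps coverage across publishers instead of ranking
    the batch globally."""
    buckets = defaultdict(list)
    for i, v in enumerate(batch):
        buckets[cluster_of(v)].append(i)
    rng = np.random.default_rng()
    picked = []
    for ids in buckets.values():
        take = min(k_per_cluster, len(ids))
        picked.extend(rng.choice(ids, size=take, replace=False))
    return sorted(int(i) for i in picked)
```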

The cache is also not fully designed, but lookups from the cache:

a) Would not involve a call to the DSP, and would reside on the TM. The TM could then use the real user vector, or really any other signals that might be available.
b) Could be keyed off the user, but could also be the result of other user lookups that might be more relevant to the current user request.
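Roughly, in pseudocode (names are placeholders, nothing here is final):

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class CacheEntry:
    vector: np.ndarray  # user vector the entry was computed for
    ads: list = field(default_factory=list)  # DSP response cached on the TM

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def tm_cache_lookup(cache, user_id, user_vector, min_sim=0.9):
    """(a) No DSP call: this runs on the TM with the real user vector.
    (b) Keyed off the user, falling back to another user's cached lookup
    when that entry is sufficiently similar to the current request."""
    if user_id in cache:
        return cache[user_id]
    similar = [e for e in cache.values()
               if cosine(user_vector, e.vector) >= min_sim]
    if similar:
        return max(similar, key=lambda e: cosine(user_vector, e.vector))
    return None  # miss: fall through to a batched, noised DSP request
```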

An interesting question remains of when we can issue additional requests for a user: e.g., if for a user we have:

a) a current vector `u_t`,
b) a previous sequence `u_1, ..., u_{t-1}`, and
c) a constraint `min_i <u_t, u_i> > DPNoise`,

can we now send the noisy vector again? Intuitively... yes, but for now we simply throttle at daily granularity to restrict the calls, along with some additional repeated-user constraints.
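In sketch form, with `dp_noise_threshold` standing in for the DPNoise bound above (and to be clear, this is illustrative, not a spec):

```python
import numpy as np

def may_resend(u_t, history, dp_noise_threshold, last_sent_day, today):
    """Resend only if constraint (c) holds, i.e. min_i <u_t, u_i> >
    DPNoise, and the daily-granularity throttle is respected."""
    if today <= last_sent_day:  # at most one request per day
        return False
    if not history:             # no prior sends: nothing to compare against
        return True
    return min(float(u_t @ u_i) for u_i in history) > dp_noise_threshold
```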