
TURTLEDOVE
https://wicg.github.io/turtledove/

FLEDGE: Optimal “Optimization & Performance” reporting for training bidding models #101

Open abdellah-lamrani-alaoui opened 3 years ago

abdellah-lamrani-alaoui commented 3 years ago

Hello,

As recently mentioned in another FLEDGE issue (https://github.com/WICG/turtledove/issues/93), DSPs and advertisers need access to a reporting capability that allows them to train machine-learning models in order to optimize their campaigns. This reporting is paramount for buy-side actors to learn meaningful information about the contexts driving performance, without gaining information on the browsing behavior of any given individual. As already mentioned in the Reporting in SPARROW proposal, these reports should provide information on contextual and interest-based features together.

“Any weakness in leveraging both signals together would undoubtedly hurt both the publisher's revenue and the user experience, exposing them to irrelevant advertisements or worse, unsafe content.”

To sum up, we think the proposal should seek to maximize the useful information buy-side actors get (i.e. what allows them to price impressions most accurately, thus driving up investment and maximizing overall value) under a given, well-defined privacy constraint (for instance, k-anonymity with differential checks). At Scibids, we see this accurate reporting capability as the single main variable determining whether the world's largest advertisers we work for will continue to get the results they expect from programmatic advertising, and thus keep investing in it.

We have worked on the subject and will submit, in the next two weeks, a proposal for what an optimal reporting procedure should look like, but we would be happy to have your thoughts on the ideas below and to see what form they could take in the FLEDGE implementation.

Why we think buy-side actors can go from no useful information to almost 100% of useful information under the same k-anonymity privacy constraint

Our main motivation for proposing something new lies in how harmful a “blunt” k-anonymity + differential check would be for machine-learning practitioners. Indeed, let's consider the simplest k-anonymity procedure with differential checks:

(blunt method) When asking for a report (e.g., in all generality, a click report) the DSP has to provide columns cj=1...J. Any line (c1=x1,...,cJ=xJ) will then be included only if it concerns more than k users, or else it will be discarded. A “remainder” line will indicate the number of clicks associated with the obfuscated lines, provided this remainder line itself concerns more than k different users.
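The blunt method above can be sketched as follows. This is only an illustrative sketch, not FLEDGE code: the row shape (a `user_id`, one value per reporting column, and a `clicks` count) and the function name are assumptions for the example.

```python
from collections import defaultdict

def blunt_k_anon_report(rows, columns, k):
    """Blunt k-anonymity sketch (hypothetical helper): keep a line only
    if its exact column-value combination covers at least k distinct
    users; fold everything else into a single 'remainder' line, which
    is itself subject to the same k-user threshold."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in columns)
        groups[key].append(row)

    report, remainder = [], []
    for key, grp in groups.items():
        users = {r["user_id"] for r in grp}
        if len(users) >= k:  # line is k-anonymous: keep it as is
            line = dict(zip(columns, key))
            line["clicks"] = sum(r["clicks"] for r in grp)
            report.append(line)
        else:                # line is discarded into the remainder
            remainder.extend(grp)

    # the remainder line must also cover at least k distinct users
    if len({r["user_id"] for r in remainder}) >= k:
        line = {c: "hidden" for c in columns}
        line["clicks"] = sum(r["clicks"] for r in remainder)
        report.append(line)
    return report
```

Note how coarse this is: any rare column combination is lost entirely, and if the remainder itself covers fewer than k users, even the aggregate count disappears.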

With this method, DSPs would be faced with the impossible task of specifying, for each campaign, which columns they want.

Please note that this problem cannot be alleviated by asking for a different column set if the first set gives unsatisfactory results, since differential checks would greatly affect the result of this second report.

Improvements over this method have already been proposed in the “reporting in SPARROW” proposal, as recalled below.

(RIS method) When asking for a report (e.g., in all generality, a click report) the DSP has to provide an ordered set of columns cj=1...J. A line (c1=x1,...,cj=xj) will be included* as is if it concerns more than k users. If it concerns fewer than k users, it will instead be counted in the line (c1=x1,...,cj-1=xj-1,cj=hidden), provided (c1=x1,...,cj-1=xj-1) concerns at least k users. The procedure applies recursively.

*with the one exception that when the (c1=x1,...,cj-1=xj-1,cj=hidden) line would otherwise concern fewer than k users, the line (c1=x1,...,cj=xj) concerning the fewest users is added to it, so that the hidden line reaches the k-user threshold.
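A minimal sketch of this recursive redaction, under the same assumed row shape as before: each line keeps the longest prefix of the ordered columns whose value combination still covers at least k distinct users, and deeper columns are replaced by "hidden". The starred edge case (topping up an under-threshold hidden line with the smallest sibling line) is deliberately left out to keep the sketch short, and it assumes the full dataset covers at least k users.

```python
from collections import defaultdict

def ris_report(rows, ordered_cols, k):
    """RIS-style recursive redaction sketch (hypothetical helper)."""
    # distinct users seen under every prefix (c1=x1, ..., cj=xj)
    prefix_users = defaultdict(set)
    for row in rows:
        for j in range(len(ordered_cols) + 1):
            key = tuple(row[c] for c in ordered_cols[:j])
            prefix_users[(j, key)].add(row["user_id"])

    report = defaultdict(int)
    for row in rows:
        # deepest prefix of the ordered columns that is still k-anonymous
        # (j = 0 is the empty prefix covering the whole dataset)
        j = max(d for d in range(len(ordered_cols) + 1)
                if len(prefix_users[(d, tuple(row[c] for c in ordered_cols[:d]))]) >= k)
        key = (tuple(row[c] for c in ordered_cols[:j])
               + ("hidden",) * (len(ordered_cols) - j))
        report[key] += row["clicks"]
    return dict(report)
```

The column ordering entirely determines which values survive, which is exactly the limitation discussed next.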

RIS provides a first incremental improvement over blunt k-anonymity reporting by introducing a notion of “feature importance”.

However, this approach does not solve the problem of requiring the DSP to set in stone an a priori importance of the variables, which is going to vary a lot depending on campaign typologies and KPIs. This seems suboptimal, since the reporting entity has full knowledge of the campaign dynamics and is thus much better equipped than the DSP to solve the “provide maximum useful information while respecting k-anonymity” problem. There is actually quite a large literature on “optimal k-anonymization” of a dataset, and we probably have an opportunity to finally use this work for large-scale practical applications.

We are thus going to propose something along these two axes of improvement:

Let’s take an example with k=2:

If we have:

| domain  | age | number of post-view conversions |
|---------|-----|---------------------------------|
| cnn.com | 24  | 0 |
| cnn.com | 24  | 1 |
| cnn.com | 23  | 1 |

Instead of:

| domain  | age    | number of post-view conversions |
|---------|--------|---------------------------------|
| cnn.com | hidden | 0 |
| cnn.com | hidden | 1 |
| cnn.com | hidden | 1 |

We would get:

| domain  | age   | number of post-view conversions |
|---------|-------|---------------------------------|
| cnn.com | 20-25 | 0 |
| cnn.com | 20-25 | 1 |
| cnn.com | 20-25 | 1 |
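The idea in this example, generalizing a column rather than suppressing it, can be sketched as below. The bucket width of 5 and the helper names are illustrative assumptions, not part of the proposal.

```python
from collections import defaultdict

def generalize_ages(rows, width=5):
    """Coarsen exact ages into buckets like '20-25' (illustrative
    bucket width) instead of hiding the column entirely."""
    out = []
    for row in rows:
        lo = (row["age"] // width) * width
        out.append({**row, "age": f"{lo}-{lo + width}"})
    return out

def is_k_anonymous(rows, columns, k):
    """Check that every column-value combination covers at least k
    distinct users."""
    users = defaultdict(set)
    for row in rows:
        users[tuple(row[c] for c in columns)].add(row["user_id"])
    return all(len(u) >= k for u in users.values())
```

On the example above with k=2, the raw (domain, age) lines are not k-anonymous because of the single 23-year-old user, but after bucketing all three lines fall into (cnn.com, 20-25) and the report keeps a usable age signal.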

We are still hammering out the details, but we wanted to share our vision before working on a more detailed implementation with a reasonable algorithmic complexity that would fit the large-scale datasets in our industry.

michaelkleber commented 3 years ago

This seems like a very interesting idea! I have a few questions.

  1. From a practical point of view, the idea of choosing the best ordering of features seems like something that can only be done by a party that looks at all the data. Our proposed aggregate reporting machinery doesn't have any server that is trusted to know the raw data; it relies on Secure Multi-Party Computation to avoid that. Do you think your idea is compatible with this model of no trusted aggregator?

  2. As with all the Privacy Sandbox APIs, we would prefer to give the ad tech companies that are going to rely on them as much flexibility and choice as we can, resorting to the browser making hard-to-understand decisions only when we have no other option. I don't understand enough about your proposal to know whether this would trigger that hard-to-understand risk. If the same entity got reports on two different days that preserved or redacted different fields, with no way to control or understand why they were different, I worry about their being unhappy. Is there any system that works like this today, or would it be an entirely new type of uncertainty that people would need to get used to?