WICG / turtledove

TURTLEDOVE
https://wicg.github.io/turtledove/

Protected Audience AB testing #909

Open fhoering opened 7 months ago

fhoering commented 7 months ago

Why do we need A/B tests?

To give an example of what we mean by long-term effects, let's look at a complex user journey. Assume that we split users per publisher website (because we have access to the hostname in the PA API): on some publisher websites we apply buying strategy A and on others buying strategy B, and we can measure conversions such as sales for each ad display.

In retargeting, we show a banner to users multiple times before they buy. For example, if a user has added Nike shoes to their basket but has not converted, we remind them of the product through ads on several publishers. When they convert, the sale is attributed to the publisher on which the last ad was shown, not to whatever happened before that. In other terms, it is impossible to measure the effect of a buying strategy A versus B, since we will not have a single identifier across sites.

Existing mechanism with ExperimentGroupId

https://github.com/WICG/turtledove/blob/main/FLEDGE.md#21-initiating-an-on-device-auction

Optionally, perBuyerExperimentGroupIds can be specified to support coordinated experiments with buyers' trusted servers. If specified, this must also be an integer between zero and 65535 (16 bits).

The expected workflow has been described here: Extending FLEDGE to support coordinated experiments by abrik0131 · Pull Request #266 · WICG/turtledove

Our understanding is that this translates to the following workflow: (diagram: exp_group_id_workflow)
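For concreteness, a rough sketch of how a seller's auction config could carry this id, per the section linked above (origins, URLs and the id value are purely illustrative):

```js
// Illustrative auction config; origins, URLs and the group id are made up.
// perBuyerExperimentGroupIds is the existing 16-bit id forwarded to the
// buyer's trusted key/value server for coordinated experiments.
const auctionConfig = {
  seller: 'https://seller.example',
  decisionLogicURL: 'https://seller.example/decision-logic.js',
  interestGroupBuyers: ['https://buyer.example'],
  perBuyerExperimentGroupIds: { 'https://buyer.example': 4321 },
};
const auctionResult = await navigator.runAdAuction(auctionConfig);
```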

Pros:

Cons:

Splitting per interest group and 1st party user id

A per-interest-group split seems appealing because, for interest groups created on one advertiser's website, one could apply the same changes to the same campaigns for all 1st-party users of this advertiser.

This would mainly work for single-advertiser AB tests where we target users who already went to the advertiser's web page. It would work less well for more complex scenarios on all our traffic where we modify the behavior of multiple campaigns on multiple websites; in that case we have the same drawback as above: the very same user could see behavior changes from both population A and population B.

As we would split users during the tagging phase, we cannot guarantee that we actually see those users again for a bidding opportunity. So we cannot guarantee an even split: at bidding time we might only see n% of the users of population A and a different share of population B (some more explanation here: Approach 2: Intent-to-Treat).
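For illustration, a minimal sketch of such a tagging-time split, hashing the advertiser's 1st-party id into A/B and storing the assignment on the interest group (all names, and the userBiddingSignals field used to carry the assignment, are illustrative):

```js
// Illustrative tagging-time split on the advertiser page; names are made up.
async function joinWithAbSplit(firstPartyUserId) {
  // Stable hash of the advertiser's own user id -> population A or B.
  const digest = await crypto.subtle.digest(
      'SHA-256', new TextEncoder().encode(firstPartyUserId));
  const population = (new Uint8Array(digest)[0] % 2 === 0) ? 'A' : 'B';

  await navigator.joinAdInterestGroup({
    owner: 'https://dsp.example',
    name: 'advertiser-retargeting',
    biddingLogicURL: 'https://dsp.example/bidding-logic.js',
    // Carry the assignment so generateBid can apply strategy A or B later.
    userBiddingSignals: { abPopulation: population },
  }, 30 * 24 * 60 * 60 /* lifetime in seconds */);
}
```

The caveat above still applies: the assignment happens at tagging time, so nothing guarantees that both populations show up in equal proportions at bidding time.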

(diagram: user_id_split)

Pros:

Cons:

Using shared storage for AB testing

The shared-storage proposal already has a section on how to run AB tests. The general idea is to create a unique user identifier (seed) for the Chrome browser with generateSeed, then call the window.sharedStorage.selectURL operation, which takes a list of URLs, hashes the user identifier to an index in this list, and returns the URL for that user. The AB test population would be encoded in the URL, and as the list is limited to 8 URLs it allows 3 bits of entropy for the user population. As different URLs can be used for each call and would leak 3 bits each time, some mechanisms are in place to limit the leakage via a budget per 24h based on the number of distinct URLs (see https://github.com/WICG/shared-storage#budgeting).
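For reference, a rough sketch of that flow as we understand the shared-storage explainer (the module name, operation name, seed key and URLs are illustrative):

```js
// Page context: store a per-browser seed once, then let the worklet pick
// one of up to 8 URLs (3 bits) for this browser.
await window.sharedStorage.set('ab-seed', crypto.randomUUID(),
                               { ignoreIfPresent: true });
await window.sharedStorage.worklet.addModule('ab-experiment.js');
const opaqueUrl = await window.sharedStorage.selectURL('ab-experiment', [
  { url: 'https://ad.example/creative?pop=0' },
  { url: 'https://ad.example/creative?pop=1' },
  // ... up to 8 entries in total, i.e. 3 bits of entropy
]);
// opaqueUrl is then rendered inside a fenced frame.

// ab-experiment.js (shared storage worklet): hash the seed to a stable index.
class AbExperimentOperation {
  async run(urls) {
    const seed = await sharedStorage.get('ab-seed');
    let h = 0;
    for (const c of seed) h = (h * 31 + c.charCodeAt(0)) >>> 0;
    return h % urls.length;  // same browser + same URL list => same choice
  }
}
register('ab-experiment', AbExperimentOperation);
```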

As of now, shared storage can only be called from a browser JavaScript context and not from a Protected Audience worklet. This means the URL selection can only happen during rendering and not during bidding, and therefore shared storage can only be used for pure creative AB tests, not for Protected Audience bidding AB tests. So we still need a dedicated proposal to enable Protected Audience AB tests.

Proposal - Inject a low-entropy global user population into generateBid

For real-world scenarios, a global user population would still be needed for AB tests that measure complex user behaviors. As injecting any form of user identifier would leak additional information, we propose a low-entropy user identifier together with some mitigations to prevent it from being used as, or combined into, a full user identifier.

Chrome could cluster all users into a low-entropy UserExperimentGroupId of something like 3 bits. This identifier should be drawn randomly for each ad tech rather than being shared across all actors, so that our measurement cannot be influenced by the testing of other ad techs.

As attribution is measured for each impression or click, we would like this identifier to be stable for some time, but it should also be shifted for a certain share of users to prevent a large population drift over time. Long-running AB tests influence users, so user behavior changes over time, introducing bias. The usual way to solve this is restarting the AB test, which cannot be done here with such a limited number of buckets. So one idea might be to constantly rotate the population. Constantly rotating the population would also help limit the effectiveness of a coordinated attack among ad techs to identify a user. If 1% of users get reassigned to a new population each day, then after 14 days up to 14% of users might have shifted population.
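As a rough check on that figure, assuming the 1% of users reassigned each day are drawn independently, the share of users reassigned at least once after 14 days is

$$
1 - (1 - 0.01)^{14} \approx 13.1\%,
$$

so the 14% quoted above is a slight over-count (an upper bound), since a few users get reassigned more than once.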

If the labels are rotated every X weeks, it adds further burden to those trying to collude and update their 1st-party ID → global ID mappings

This new population id would be injected only into the generateBid function and into the trusted key/value server (to mirror the current ExperimentGroupID behavior; and because many of our computations are still server-side, this is secure by design, as it runs in a TEE without side effects).
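To make the proposal concrete, a hypothetical sketch of what this could look like in generateBid; the browserSignals.userExperimentGroupId field does not exist today and is invented here purely for illustration, as are the bid values:

```js
// Hypothetical: userExperimentGroupId is the proposed browser-assigned,
// per-adtech 3-bit label; it is NOT part of the current Protected Audience API.
function generateBid(interestGroup, auctionSignals, perBuyerSignals,
                     trustedBiddingSignals, browserSignals) {
  const group = browserSignals.userExperimentGroupId;  // proposed, 0..7
  const strategy = group < 4 ? 'A' : 'B';              // e.g. 4 buckets per arm
  const bid = strategy === 'A' ? 1.0 : 2.0;            // illustrative strategies

  // Picking a pre-registered ad whose renderURL encodes the population is one
  // of the existing egress channels discussed just below.
  const ad = interestGroup.ads[group % interestGroup.ads.length];
  return { bid, render: ad.renderURL };
}
```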

The identifiers could only get out of the generateBid function via existing mechanisms that already come with privacy/utility trade-offs, for example:

If we encode the 3 bits into the renderUrl, this proposal seems very aligned with the shared-storage proposal to allow 8 URLs (= 3 bits of entropy) for selectURL to enable creative AB testing (post-bidding). In our case, as Chrome would control the seed and the generateSeed function could not be used, we would not leak more than 3 bits. So introducing any form of budget capping seems unnecessary.

To prevent a cookie-sync-like scenario where ad techs combine this new id into a full user identifier, Chrome could add an explicit statement to the attestation forbidding ad techs from sharing this id.

By design, as we have few AB test populations, we could only run a limited number of AB tests at the same time, but we could reserve this mechanism for important AB tests and use the ExperimentGroupId mechanism more for technical AB tests.

remysaissy commented 7 months ago

Hello, I can confirm that we at Teads are facing the same issue. The situation is well summarized, as are the possible options, so nothing to add on this, but we would be very interested in having this issue solved too. Thanks.

fhoering commented 7 months ago

This has been discussed in the WICG call from the 29th of November 2023.

There was a question from @michaelkleber about why the scenario of splitting by interest group / 1st-party user id would not work.

Let's imagine a scenario where I want to test 2 buying strategies across all my advertisers, one where I always bid 1 EUR (A) and one where I always bid 2 EUR (B).

In today's world I would apply either strategy A or B to users and then measure how many displays, clicks & sales I get. Note that paying less doesn't mean the user will also buy something. What I want to find is the best buying strategy.

Now let's say one Chrome browser does one auction. I have 2 advertisers, create 1 interest group per advertiser, and then split by each advertiser's 1st-party user id. During the auction each IG participates, and across all Chrome browsers we would have 25% that see the AA scenario, 25% AB, 25% BA and 25% BB. For the AA and BB scenarios it is all good. For the AB & BA scenarios, B would always win the auction, as 2 EUR is always higher than 1 EUR. So B would win the auction in 75% of the cases and A in only 25%. So my split would be heavily unbalanced towards B. As I have competition, I will not get exactly 75% of displays on B vs 25% on A. Also, for more complicated buying strategies I will not know that in reality I exposed users 75/25 rather than 50/50, so I cannot compensate in some form.
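For what it's worth, a toy simulation of exactly this setup (two advertisers, one IG each, independent 50/50 assignment, fixed 1 vs 2 EUR bids, no outside competition) confirms the skew:

```js
// Toy simulation of the per-advertiser split above: each browser holds one IG
// per advertiser, each IG is independently assigned A (bid 1 EUR) or B (2 EUR),
// and the highest bid wins the single auction in that browser.
function simulate(numBrowsers = 100000) {
  let winsA = 0, winsB = 0;
  for (let i = 0; i < numBrowsers; i++) {
    const bid1 = Math.random() < 0.5 ? 1 : 2;  // advertiser 1's IG
    const bid2 = Math.random() < 0.5 ? 1 : 2;  // advertiser 2's IG
    if (Math.max(bid1, bid2) === 2) winsB++; else winsA++;
  }
  // Expected: ~25% of auctions won by strategy A, ~75% by strategy B.
  console.log('A:', winsA / numBrowsers, 'B:', winsB / numBrowsers);
}
simulate();
```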

So if I can't apply a unique split inside one auction, this form of split doesn't seem to work at all for cross-advertiser buying strategies, even for retargeting campaigns.

As a side note, splitting by time (hour, day, ...) usually doesn't work because users don't have the same behavior over time (see Black Friday, for example).

EDIT: removed cost per sales metrics to simplify the example

alois-bissuel commented 3 months ago

Jumping in on the subject to double down on what @fhoering explained: the issue here is that, during the test, we won't be able to measure what will happen once the tested modification is rolled out.

For instance, suppose one user has two interest groups for one adtech: one (IG1) is in the reference population (no modification of the bidding) and the other (IG2) is in the test population. Let's assume you want to test a large change of bid, i.e. a big lowering of the bid on some opportunities which might be less profitable. During the course of the test, IG1 will always win on these opportunities. At roll-out, IG1 and IG2 will have a more equal chance of winning the opportunity.

Thus, the measurement during the test will be impacted by competition within an adtech, which won't happen after roll-out.

michaelkleber commented 3 months ago

@alois-bissuel I think we talked about this during the 2023-11-29 call. This kind of bidding experiment is one where it makes sense to randomize A/B diversion based on a 1st-party identifier on the publisher site. Then all of your IGs will compete against your other IGs using the same strategy on a single page (or even across a single site), so it will be reflective of the effect of rolling it out broadly.
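Concretely, something like the following sketch, where publisherFirstPartyId and experimentSalt are just illustrative names and the resulting arm could be passed into bidding, e.g. via perBuyerSignals:

```js
// Illustrative: divert the A/B arm from the publisher-site 1st-party id, so
// every IG of the same adtech on this page/site uses the same strategy.
async function divertByPublisherId(publisherFirstPartyId, experimentSalt) {
  const input = new TextEncoder().encode(publisherFirstPartyId + '|' + experimentSalt);
  const digest = new Uint8Array(await crypto.subtle.digest('SHA-256', input));
  return digest[0] % 2 === 0 ? 'A' : 'B';  // same id + salt => same arm
}
```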

fhoering commented 3 months ago

In reality, changing the bid strategy is a complex behavior. So it will never be as easy as knowing in advance what effect it will produce, e.g. that the bid will always be lower in all cases.

And in the case of a split by publisher 1st-party id, I have the problem that I cannot know which bid strategy produced the user's conversion behavior in the end. The user goes to publisher1 (high bid, sees several ads), publisher2 (low bid, no ad), publisher3 (low bid, sees one ad and clicks), and then buys something. This is what was shown on slide 7 of https://github.com/WICG/turtledove/blob/main/meetings/2023-11-29-FLEDGE-call-minutes-slides-ab-testing.pdf.

To me this ask still makes sense, and 3 bits seems reasonable and very aligned with the shared storage API. It could be seen as a convergence of the Privacy Sandbox APIs.