WICG / shared-storage

Explainer for proposed web platform Shared Storage API

Supporting server-side (e.g. pre-ads-auction) experiments #22

Closed rwiens closed 1 year ago

rwiens commented 2 years ago

Hi! I'm really excited to see that this Shared Storage API proposal is trying to create a solution for cross-site experiments. However, the current proposal seems best suited to experimenting with within-browser changes, such as additional filtering after an ad network has already delivered an ad. We're strongly interested in enabling user-consistent, cross-site, server-side experiments, such as controlling whether a new ad format type would even be included in the initial server-side auction, or changing how an ad creative is rendered before it's sent to the browser.

Are there options here to expand this design to better suit server-side experiments (or possibly even experiments with simultaneous client- and server-side changes)? There's a very large number of different types of filtering and logic that happen server-side, so we'd really appreciate a solution that can scale well to a variety of use cases.

If you're uncertain about requirements, here are a few of ours that may be helpful as you think about this:

As a potential starting point idea for you to consider: if you're able to allow third parties access to a k-anonymous, random bucket ID that's user-sticky and consistent across sites, this would allow us to run cross-site experiments. Each third party (e.g. the ad network) would want an independent user-to-bucket grouping so they don't influence one another's metrics, and that should also benefit the user by making it harder to identify common users across different third parties. It'd be nice if the number of bucket IDs available to a third party were proportional to the traffic that party receives, since anonymity is closely tied to the underlying population - e.g. 1 bit can be identifying if you have 2 users, but it's not identifying if it's evenly split among 1000 users.

I haven't fully thought through all the attack vectors and privacy implications here, so I imagine this rough idea will need improvement. Please let me know if there are further details you'd like to discuss, and thanks for your time.
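
To make the shape of this idea concrete, here's a minimal sketch (mine, not part of any proposal) of how a browser might derive such a sticky, per-third-party bucket ID. The per-profile secret, the HMAC construction, and the bucket-count policy are all assumptions for illustration.

```js
// Illustrative sketch only: one way a browser could derive a sticky,
// per-third-party bucket ID. Every name and choice here is hypothetical.
async function deriveBucketId(perProfileSecret, thirdPartyOrigin, bucketCount) {
  // Mix a browser-held random secret with the third party's origin so each
  // third party sees an independent grouping that can't be joined across parties.
  const key = await crypto.subtle.importKey(
    "raw", perProfileSecret, { name: "HMAC", hash: "SHA-256" }, false, ["sign"]);
  const mac = await crypto.subtle.sign(
    "HMAC", key, new TextEncoder().encode(thirdPartyOrigin));
  // Reduce to a low-entropy bucket index (modulo bias ignored for brevity);
  // the result stays stable until the user resets the secret.
  return new DataView(mac).getUint32(0) % bucketCount;
}
```

A real design would still need to decide how bucketCount is sized per third party, how resets work, and whether noise is added, which is exactly the k-anonymity discussion below.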

jkarlin commented 2 years ago

Thanks, Rachel, for the well-explained use case!

You've proposed a stable (but noise-tolerant) per-third-party low-entropy ID, which can then be used server-side for experimentation purposes. A tricky part here is the per-third-party part, as that means that if multiple third parties collude, they can join their identifiers together to form a global identifier. On the other hand, a browser-wide identifier would extend beyond the capabilities of today's third-party cookies, which is a privacy problem of its own. It would be nice if we could do something like we do with Topics here, which is to assign each user multiple IDs (say 5) and give different sites access to a different ID, making it harder to join the user across sites. I don't think that works well for experiments, though, as something like lift measurement requires consistency across sites. Ideas are very much welcome on how we might reduce the cross-site information!

The second tricky part is the number of bits provided. This is fingerprinting information that gets compounded with other surfaces (IP address, User-Agent, etc.), so we want to reduce such surfaces as much as possible. One mechanism we already use in Shared Storage is fenced frames plus a user gesture: the destination site can't learn the low-entropy ID until the resulting fenced frame navigates (which requires a click). This at least slows down the propagation of the bits, but they can still get out there. The more bits per click, the faster a site can be tied to a globally consistent and unique identifier. This could be mitigated by ensuring that the top-level navigation from the fenced frame doesn't convey the third-party state, or that the information leakage is temporary (e.g., clearing state on the destination site, or making the visit from the fenced frame happen in incognito mode).
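
For readers following along, the flow being described looks roughly like the sketch below, using the explainer-era runURLSelectionOperation name from this thread (the shipped API later renamed it selectURL). The module name, operation name, and candidate URLs are hypothetical.

```js
// Embedder page: ask Shared Storage to pick one of N candidate ad URLs.
// The result is an opaque URL rendered in a fenced frame, so the embedding
// page itself never learns which URL (i.e. which cross-site bits) was chosen.
await window.sharedStorage.worklet.addModule("experiment-worklet.js");

const candidateUrls = [
  "https://ads.example/render?arm=0",
  "https://ads.example/render?arm=1",
  // ... one URL per possible outcome
];

// Explainer-era method name as used in this thread.
const opaqueUrl = await window.sharedStorage.runURLSelectionOperation(
  "pick-experiment-arm", candidateUrls);

// "experiment-worklet.js" registers "pick-experiment-arm" and returns the
// index of the chosen URL (see the worklet sketch later in the thread).
const frame = document.createElement("fencedframe");
frame.src = opaqueUrl; // the destination only learns the bits after a user-gesture navigation
document.body.append(frame);
```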

Another thing that's different between your proposal and Shared Storage is that Shared Storage assumes the cross-site information used to make a decision could be different each time. That is, each time information is leaked, we have to assume that it's new information. But your use case requires the same bits to be provided each time. There is room for a budgeting optimization here in Shared Storage: we could avoid counting the leakage of the same information twice. That would take some careful thought for how we might do it in an efficient and easy-to-use way.
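
As a purely hypothetical illustration of "not counting the same leak twice", a per-site budget could be keyed by the leaked value itself. Nothing like this is specified anywhere; it only shows the accounting idea.

```js
// Hypothetical per-site leakage budget that only charges for novel values.
class LeakageBudget {
  constructor(maxBits) {
    this.maxBits = maxBits;
    this.spentBits = 0;
    this.seenValues = new Set();
  }

  // Charge `bits` for leaking `value` to this site, unless the identical value
  // has already been charged (re-leaking the same sticky group ID is free).
  tryLeak(value, bits) {
    if (this.seenValues.has(value)) return true;
    if (this.spentBits + bits > this.maxBits) return false;
    this.spentBits += bits;
    this.seenValues.add(value);
    return true;
  }
}
```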

Finally, I think the existing output gate that you might use for this (runURLSelectionOperation) isn't a perfect fit. You don't actually want to supply 256 URLs; it'd be simpler if you just provided a URL template (e.g., "https://example.com/ad_requests?user_low_entropy_id=%d") and Shared Storage could write in a few bits. We'd still need to make sure that the output URL is k-anonymous, but there is no privacy/security difference between that and runURLSelectionOperation as it stands. This mostly seems like an ergonomics issue and is a lower concern than the above.
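
To illustrate the ergonomics gap (the example domain and parameter name come from the comment above; the template form is only a suggestion in this thread, not an existing API):

```js
// As the output gate stands, the caller enumerates one URL per possible ID
// value (here 2^8 = 256 of them):
const urls = Array.from({ length: 256 }, (_, id) =>
  `https://example.com/ad_requests?user_low_entropy_id=${id}`);
// runURLSelectionOperation("pick-id-url", urls) would then return the opaque
// URL whose index equals the user's low-entropy ID.

// The suggested alternative: hand Shared Storage a single template and let
// the browser substitute the (k-anonymous) bits itself.
const template = "https://example.com/ad_requests?user_low_entropy_id=%d";
```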

rwiens commented 2 years ago

Thanks for the quick response! Responses inline. For conciseness, I'll refer to the per-third-party low-entropy ID we're discussing as the "group ID".

> You've proposed a stable (but noise-tolerant) per-third-party low-entropy ID, which can then be used server-side for experimentation purposes. A tricky part here is the per-third-party part, as that means that if multiple third parties collude, they can join their identifiers together to form a global identifier. On the other hand, a browser-wide identifier would extend beyond the capabilities of today's third-party cookies, which is a privacy problem of its own. It would be nice if we could do something like we do with Topics here, which is to assign each user multiple IDs (say 5) and give different sites access to a different ID, making it harder to join the user across sites. I don't think that works well for experiments, though, as something like lift measurement requires consistency across sites. Ideas are very much welcome on how we might reduce the cross-site information!

Would you mind explaining more about the attack vector for how multiple third-parties would collude to join their identifiers and form a global one? I'm having difficulty picturing how this could be done post-request, assuming two different third parties are trying to join their logs on a shared piece of PII that wasn't already a unique global identifier. I can only think of the 2 vectors below but would like to understand if there are other risks I'm missing.

A) At serving time, if both of the colluding parties were present on the same request. For example, if the first third party on the request gets its group ID from Chrome and then redirects the request to a second third party, who also gets its own group ID from Chrome and appends it to the passed-along previous group ID. Are there any mitigations you're aware of that might be able to prevent this?

B) Combining less granular signals to make a more granular identifier remains a risk, and we'll need to spend some more time considering whether a high enough value of k plus noise would be sufficient or whether additional mitigations are needed. Third-party cookies today are arguably high-entropy, per-third-party, and somewhat stable (modulo the user manually clearing cookies), so I'm not sure I'm following your comment about how a group ID would provide more power than a third-party cookie today (assuming users are given control to reset their group IDs). Or are you talking about the case where we instead made group IDs global across all third parties in the browser in order to mitigate A), and the different set of privacy concerns that would introduce?

> The second tricky part is the number of bits provided. This is fingerprinting information that gets compounded with other surfaces (IP address, User-Agent, etc.), so we want to reduce such surfaces as much as possible. One mechanism we already use in Shared Storage is fenced frames plus a user gesture: the destination site can't learn the low-entropy ID until the resulting fenced frame navigates (which requires a click). This at least slows down the propagation of the bits, but they can still get out there. The more bits per click, the faster a site can be tied to a globally consistent and unique identifier. This could be mitigated by ensuring that the top-level navigation from the fenced frame doesn't convey the third-party state, or that the information leakage is temporary (e.g., clearing state on the destination site, or making the visit from the fenced frame happen in incognito mode).

I agree with your concerns here, but I'm wondering if there are ways we could make the underlying data k-anonymous, so that even if the entire contents are leaked, they still don't produce identifying information. Put another way, the site shouldn't be able to use the group ID by itself to derive a globally consistent and unique identifier, because the group ID should never be a unique identifier in the first place. I acknowledge that this is challenging given that the number of group IDs required for k-anonymity is expected to differ based on each third party's user base, and it will need more thought.
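
As a back-of-envelope illustration of that sizing problem (the numbers and the uniform-assignment model are mine, purely for illustration):

```js
// Rough sizing: with uniformly assigned buckets, each bucket covers roughly
// population / bucketCount users, so keeping every bucket k-anonymous means
// bucketCount <= population / k.
function maxBuckets(population, k) {
  return Math.max(1, Math.floor(population / k));
}

function usableBits(population, k) {
  return Math.floor(Math.log2(maxBuckets(population, k)));
}

// e.g. a third party with 10M users and k = 10,000 gets at most 1,000 buckets
// (~9 usable bits), while one with 50k users gets only 5 buckets (~2 bits).
console.log(usableBits(10_000_000, 10_000)); // 9
console.log(usableBits(50_000, 10_000));     // 2
```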

The risk of combining with other kinds of data, as mentioned in B), is also a concern I share. Do you have a sense for what maximum level of granularity we need to enforce through this API? E.g. IP address is sometimes a very granular signal, so perhaps group IDs wouldn't be compatible with requests that carry a full IP address, but if something similar to Project Parakeet were to land and anonymize requests, would it then be feasible to include a group ID in the list of data passed to the servers? If yes, would this be feasible at the scale I mentioned, and are there other considerations or trade-offs we should balance?

> Another thing that's different between your proposal and Shared Storage is that Shared Storage assumes the cross-site information used to make a decision could be different each time. That is, each time information is leaked, we have to assume that it's new information. But your use case requires the same bits to be provided each time. There is room for a budgeting optimization here in Shared Storage: we could avoid counting the leakage of the same information twice. That would take some careful thought for how we might do it in an efficient and easy-to-use way.

Yes, unlike the original Shared Storage proposal, we don't necessarily need to allow for custom experiment diversion conditions. Supporting arbitrary diversion conditions is certainly a nice-to-have, but at least for server-side-only experiments, it's actually better if the experiment doesn't divert on any info that the server doesn't already have. We don't want to incentivize people to build permanent behaviour into experiments. As an example, if the Shared Storage API allowed for completely arbitrary diversion conditions such as likesPizza, and the server didn't know the user's food preferences, some developers might be tempted to create a permanent 100% experiment that diverts on the likesPizza condition in Chrome, which is bad for code health and readability, adds questionable dependencies and reliability risks, and arguably introduces privacy risks. Perhaps some specific low-entropy diversion conditions that we'd expect the server to already know, like geo, could be allowed as a convenience, but these are not a requirement.

I think only providing the same bits each time is actually a benefit of this approach in terms of privacy - there is less cross-site information available to leak, so less risk of it becoming identifying.

> Finally, I think the existing output gate that you might use for this (runURLSelectionOperation) isn't a perfect fit. You don't actually want to supply 256 URLs; it'd be simpler if you just provided a URL template (e.g., "https://example.com/ad_requests?user_low_entropy_id=%d") and Shared Storage could write in a few bits. We'd still need to make sure that the output URL is k-anonymous, but there is no privacy/security difference between that and runURLSelectionOperation as it stands. This mostly seems like an ergonomics issue and is a lower concern than the above.

Yes, you're correct - having Shared Storage write a few bits into an existing URL template should meet our needs.

jkarlin commented 2 years ago

I've been discussing this internally with the team. Basically, I don't think there is much we can do to support pre-auction use cases, as they require significant amounts of entropy. For A/B testing, we feel that we can adequately support lift experiments, frequency capping, and ad style experiments post-auction with ~8-16 URLs. What kinds of experiments would not be supported with post-auction filtering?
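
For context, the post-auction selection being described would look roughly like this on the worklet side, pairing with the embedder sketch earlier in the thread. It is written against the later shipped selectURL worklet surface, and the "experiment_arm" key and 8-way split are hypothetical.

```js
// experiment-worklet.js (worklet side), sketched against the shipped
// selectURL() worklet surface; key names and arm count are illustrative.
class PickExperimentArm {
  async run(urls, data) {
    // Read (or lazily assign) a sticky arm so the user sees a consistent
    // treatment; with 8 candidate URLs this is a 3-bit, post-auction split.
    let arm = await sharedStorage.get("experiment_arm");
    if (arm === undefined) {
      arm = String(Math.floor(Math.random() * urls.length));
      await sharedStorage.set("experiment_arm", arm);
    }
    return Number(arm) % urls.length;
  }
}
register("pick-experiment-arm", PickExperimentArm);
```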

rwiens commented 2 years ago

Thanks for the reply! We'll take some time this quarter to put together a more concrete and comprehensive list of use cases and get back to you on that.

rwiens commented 2 years ago

Hi! I've spent some more time discussing with people on our side, and for my team's use cases we decided to deprioritize cross-site user consistency on experiments so I'll be decreasing my involvement on this particular proposal. If your team finds any new scalable solutions in this space in the future, I'd still be interested so feel free to keep us updated. Thank you for the excellent discussion so far!

I know a couple other teams such as jcma's are still strongly interested in this topic, so I'll leave this conversation up to them to continue as they see fit.

pythagoraskitty commented 1 year ago

Closing this issue for now, but please re-open (or open a new issue) if you have further questions. Thanks.