WICG / attribution-reporting-api

Attribution Reporting API
https://wicg.github.io/attribution-reporting-api/

Alternative idea for a discussion #15

Closed victr closed 1 year ago

victr commented 4 years ago

I'd like to suggest a different point of view based on a few different fundamental principles.

  1. Many advertisers are ready to give up user identity altogether and only need correct event attribution. However, there are too many variations of how attribution happens, on both the technical and business levels. Advertisers cannot and should not be constrained by limitations other than not using identity. This also means staying as simple as possible and introducing as few new specifications as possible.
  2. The main privacy goal of the API is to make linking identity between two different top-level sites difficult. I think the goal should be fundamentally different and not mixed up with identity protection. The goal of the API should be solely to provide an attribution procedure, while one of the implementation requirements is to not compromise or interfere with browser-specific identity-protection policies and practices. In other words, the attribution API should not solve any identity-focused cases which already exist and can be used (or misused, depending on your viewpoint) regardless of the API; that is a task for other components. As long as it does not introduce new cases, it should be allowed to do anything required for the attribution process.
  3. Instead of reducing the 'entropy' and the number of permutations in the data passed along with the attributed events, the policy should be focused on increasing it, i.e. making identification very complicated because the data is too diverse.
  4. Unlike user identity, attribution information can be freely shared via s2s calls between vendors. It is beyond browser capabilities (or responsibilities) to interfere with this process. On the other hand, providing a simpler, working alternative makes it possible to remain an active part of the process and introduce privacy-focused features that present users with options to influence it. That is, the attribution API should not be a police force imposed upon vendors but a policy welcomed and appreciated, because the only path to mass adoption is not punishing legal and legitimate vendors for crimes they have not committed.

From the practical side, I suggest entertaining the following scenario.

Step 1. Publisher.com creates a friendly frame with a special attribute to inform the browser that it should be handled differently, and loads the SSP framework into it: <iframe allowattribution="true" src="https://ssp.com/serve/foo/bar">. Such a frame becomes stateless and does not provide access to cookies, localStorage or any other data storage, regardless of the origin of the resources. For the time being, let's assume this means the ad vendors are not able to establish or persist the user identity. (I know that strictly speaking this is not quite true; it's just a different point of the discussion.)

Step 2. Instead of data storage, the scripts in said frame can access window.transactionID, which is a random read-only token generated by the browser at the moment of the frame's creation, e.g. 3RAmJKyIdcdmfOZg9TpUTl9a9BP4gJHqwyneVdwCh4r8RiwSgnHSHznl3mKXcENDX.

Step 3. SSP, DSP and other ad vendors register the tracking end-points associated with the current ad and the type of conversion, e.g. window.trackingService.register('purchase', 'https://dsp.com/tracking?foo=bar&transaction={transactionID}');. There are no limitations on the data that can be passed in the trackers. {transactionID} is a macro that will be rendered by the browser later on.

Step 4. Ad vendors render the creative, the user clicks on it and gets redirected to advertiser.com/landing.page. The transaction ID from the original frame is now assigned to advertiser.com, but it is not accessible to the advertiser, i.e. there is no API method or any other way for any script or resource to obtain the transaction ID (or even the fact that it has been assigned). (Alternatively, the transaction can be assigned to the browser tab or session.) It can optionally be restricted to a single associated transaction, meaning attribution goes to the last click, preventing attribution fraud. It can also have an expiration period.

Step 5. The user can freely navigate between different pages on advertiser.com and at some point executes an action which needs to be tracked. At this point a script on advertiser.com (regardless of its origin) makes a simple call to the API, window.trackingService.fire('purchase'), and the previously registered trackers are fired with the {transactionID} macro rendered to the transaction ID of the current domain (tab, session). The conversion names can be limited to a fixed short list; a few hundred options should be enough for any reasonable case. It can also be amended with a limited conversion value, e.g. a single byte. Such restrictions mean there is a limit to the amount of information which can be shared between advertiser and ad vendors by means of this attribution API, but not to the way attribution providers work and execute their tasks.
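The five steps above can be sketched end-to-end. This is a rough model of the proposed semantics only; `TrackingService`, `transactionID` and the `{transactionID}` macro are the hypothetical API surface from this proposal and do not exist in any browser:

```javascript
// Rough model of the proposed, browser-internal tracking service.
// All names here are hypothetical, taken from the steps above.
class TrackingService {
  constructor(transactionID) {
    // Step 2: random read-only token minted at frame creation.
    this.transactionID = transactionID;
    this.trackers = new Map(); // conversion name -> list of URL templates
  }

  // Step 3: ad vendors register tracking end-points per conversion type.
  register(conversion, urlTemplate) {
    if (!this.trackers.has(conversion)) this.trackers.set(conversion, []);
    this.trackers.get(conversion).push(urlTemplate);
  }

  // Step 5: the advertiser fires a conversion; the browser renders the
  // {transactionID} macro and would issue the resulting requests itself.
  fire(conversion) {
    return (this.trackers.get(conversion) || []).map((t) =>
      t.replace('{transactionID}', this.transactionID)
    );
  }
}

// Usage, mirroring the calls in Steps 3 and 5 (token shortened here):
const svc = new TrackingService('tx-abc123');
svc.register('purchase', 'https://dsp.com/tracking?foo=bar&transaction={transactionID}');
console.log(svc.fire('purchase'));
// → [ 'https://dsp.com/tracking?foo=bar&transaction=tx-abc123' ]
```

The key design point is that registration and firing happen on opposite sides of the navigation, with the browser alone holding the token in between.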

Now, I am not saying that this is a bullet-proof solution against identity sharing, but I am saying that this solution does not introduce any new loopholes and the browsers can freely apply any restriction they consider necessary to prevent identity sharing, such as restricting cookies/localStorage access for any vendor at any step, preventing URL decoration, etc. There are workarounds to circumvent them, but they exist outside of the scope of the event attribution.

To keep this note short I'm skipping interesting and more complicated cases and scenarios, but I'd be happy to discuss them in depth.

csharrison commented 4 years ago

Hey victr, thanks for filing this issue and sorry for the delay in responding. Let me start with the principles:

[advertisers] cannot and should not be constrained with limitations other than not using the identity

I wish this were the case, but if we could trust advertisers and ad-tech to only use identity APIs for identity and not use other APIs to side-channel it, we wouldn't be seeing an increase in fingerprinting. All major browsers are taking a stand against APIs that provide additional tracking capabilities.

I think the goal should be fundamentally different and not mixed up with identity protection

See the point above. We can't untangle the two concepts because all APIs have the potential to be abused and we are seeing that abuse in the wild. I agree the utility goal should be closer to "provide attribution procedure" but unfortunately new API proposals have to contend with the reality that bad actors will abuse them and come built-in with privacy protections. Of course, if we specify the solution, I think we should leave lots of room for different browsers to make different privacy / utility trade-offs.

Instead of reducing the 'entropy' and the number of permutations in the data passed along with the attributed events the policy should be focused on increasing it, i.e. making the identification very complicated because of too diverse data.

Can you be clearer about how this would work, assuming the publisher and advertiser are explicitly colluding to join identity? I understand what you are hinting at but am having trouble picturing it in reality, even with your proposal.

Unlike user identity, attribution information can be freely shared with s2s calls between vendors. It is beyond browser capabilities (or responsibilities) to interfere with this process

I agree. All of our ideas to improve web privacy should be robust to s2s calls. Please let me know if you think I've missed this in the existing design. We've certainly tried to address it.

In general, I think your suggestion is really quite similar to what we have listed in our design, with a few modifications. The following modifications radically change the privacy characteristics of the API:

These changes make the API too easy to abuse, and can be used to recover full cross-site tracking. Even without removing the conversion delay, a 16 bit identifier can uniquely identify someone from a population of 2^16, which makes it possible for lower traffic sites to "join identity" with the publisher with a single conversion report. Removing the conversion delay just makes it possible to do this trick on ~all sites by using the current timestamp as an additional join key.
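To make the 2^16 figure concrete, here is a sketch of the arithmetic (the colluding-publisher logic is assumed for illustration; nothing like it is part of the API):

```javascript
// A 16-bit conversion field distinguishes 2^16 = 65,536 values.
const distinctIds = 2 ** 16;

// If a colluding publisher assigns each visitor a unique 16-bit id at
// impression time, a site with fewer visitors than that can recover the
// exact user from a single conversion report's metadata.
function canUniquelyIdentify(visitors, metadataBits) {
  return visitors <= 2 ** metadataBits;
}

console.log(distinctIds);                   // 65536
console.log(canUniquelyIdentify(5000, 16)); // true: 5000 <= 65536
console.log(canUniquelyIdentify(5000, 3));  // false: only 8 buckets
```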

As for some of the other things you suggested:

This changes some security guarantees of the API and lets potentially misbehaving script on an advertiser page log fraudulent conversions. Is the ergonomics win here really worth it? This seems like it might be difficult to do if we implement something like the idea proposed in issue #13 too.

This is reasonable as long as the impression id in the browser can ~uniquely identify a user. Otherwise the set of reporting endpoints (and their URLs, etc.) can be used to generate more entropy on the publisher side which may make it difficult for privacy-sensitive browsers to adopt.

I think this is fine but I don't immediately see the benefit. Allowing the SSP control over this identifier seems to give more flexibility without sacrificing privacy.

victr commented 4 years ago

Hi csharrison

Thanks for your reply.

We can't untangle the two concepts because all APIs have the potential to be abused and we are seeing that abuse in the wild. I agree the utility goal should be closer to "provide attribution procedure" but unfortunately new API proposals have to contend with the reality that bad actors will abuse them and come built-in with privacy protections. All of our ideas to improve web privacy should be robust to s2s calls.

I think I might have failed to fully explain my idea, so let me try different words. The way I was thinking about it was: "Would it be possible to apply the same user-identification technique outside the scope of this API?" If the answer is "yes", then the technique should not influence the implementation, because preventing identification should be the task of other specs and tools. But if the answer is "no", then of course that particular idea should be replaced so as not to introduce a new vulnerability. I.e. the identity-related aspects should be treated as limiting factors and not as driving forces; otherwise there will be much less value in the API. If you think about it from this angle, the context for many features is very different.

As an example, consider the delay before firing the tracking pixels. It's true the timestamp can be used to match events, but what prevents the publisher from firing the pixel as a first party outside the scope of this API? Sites can initiate as many requests as they want, without time restrictions. I struggle to see how firing similar pixels via the API would be different. On the other hand, if the browser introduces restrictions in terms of timing, referrer, etc., they can be applied equally to requests regardless of the initiator. Or not equally; again, that is up to the module responsible for it. In either case, it should be outside the scope of attribution.

Have the browser automatically generate a click-id (transactionId) rather than the SSP. I think this is fine but I don't immediately see the benefit. Allowing the SSP control over this identifier seems to give more flexibility without sacrificing privacy.

Not necessarily, especially with header bidding, where it can be used as a unified ID for cross-SSP requests, something that has been talked about for quite a long time. Plus, the reason behind it is not convenience; it's the opportunity for cookie/storage-less conversion attribution. It's also fairly doable for most SSPs to match this external ID with internal data; many publishers already pass something similar anyway.

Instead of reducing the 'entropy' and the number of permutations in the data passed along with the attributed events the policy should be focused on increasing it, i.e. making the identification very complicated because of too diverse data. Can you be clearer about how this would work, assuming the publisher and advertiser are explicitly colluding to join identity? I understand what you are hinting at but am having trouble picturing it in reality, even with your proposal.

The idea here is to use random high-entropy values whenever a data exchange is initiated by means of the attribution API. I.e. instead of limiting the number of bits, introduce an [almost] random value which can serve the purpose but is not helpful for identification. I haven't completely followed this idea through in my proposal, but the transaction ID is an example of how this concept can be applied: whenever the authority to create some data lies within this API implementation, the values should vary significantly between two page views.

In general, I think your suggestion is really quite similar to what we have listed in our design, with a few modifications.

I'm afraid I don't quite agree; I believe these two ideas are very different. Please try to consider the suggested changes from the perspective of adops, publishers and marketers. The amount of effort required to implement your current suggestions is very significant, to say the least. They'd have to keep records of the tracking domains used, agree on event structure, discard any redirects, change the way the ad is delivered and the way it's rendered, and so on; basically every technical step in the ad transaction would be affected. I can hardly think of any backwards compatibility with existing solutions, which means a huge amount of dev resources would have to be spent on several levels, each of them presumably reducing adoption of the specifications. If I were one of those people, with all the limitations and requirements, I couldn't help but think of it as a necessity I'd have to tolerate due to the lack of alternatives.

On the other hand, I've been trying to think of a substitute that can be appreciated and welcomed, which means it has to be much more flexible and suggest an approach beneficial to the parties involved. I'm not saying that my suggestion can serve as such a solution, just using it as an example to present an alternative point of view. E.g. I believe the advertisers would appreciate the possibility to simply fire conversion "purchase" without the need to tailor the process for each publisher individually.

Rather than limit conversion information to 3 bits, limit it to ~15-16 bits (an enum with a few hundred elements plus a byte for value). Noise on this metadata is eliminated. This changes some security guarantees of the API and lets potentially misbehaving script on an advertiser page log fraudulent conversions.

These are indeed flaws, along with some other points. My suggestion is merely a draft idea I've been discussing and thinking about, certainly not a full specification with weeks of work behind it, like yours. I think I'm just trying to catch your attention and present a different perspective; I'll be happy to elaborate should I succeed :) Please let me know if you're willing to explore this alternative in more detail.

csharrison commented 4 years ago

Hey Victor, thanks for clarifying. Let me respond:

I think I might have failed to fully explain my idea, so let me try in different words. The way I was thinking about it was "Would it be possible to apply the same user-identification technique outside the scope of this API?"

Yeah, I think this is our miscommunication. I understand that many identification techniques can be used apart from this API to join user identity across sites; third-party cookies are just one of them. However, we the Chrome team are invested in removing these tracking vectors long-term, including third-party cookies. Tools like conversion measurement APIs will be used to support use cases that would otherwise be lost without cross-site tracking, in a way that provides better privacy than the status quo web.

As an example consider the delay before firing the tracking pixels. It's true the timestamp can be used to match events, but what can prevent the publisher to fire it as a first-party outside the scope of this API? The sites can initiate as many requests as they want without time restrictions. I struggle to find any reason how firing similar pixels via API would be different.

I'm not sure if I understand 100% what you are saying. The reason we implement delays in the API is to disassociate events from each other. The fundamental capability we are offering is the ability to join events from across sites in a way that doesn't join user identities. If you can associate a conversion report with a conversion event, the 3 bit identifier has the chance to become a unique identifier and the API loses its privacy properties. The advertiser / publisher firing normal pixels (resource requests) does not give them this ability without other means of tracking like 3p cookies.
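The role of the delay can be made concrete with a sketch of hypothetical attacker-side bookkeeping (the log structures below are assumptions for illustration, not anything the API exposes): without a delay, a report's arrival time matches exactly one conversion event, so even a 3-bit payload rides along on a unique join.

```javascript
// Hypothetical logs at the reporting endpoint. Without a delay, a
// conversion report arrives moments after the conversion itself, so
// its timestamp alone singles out one known conversion event.
const conversionLog = [
  { userId: 'alice', convTime: 1000 },
  { userId: 'bob',   convTime: 5000 },
];

// Match a report to the conversion with the nearest timestamp; trivial
// when reports are neither delayed nor batched.
function joinReportToConversion(report, conversions) {
  return conversions.reduce((best, c) =>
    Math.abs(c.convTime - report.reportTime) <
      Math.abs(best.convTime - report.reportTime) ? c : best
  );
}

// The 3-bit impression metadata now links the publisher-side
// impression to a known user, defeating the coarse-data protection.
const report = { impressionData3bit: 5, reportTime: 5002 };
console.log(joinReportToConversion(report, conversionLog).userId); // 'bob'
```

Delaying and batching reports removes the timestamp as a usable join key, which is why the coarse identifier stays coarse.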

Not necessarily, especially with the header bidding where it can be used as a unified ID for cross-SSP requests, something that has been talked about for quite a long time. Plus the reason behind it is not convenience, it's the opportunity for cookie/storage-less conversion attribution. It's also fairly doable to match this external ID for most SSPs with internal data, many publishers already pass something similar anyway.

Thanks, I didn't know header bidding use-cases actually wanted a unified ID. It technically is compatible with the existing API design for something like prebid.js to generate the impression id and pass it along to downstream bidders, but I understand that making this a standard reduces client complexity here. The major downside I see is that, if the ID offered is low entropy (i.e. like private click measurement), choice of this ID may end up being important on a per-SSP basis. I can totally see different SSPs having different ID allocation strategies here, just as I can see browsers having different choices as to how much entropy to include in the click id.

The idea here is to use random high-entropy values whenever the data exchange is initiated by the means of the attribution API. I.e. instead of limiting the number of bits introduce an [almost] random value which can serve the purpose but is not helpful for the identification.

As long as this value is observable to web content, it can be used to join sessions together. If this value is not observable to web content, then I don't quite understand the value, because a conversion report with this transaction ID is not meaningful (you can't join it with data you know about the impression).

I'm afraid I don't quite agree and I believe these two ideas are very different. Please try to consider the suggested changes from the perspective of adops, publishers and marketers.

Sorry, I didn't mean to be dismissive, I was just trying to tease apart the differences between your proposal and this one, and I mentioned this because I wanted to go item by item enumerating the specific differences I could spot.

I'm happy to dive into some of your ideas more, but we're definitely very constrained here by the privacy guarantees we want, so anything that regresses privacy will be a very hard sell.

victr commented 4 years ago

Hey Charlie. Thank you very much, I really appreciate the opportunity to have such a conversation.

1. I think we might not share exactly the same understanding of what privacy is, which is essential to agreeing on what better privacy is. I don't mean it on a philosophical level (which I'd love to discuss just out of curiosity) but as practical definitions. Is there any public document which outlines the key definitions you use in your team? Something of a very practical type, for example the scope of user identity (session/persistent/TTL, domain/origin/cross-origin), tracking data, attribution, etc. These things are often treated as self-explanatory but they're most certainly not, and I couldn't find relevant documents.

  2. Let's focus on delays; it looks like a really good showcase to spotlight the important part.

    The reason we implement delays in the API is to disassociate events from each other.

Correct me if I'm wrong, but attribution is the process of associating events with each other. A successful attribution operation means that two or more entities (e.g. publisher, ad tech, advertiser) are able to unequivocally associate an event (i.e. a piece of data) in their system/dataset with corresponding events in the others' systems/datasets. Would you agree with this definition? Could you please elaborate on what exactly you mean by 'disassociate events'?

  3. The fundamental capability we are offering is the ability to join events from across sites in a way that doesn't join user identities.

It's a very good point! From my understanding, joining identities is not the privacy issue; persisting identities is what constitutes the privacy flaw everybody's so eager to solve. Consider the trivial ad transaction: a publisher, an ad vendor (SSP, DSP, whatever) and an advertiser. For simplicity let's assume that they comply with legal frameworks and provide some sort of consent management. The publisher can surely create an identity as the first party. It doesn't really matter if it's a 1p cookie, a deterministic login, or something else; this identity can exist outside of the browser should 1p storage be denied. The next step is the publisher passing this identity to the ad vendor. In the most trivial implementation it's just a URL parameter, but if an ITP-style approach to URL decoration is adopted, it's quite easy to encrypt the ID into an encoded URL or even pass it in an s2s request. At this point the two identities, in the publisher's system and in the ad vendor's, are established and connected. Regardless of what happens next, the ad vendor will be aware of the user identity when the next interaction with the same publisher takes place (i.e. the next ad transaction). The next step is similar: the ad vendor passes the identity (as a clear URL parameter, encrypted, or via s2s request) to the advertiser. In its turn, the advertiser holds a valid interaction with the user (at the very least on the landing page) and can also create an identity as the first party. By firing a simple request (either as an HTTP resource or an s2s call) it becomes connected with the other two players. So at this point the three identities already exist in their respective datasets, each connected with the others, even before the need for attribution appears. When the next transaction happens, just one identity (e.g. on the publisher side) is enough to restore all of them, and they can easily exist outside of the browser's scope.

Not only is this the status quo, it is something that is out of the control of browser vendors. On the other hand, what I believe matters here is the possibility to persist the identity information, or to be more precise, the scope in which the persisted data is accessible. In the outgoing world of 3p cookies it was easy to have unrestricted storage and scope of access; the cookie/storage was a bridge which allowed the data to travel between parties unrestricted. Now, entertain the idea that the ad vendor is willing to give up any user-side storage privilege and not persist the identity (at least on the user side). For the time being let's even assume that the advertiser is willing to do the same. The identities exist and are connected, but the scope of application is limited to the original publisher, which is the inevitable minimum level anyway. From my understanding this is the ultimate improvement of "privacy" which can be achieved on a technical level (i.e. not through legislation): abandoning the possibility to persist the data on the client side. I also know from practice that reliable attribution can be a good enough incentive for this.

For the last step, let's assume that three independent events were created: an impression, a click and a purchase (to keep things simple). Either by means of the new attribution API or by simply firing legacy pixels, these events are going to be linked to each other, and the publisher, ad vendor and advertiser can see them in their reporting systems. The information they can extract from the timestamps does not really bring anything new to the picture, because the identities had been established and connected before the attribution even started, and having more information does not change the scope of availability of the persisted data. So in this hypothetical transaction, what exactly does the delay change? Which use case could the parties involved execute that you would consider a 'privacy violation'?

A totally genuine question; I fail to find a scenario which could be considered undesirable from the user's perspective. Looking forward to hearing your thoughts on this :)

csharrison commented 4 years ago

Is there any public document which outlines the key definitions you use in your team?

Yes, we are trying to document that, although the process is incomplete. There is some early documentation at https://github.com/michaelkleber/privacy-model but it is being superseded by the W3C PING target privacy threat model https://github.com/w3cping/privacy-threat-model. The key issue in both of these that is relevant here is the unexpected cross-site recognition of a user.

Correct me if I'm wrong but the attribution is the process of associating events with each other. The successful attribution operation means the two or more entities (e.g. publisher, ad tech, advertiser) are able to unequivocally associate an event (i.e. a piece of data) in their system/dataset with corresponding events in other's systems/dataset. Would you agree with this definition? Could you please elaborate on what exactly you mean by 'disassociate events'?

This is a really great point. For full fidelity attribution, what you describe is exactly correct (associate two separate, full fidelity events together). Unfortunately adding this capability to the web platform naively leads to an API that can be abused to perform cross-site tracking.

Our approach has been to take full fidelity attribution out of scope for a conversion measurement API and focus on coarse fidelity attribution only. This means that two events will not necessarily be associated together if they occurred cross site. Rather, the API will only present you a coarse view of the association:

In both of these cases, we try to provide useful data about attribution without giving the entire view of each specific event.

Consider the trivial ad transaction...

In your example, we've successfully been "cross-site tracked" as soon as the user visits the landing page with the URL annotated with the publisher ID and it is joined with the advertiser ID, and you are correct that this is a capability that is offered today. However, just because it is a capability today doesn't mean it will remain so, especially as it can be used for cross-site tracking. We are building these APIs envisioning a future world where cross-site tracking is greatly reduced.

Even if you do tricks to limit the scope of the stored identities, you can still pull tricks like encoding sensitive data (PII, credit card info, etc.) in the conversion event.

victr commented 4 years ago

Hi, sorry for the slight delay.

I'm really puzzled at this point.

  1. Why do you treat all cross-site tracking equally? The case where the sites are parties in the same transaction (i.e. when the user intentionally clicked a link and navigated to a different site) is entirely different from cases where the relationship between sites is established some other way, such as a shared monetization service or belonging to the same publisher. Don't you think a more granular approach would be better suited?
  2. To the best of my knowledge, full fidelity is the only attribution marketers care about. With the limitations you're suggesting, the API won't be able to provide enough granularity to cover even medium-sized campaigns, and I mean just basic attribution of clicks to placements. Things like dynamic creatives, retargeting or any reasonable optimization are out of the question completely. On the other hand, it's more than feasible to execute normal attribution via s2s requests without any external manifestation in the browser. The only value browsers added to attribution is simplicity and distributed client-side storage. As long as this option is no longer available, there's a very good incentive to finally move this process completely out of the browsers' scope, hopefully even to establish an industry-wide standard, which is long overdue. So if I may ask, why do you think ad vendors will choose to invest in a very restricted attribution API instead of an independent solution which can finally get rid of the unwanted intermediary?
csharrison commented 4 years ago

  1. I'm not saying we do treat all cross-site tracking equally. You can view the existing explainer in the README as a "coarse" way to do cross site tracking that is only applicable to clicks. Supporting conversions attributed to views probably deserves more privacy since the user isn't navigating directly from publisher --> advertiser in that flow.

  2. We hope to come up with solutions that work for many use cases, and we are currently exploring the aggregate variant of this API to satisfy many of them. I'm not sure precisely what requirements things like dynamic creatives have; can you elaborate? For things like remarketing, we are exploring the TURTLEDOVE explainer, which recovers some of that functionality.

Additionally with respect to attribution using s2s calls, we hope to clamp down on some of the techniques to covertly track the user (like fingerprinting) that allow that possibility.

victr commented 4 years ago

The dynamic creatives case differs only in the number of ads for which the ad vendor needs to attribute clicks/conversions; the scale is up to thousands within one campaign. They don't necessarily carry PII, just a huge variety of small changes in the creatives in order to find the most efficient one.

s2s tracking does not require any fingerprinting; the sessions on different sites are deterministically connected because they belong to the same transaction. You'd have to either interrupt 302 redirects and actively modify the URLs to break the chain, or censor first-party storage even harder than ITP does. I hope the first is too crazy to be realistically considered, and the second puts the browser in a defensive position, and we're back to a strong-arming marathon.

Needless to say, I do not agree with your original assumption that full fidelity conversion tracking is inherently wrong and should be eliminated, neither as a consumer nor as a representative of an ad vendor. I do hope, however, that the two sides can find common ground to cooperate. Unlike other problems, attribution can and should be solved, but I'm afraid I do not think setting so many limits that efficient marketing becomes impossible is the way forward.

csharrison commented 1 year ago

I'm closing out this issue for now, since I think your concerns are fundamentally at odds with this proposal's privacy goals.