WICG / webpackage

Web packaging format

Avoiding Built-In Tracking in Signed Packages #422

Open johnwilander opened 5 years ago

johnwilander commented 5 years ago

Hi! John Wilander from Apple's WebKit team here.

We are concerned with the privacy implications of User A and User B not getting the same package when they load the same webpage, or, to put it another way, of personalized signed packages with cross-site tracking built in.

Threat Model

The Actors

  1. The user, who browses the web with a privacy-protecting user agent.
  2. News, a publisher of articles on news.example.
  3. AdTech, an advertising and tracking company operating adtech.example.

The Threat

  1. The user does not want AdTech to be able to augment its profile of them while reading articles on news.example.
  2. The user does not want AdTech's rich profile of them to influence the content of ads or articles on news.example.

The Attack

This is how AdTech could foil the user agent's privacy protections with the current signed packages proposal:

News wants to take part in signed package loading but thinks the actual packaging is cumbersome and costly in terms of engineering resources.

AdTech has a financial incentive to help News get going with signed packages because the technology makes AdTech's services better. Because of this incentive, AdTech decides to offer News a more convenient way to do the packaging; it offers to pull unsigned articles directly from News's servers and do packaging for them. News just has to set up a signing service that AdTech can call to get signatures back, or just hand a signing key straight to AdTech. News sees the opportunity to reduce cost and takes the offer.

AdTech also has a financial incentive to identify the user on news.example to augment its profile of the user and to earn extra money by serving the user individually targeted ads, but it can't do so because the user's user agent is protecting the user's privacy. However, the request to get a signed News package is actually made to adtech.example, and it contains the user's AdTech cookies. To achieve its goals and earn more money, AdTech decides to create news.example packages on the fly, bake in individually targeted ads plus an AdTech user ID for profile enrichment, and sign the whole thing with News's key.

This is a case of cross-site tracking. The user is on a news.example webpage, convinced that their user agent protects them from AdTech tracking them on this site, but instead they get a signed package with tracking built in.

How the Attack Relates To Other Means of Cross-Site Tracking

Often when we criticize new, technically distinct tracking vectors, we are told that “you can track users in so many ways, so why care about this one?” In the case of signed packages we hear about means of tracking such as doctored links where cross-site tracking is built into the URL, or server-side exchanges of personally identifiable information such as users' email addresses.

First, we don't think past mistakes and flaws in web technologies are a valid argument for why new web technologies should enable cross-site tracking.

Second, WebKit is working hard to prevent cross-site tracking, including new limits and restrictions on old technologies. Piling on more such work is not acceptable to us.

Finally, the success of new web technologies such as signed packages relies on better security and privacy guarantees than what we've had in the past. We want progression in this space, not the status quo.

Potential Mitigations and Fixes

A mitigation we'd like to discuss is this:

  1. The server responding with the signed package is required to send the signature up front. This is to incentivize AdTech to not sign other websites' packages on the fly.
  2. The user agent makes an ephemeral, cookie-less preflight request to news.example to get the signature and then validates the package from adtech.example against that signature.
  3. We add a signed time stamp to the package signature to prevent AdTech from telling News to get signatures from an adtech.example backend and send personalized signatures back as preflight responses. With such time stamps, the user agent can decide not to accept signatures younger than, say, one minute. For this to work we need signed, official time.

The above scheme would make it significantly harder to “personalize” packages.
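A minimal sketch of how a user agent might implement these three checks, assuming a parsed package object and a hypothetical well-known endpoint for the up-front signature (all names are illustrative, not part of the proposal):

```typescript
// Hypothetical UA-side check for the scheme above. The well-known URL,
// field names, and string comparison of signatures are all assumptions.
const MIN_SIGNATURE_AGE_MS = 60_000; // reject signatures younger than one minute

interface SignedPackage {
  claimedOrigin: string; // e.g. "https://news.example"
  signatureB64: string;  // publisher signature over the package contents
  signedAtMs: number;    // signed timestamp from a trusted time source
}

async function packagePassesChecks(pkg: SignedPackage): Promise<boolean> {
  // Step 2: ephemeral, cookie-less preflight to the claimed origin.
  const preflight = await fetch(
    `${pkg.claimedOrigin}/.well-known/package-signature`, // hypothetical endpoint
    { credentials: "omit", cache: "no-store" },
  );
  const expectedB64 = (await preflight.text()).trim();

  // Steps 1 and 2: the package served by adtech.example must carry exactly
  // the signature that news.example published up front.
  if (expectedB64 !== pkg.signatureB64) return false;

  // Step 3: refuse signatures minted on the fly for this very request.
  return Date.now() - pkg.signedAtMs >= MIN_SIGNATURE_AGE_MS;
}
```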

Another potential mitigation would be some kind of public repository of signatures to check against.

sleevi commented 5 years ago

Thanks for filing this, John!

I had one quick question regarding the original proposal - the suggestion of a signed timestamp seems to introduce a trusted third party to the negotiation, the timeserver. Do you have any sense or thought as to who would operate such a timeserver, or how a UA would select such a thing, as it seems like it could either lie (if using a simple time signing protocol) or collude with adtech.example (if using a more robust protocol, like Roughtime)?

kevinsimper commented 5 years ago

@sleevi Trusted time could be provided by a blockchain, which would provide something that can't be tampered with.

@johnwilander Could content-addressable storage be regarded as a potential solution as well? With that solution you could verify via a third party that you didn't get something customized, as your copy would be different from everyone else's.

sleevi commented 5 years ago

@sleevi Trusted time could be provided by a blockchain, which would provide something that can't be tampered with.

Thanks for the reply!

I think it would probably be more productive if we avoid abstract technology hypotheticals, and instead focus on concrete or actual solutions. The problem with abstractions is that they largely tend to punt the problem being discussed onto the abstraction, rather than providing a solution themselves. That is, “imagine if we had a perfect X that didn’t have problem Y” doesn’t quite solve for Y, and now we also have to solve for X to find an actual X with that property 😅

My previous question acknowledges the possibility of the use of Merkle Trees as a basis for time, by focusing on an actual time protocol (that’s what Roughtime is), and then discusses actual challenges with it that would still exist, as a way of trying to better understand the actual requirements. Collusion by adtech.example is an (extreme) possibility, and thus it seemed important to understand the requirements here, since it seemed like there might be some unstated requirements hidden between that last bullet point. 🤔

@johnwilander Could content-addressable storage be regarded as a potential solution as well? With that solution you could verify via a third party that you didn't get something customized, as your copy would be different from everyone else's.

Could you explain how you see this working? Content-addressable storage doesn't actually provide the guarantee you stated, at least not as CAS is commonly understood to work. Indeed, one can view the existing SXG proposal as functionally CAS with an attached signature.

If you mean something like a peer-to-peer distribution network, using a DHT or the like, none of the existing technologies seem to provide that guarantee. Understanding a bit more about what is meant by this suggestion would help clarify which properties you see it providing.

If the suggestion is to use a Trusted Third Party and report the hash you see, that of course comes with serious privacy concerns for the end user - it adds yet another way to see what the user is doing. It also introduces a centralized censorship mechanism, by coercing the TTP to lie about whether it has seen a package, and thus preventing it from loading. However, one doesn’t typically think of a TTP as being CAS.

This is why I focused on trying to understand the proposal itself first, to make sure we don’t rabbit hole on such challenges until we’re all on the same page with base understanding 😃

johnwilander commented 5 years ago

Thanks for filing this, John! I had one quick question regarding the original proposal - the suggestion of a signed timestamp seems to introduce a trusted third party to the negotiation, the timeserver. Do you have any sense or thought as to who would operate such a timeserver, or how a UA would select such a thing, as it seems like it could either lie (if using a simple time signing protocol) or collude with adtech.example (if using a more robust protocol, like Roughtime)?

Hi Ryan!

Signed, trusted time is a Hard Thing, at least it was the last time I dug into it. It even plays into human culture, where citizens of some countries would trust the government to issue such timestamps while others would rather have an independent non-profit do it.

I do not have a ready solution. But there seem to be a few interested parties who want these signed exchanges, Google and Cloudflare being two. Maybe these parties can propose a solution that we can review? Even if we don't achieve a perfect solution, something transparent and explicitly designed to prohibit abuse may be enough to instill (more) trust in this technology.

There is at least one more benefit of signed, trusted timestamps in these packages and that is the ability to audit when content was created. A temporary compromise of News's publishing apparatus could issue fake news and then push that news to a micro targeted audience to sway public opinion "dark ads"-style. Trustworthy timestamps in packages would at least allow for an audit after the fact. Or if abuse gets really ugly, user agents could support things like "News was compromised between TimeA and TimeB and doesn't know what was published and signed under its name during that time. Therefore all News packages signed between TimeA and TimeB are blocked."
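A sketch of what such blocking could look like, assuming the user agent learns of compromise windows out of band (the types and names below are hypothetical):

```typescript
// Hypothetical sketch of blocking packages signed during a known compromise
// window ("all News packages signed between TimeA and TimeB are blocked").
interface CompromiseWindow {
  origin: string; // e.g. "https://news.example"
  fromMs: number; // TimeA, as milliseconds since the epoch
  toMs: number;   // TimeB
}

function signedDuringCompromise(
  origin: string,
  signedAtMs: number,
  windows: CompromiseWindow[],
): boolean {
  return windows.some(
    (w) => w.origin === origin && signedAtMs >= w.fromMs && signedAtMs <= w.toMs,
  );
}
```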

sleevi commented 5 years ago

I do not have a ready solution. But there seem to be a few interested parties who want these signed exchanges, Google and Cloudflare being two. Maybe these parties can propose a solution that we can review? Even if we don't achieve a perfect solution, something transparent and explicitly designed to prohibit abuse may be enough to instill (more) trust in this technology.

I definitely think it's something worth discussing, and I'm wanting to make sure to tease out the requirements a bit more up front, so we can find something workable.

You mentioned trusted time, which evokes protocols like Roughtime (which, incidentally, Cloudflare also supports). However, the 'trusted' part of that time is achieved by having the Roughtime client send a random 'nonce', and that doesn't seem like a good fit here, for a number of reasons.

From the threat model described, my understanding of the suggested mitigation is that you're talking more about a Time-Stamping Authority - some third party (or set of third parties) that attests that, at a given time, it was aware of a given hash. Does that sound roughly correct?
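For concreteness, a time-stamp token of that sort might bind together a hash, a time, and a TSA signature. A minimal sketch, with all names hypothetical and real signature verification abstracted away:

```typescript
// Hypothetical shape of a TSA attestation: the TSA signs (hash, time) pairs.
interface TimeStampToken {
  packageHash: string;  // hex digest of the signed package
  observedAtMs: number; // when the TSA saw this hash, ms since the epoch
  tsaSignature: string; // TSA's signature over (packageHash, observedAtMs)
}

// verifyTsaSignature is an injected stand-in for real verification against
// a UA-trusted TSA key (e.g. via WebCrypto).
function timestampAttests(
  token: TimeStampToken,
  actualPackageHash: string,
  verifyTsaSignature: (t: TimeStampToken) => boolean,
): boolean {
  return token.packageHash === actualPackageHash && verifyTsaSignature(token);
}
```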

Typically, these sorts of approaches imply direct trust in the TSA to always be honest. I was trying to understand how much or how little of your threat model included the TSA as a bad actor - for example, understanding whether or not the threat includes adtech.example colluding with (or operating their own) time-stamping authority. If the threat model is considering this, what sort of mitigations would be seen as acceptable versus unacceptable, to help inform possible solutions?

For example, if the idea is Apple (or other UAs) would select a TSA and explicitly trust it, say, using business controls like audits - an approach Mozilla is taking with their selection of trusted recursive resolvers - then there are simpler options with very little technical complexity, because it's addressed by the business controls. However, if the idea is that there should be zero trust in the trusted time server, except that which can be proved mathematically, then that would require much more complex solutions, which haven't yet been solved for related areas.

There is at least one more benefit of signed, trusted timestamps in these packages and that is the ability to audit when content was created. A temporary compromise of News's publishing apparatus could issue fake news and then push that news to a micro targeted audience to sway public opinion "dark ads"-style. Trustworthy timestamps in packages would at least allow for an audit after the fact. Or if abuse gets really ugly, user agents could support things like "News was compromised between TimeA and TimeB and doesn't know what was published and signed under its name during that time. Therefore all News packages signed between TimeA and TimeB are blocked."

I think this would be best discussed in a separate issue, and understanding the use case more. It seems that there are several use cases mixed up in here, such as repudiation (or revocation) and transparency. Given that almost every technical solution to these sorts of use cases introduces negative effects, they're likely topics in themselves, and worth tracking as such. For example, repudiation/revocation (the compromise scenario described) has commonly enabled greater centralization and censorship, and the transparency aspect comes at significant cost to user privacy (the ability to say "I know you read/published targeted article X").

I don't want to lose sight of these, but also don't want to miss out on the big picture here, so if you have a write-up for these use cases and could file them as new issues, I think we'd be happy to engage. I'm not sure I understand your specific goals there well enough to do it myself :)

nminnov commented 5 years ago

I definitely think it's something worth discussing, and I'm wanting to make sure to tease out the requirements a bit more up front, so we can find something workable.

Agree that the requirements need to be understood and appreciated before discussing the "how". Would you agree that:

johnwilander commented 5 years ago

I do not have a ready solution. But there seem to be a few interested parties who want these signed exchanges, Google and Cloudflare being two. Maybe these parties can propose a solution that we can review? Even if we don't achieve a perfect solution, something transparent and explicitly designed to prohibit abuse may be enough to instill (more) trust in this technology.

I definitely think it's something worth discussing, and I'm wanting to make sure to tease out the requirements a bit more up front, so we can find something workable.

You mentioned trusted time, which evokes protocols like Roughtime (which, incidentally, Cloudflare also supports). However, the 'trusted' part of that time is achieved by having the Roughtime client send a random 'nonce', and that doesn't seem like a good fit here, for a number of reasons.

From the threat model described, my understanding of the suggested mitigation is that you're talking more about a Time-Stamping Authority - some third party (or set of third parties) that attests that, at a given time, it was aware of a given hash. Does that sound roughly correct?

Yes.

Typically, these sorts of approaches imply direct trust in the TSA to always be honest. I was trying to understand how much or how little of your threat model included the TSA as a bad actor - for example, understanding whether or not the threat includes adtech.example colluding with (or operating their own) time-stamping authority. If the threat model is considering this, what sort of mitigations would be seen as acceptable versus unacceptable, to help inform possible solutions?

AdTech operating the TSA sounds problematic. But a shared TSA, funded/controlled/audited by multiple stakeholders, could probably work. Also, transparency will work in our favor here. It should be easy to check the integrity of the TSA, not just for UAs but for anyone.

For example, if the idea is Apple (or other UAs) would select a TSA and explicitly trust it, say, using business controls like audits - an approach Mozilla is taking with their selection of trusted recursive resolvers - then there are simpler options with very little technical complexity, because it's addressed by the business controls. However, if the idea is that there should be zero trust in the trusted time server, except that which can be proved mathematically, then that would require much more complex solutions, which haven't yet been solved for related areas.

Having not discussed the TSA issue in detail with my team, I'd say zero trust is not a must to get something on the table for serious review.

There is at least one more benefit of signed, trusted timestamps in these packages and that is the ability to audit when content was created. A temporary compromise of News's publishing apparatus could issue fake news and then push that news to a micro targeted audience to sway public opinion "dark ads"-style. Trustworthy timestamps in packages would at least allow for an audit after the fact. Or if abuse gets really ugly, user agents could support things like "News was compromised between TimeA and TimeB and doesn't know what was published and signed under its name during that time. Therefore all News packages signed between TimeA and TimeB are blocked."

I think this would be best discussed in a separate issue, and understanding the use case more. It seems that there are several use cases mixed up in here, such as repudiation (or revocation) and transparency. Given that almost every technical solution to these sorts of use cases introduces negative effects, they're likely topics in themselves, and worth tracking as such. For example, repudiation/revocation (the compromise scenario described) has commonly enabled greater centralization and censorship, and the transparency aspect comes at significant cost to user privacy (the ability to say "I know you read/published targeted article X").

I hesitated bringing up the auditing+dark ads case because, as you say, it's a separate issue. I just wanted to mention it here to make it clear that trusted time stamps might have other benefits too.

I don't want to lose sight of these, but also don't want to miss out on the big picture here, so if you have a write-up for these use cases and could file them as new issues, I think we'd be happy to engage. I'm not sure I understand your specific goals there well enough to do it myself :)

I'll hold off for now to make sure that the cycles I have to spare are spent on this issue here. :)

frivoal commented 5 years ago

Unless I am missing something, this boils down to "if you hand someone your private keys, they can impersonate you while doing things you wouldn't". Right?

If I understand correctly, the attacks that this enables seem already possible when handing your private keys to a CDN so that it can do https on your behalf.

As you said in the initial post, just because a similar attack already exists doesn't mean we shouldn't do anything about it. So I am absolutely in favor of mitigating this if we can.

However, I think it is worth considering what happens if we cannot. On balance, it seems to me that this might still be an overall improvement to security, because of the https/CDN case.

The attack described here is possible when news.example chooses to let adtech.example do the crypto on its behalf. But it can (and should) do that itself. In the https/CDN case, by contrast, news.example has no choice: if it wants cdn.example to do https on the unchanged URLs, it has to hand over its private keys. With signed exchanges, however, it becomes possible for news.example to sign its content itself, and have the signed package be delivered via CDNs without revealing its private keys to anyone.
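A sketch of the contrast between the two key-handling models described above; the CdnApi interface and signExchange helper are purely illustrative:

```typescript
type PrivateKey = string; // stand-in for real key material

interface CdnApi {
  installTlsKey(key: PrivateKey): void; // the CDN can now impersonate the origin
  upload(artifact: Uint8Array): void;   // the CDN only stores opaque bytes
}

// HTTPS via CDN: serving https://news.example/* from the CDN means handing over the key.
function serveViaHttpsCdn(cdn: CdnApi, newsKey: PrivateKey): void {
  cdn.installTlsKey(newsKey);
}

// SXG via CDN: signing happens on news.example's own servers; only signed bytes leave.
function serveViaSxgCdn(
  cdn: CdnApi,
  newsKey: PrivateKey,
  content: Uint8Array,
  signExchange: (c: Uint8Array, k: PrivateKey) => Uint8Array, // hypothetical signer
): void {
  cdn.upload(signExchange(content, newsKey)); // no key handed over
}
```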

Unless I am misunderstanding this, this means that while the introduction of signed packages may make it tempting to "do the wrong thing" (share your private keys) in more cases, it also makes it possible to do the right thing (do all the signing yourself) in cases where it previously was not. Whether that's a net positive probably depends on how strong the temptation is (i.e. how easy it is to sign packages yourself, how much adtech.example will pay you to let it do the signing for you, etc.).

johnwilander commented 5 years ago

Unless I am missing something, this boils down to "if you hand someone your private keys, they can impersonate you while doing things you wouldn't". Right?

If I understand correctly, the attacks that this enables seem already possible when handing your private keys to a CDN so that it can do https on your behalf.

Actually, that is not the case.

Going back to the threat:

  1. The user does not want AdTech to be able to augment its profile of them while reading articles on news.example.
  2. The user does not want AdTech's rich profile of them to influence the content of ads or articles on news.example.

In the case of News handing AdTech a private key to do CDN things from a *.news.example subdomain, the user agent will send news.example's cookies in requests for articles (and possibly in requests to the CDN subdomain). This allows the user agent to protect the user's privacy by blocking adtech.example from accessing its cookies as a third-party resource on a news.example page.

In the case of a signed package loaded from adtech.example, the user agent will send adtech.example's cookies in the request, which allows AdTech to leverage its rich profile of the user to "personalize" content and ads, as well as to plant an AdTech user ID in the package for use in third-party requests that enrich its profile of the user.

As you said in the initial post, just because a similar attack already exists doesn't mean we shouldn't do anything about it. So I am absolutely in favor of mitigating this if we can.

However, I think it is worth considering what happens if we cannot. On balance, it seems to me that this might still be an overall improvement to security, because of the https/CDN case.

The attack described here is possible when news.example chooses to let adtech.example do the crypto on its behalf. But it can (and should) do that itself. In the https/CDN case, by contrast, news.example has no choice: if it wants cdn.example to do https on the unchanged URLs, it has to hand over its private keys. With signed exchanges, however, it becomes possible for news.example to sign its content itself, and have the signed package be delivered via CDNs without revealing its private keys to anyone.

Unless I am misunderstanding this, this means that while the introduction of signed packages may make it tempting to "do the wrong thing" (share your private keys) in more cases, it also makes it possible to do the right thing (do all the signing yourself) in cases where it previously was not. Whether that's a net positive probably depends on how strong the temptation is (i.e. how easy it is to sign packages yourself, how much adtech.example will pay you to let it do the signing for you, etc.).

Given my explanation above, I'll let you revisit your analysis before commenting further.

cramforce commented 5 years ago

Question: the same planting of IDs can always be done via link augmentation. My understanding is that Safari is trying to protect against that by blocking query/fragment on cross-origin navigation.

Could an SXG navigation be made equivalent to a cross-origin navigation by saying: the UA will only render the SXG if

  - the original request was cookieless
  - it was a GET request
  - it has no query string or fragment
  - the path of the SXG request is the same as the path on the target domain
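A sketch of those conditions as a gating check (the request shape below is illustrative, not any UA's actual internals):

```typescript
// Hypothetical gating check for the conditions listed above.
interface OuterRequest {
  method: string;      // HTTP method of the original request
  url: URL;            // URL the SXG was fetched from
  hadCookies: boolean; // whether any cookies were attached
}

function mayRenderSxg(req: OuterRequest, sxgInnerUrl: URL): boolean {
  return (
    !req.hadCookies &&                        // original request was cookieless
    req.method === "GET" &&                   // was a GET request
    req.url.search === "" &&                  // no query string
    req.url.hash === "" &&                    // no fragment
    req.url.pathname === sxgInnerUrl.pathname // same path as on the target domain
  );
}
```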


frivoal commented 5 years ago

Given my explanation above, I'll let you revisit your analysis before commenting further.

Thanks, I had indeed missed that key distinction. Revising what I said earlier, my understanding is now:

You were focused on the second thing, while I was on the first.

That said, I wonder if the ability to modify the page before serving it and to inject arbitrary stuff along the way does not enable the malicious CDN to get back the same information with additional network requests from the page once it is loaded. Maybe blocking third-party cookies effectively prevents this, but I don't feel overly confident. Once you hand your private keys to a third party, it seems hard to limit what they can do.

johnwilander commented 5 years ago

Hi Malte!

Question: the same planting of IDs can always be done via link augmentation. My understanding is that Safari is trying to protect against that by blocking query/fragment on cross-origin navigation. Could an SXG navigation be made equivalent to a cross-origin navigation by saying: the UA will only render the SXG if

  - the original request was cookieless
  - it was a GET request
  - it has no query string or fragment
  - the path of the SXG request is the same as the path on the target domain

I avoided bringing this up so as not to inflate my original description and take focus off the particular issue with built-in tracking. What you mention are additional things we'll have to do to protect signed packages, but they apply to arbitrary navigations that start on AdTech's site.

jyasskin commented 5 years ago

It might take me until Wednesday, but I'd like to check this threat model into the repository as a description of the anti-tracking requirements that at least Apple wants on the design. I'm then going to add the other attacker abilities and constraints that I think I've seen in the Twitter discussion and comments here, along with the attacker goals that we want the design to frustrate.

I think it'll be more productive to get agreement on a full understanding of the requirements before we look for solutions or try to knock over the solutions that have already been proposed.

kevinsimper commented 5 years ago

@johnwilander

But a shared TSA, funded/controlled/audited by multiple stakeholders, could probably work. Also, transparency will work in our favor here. It should be easy to check the integrity of the TSA, not just for UAs but for anyone.

This sounds exactly like a blockchain (and I don't work for or have investments in blockchains). It can be transparent, it is easy to check the integrity, and it has many stakeholders. I know it's overused and many have burned out on the concept, but it is a valid technology. There is even ongoing work on "verifiable delay functions".

@sleevi

Collusion by adtech.example is an (extreme) possibility, and thus it seemed important to understand the requirements here, since it seemed like there might be some unstated requirements hidden between that last bullet point. 🤔

Yeah, and I was also hesitant to suggest it, as it is a misused concept for a lot of stuff. I didn't suggest any particular blockchain. Your Merkle Tree reference gives me clues that you have considered it 👍


If the requirement is to keep it simple, a solution could also be some kind of proof of work, which would make it very expensive to create a package on the fly but cheap to do once. This also avoids contacting anybody else, as the proof can be verified easily by the user agent.
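A hashcash-style sketch of that idea, with an illustrative difficulty; Node's built-in crypto module stands in for whatever hash the user agent would actually use:

```typescript
import { createHash } from "node:crypto";

const DIFFICULTY_BITS = 20; // illustrative; tune so minting is slow, verifying fast

// Count leading zero bits of SHA-256(data).
function leadingZeroBits(data: string): number {
  const digest = createHash("sha256").update(data).digest();
  let zeros = 0;
  for (const byte of digest) {
    if (byte === 0) { zeros += 8; continue; }
    zeros += Math.clz32(byte) - 24; // leading zeros within this byte
    break;
  }
  return zeros;
}

// Expensive, done once when the package is created.
function mintProof(packageHash: string): number {
  for (let nonce = 0; ; nonce++) {
    if (leadingZeroBits(`${packageHash}:${nonce}`) >= DIFFICULTY_BITS) return nonce;
  }
}

// Cheap, done by the user agent on every load - no third party involved.
function verifyProof(packageHash: string, nonce: number): boolean {
  return leadingZeroBits(`${packageHash}:${nonce}`) >= DIFFICULTY_BITS;
}
```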

jyasskin commented 5 years ago

https://github.com/WICG/webpackage/pull/424 tries to document the threat model we're trying to handle here, along with a couple of notes on the mitigations I've seen proposed so far. How's it look?