WICG / turtledove

TURTLEDOVE
https://wicg.github.io/turtledove/

Privacy Leak via Browser in Fledge #120

Open TheMaskMaker opened 3 years ago

TheMaskMaker commented 3 years ago

The Fledge proposal will have a huge impact on the web ecosystem. This change, ostensibly intended to protect privacy, will serve no purpose if privacy can leak via any source.

The browser itself is a major cause for concern. As currently written, the Fledge proposal offers no protections preventing the browser from collecting user data, whether for itself or for a browser-friendly player. This would create a data monopoly while rendering the privacy protections of Fledge useless.

Any browser/company implementing Fledge must commit to not ‘phone home’ with any data whatsoever that is not accessible by any other party (the publisher/website owner, third parties friendly to website owner, DSPs, SSPs).

Browsers already send home information that Fledge considers privacy-violating. This includes analytics and diagnostics of individual users, which in various browsers can send personal browsing information to a browser-owned server. What oversight prevents this data from being sold? How is the browser more trustworthy than any other player, such that it is permitted to handle this data? It also includes syncing functionality, which keeps browsing information consistent across devices for a particular user and therefore requires sending their personal, individual data to a server. The browser also has the power to hide these calls from developer tools, so that careful proxy analysis is required to detect them.

Browsers must commit specifically to not send any data, in any type of request, for any reason if another player is unable to send the same data home for the same reason. This may cause even some of their legitimate use cases to stop functioning, but other players in the industry face the same problem. If the Fledge proposal is for the benefit of users, and the rest of the industry is expected to make changes, then browsers must change as well and hold themselves to the same standards.

Additionally, the K-anonymity calls and other server-assisted calculations necessarily require some sort of processing on an off-browser server owned by the browser vendor. They therefore become a way browsers could cheat this restriction. Any data-flow system necessary to preserve privacy should at best be under open-source control, and at worst be highly auditable. Browsers must at minimum commit to a policy for handling these requests that is auditable by any other player. If a privacy concern would arise from such an audit, then logically the browser itself is already violating the user's privacy, since it must audit itself for debugging reasons anyway. There is no reason to trust the browser itself more than any other player or company. In fact, the browser is more able to violate a user's privacy (according to the Fledge notion of what constitutes a violation) than any other party: it has access to the most cross-domain information, as well as information not reachable through any web API, including and especially account or login information from browser-friendly or even third-party sources.

To ensure the browser itself does not act as a privacy leak in Fledge, we should add the following to the standard:

  1. We should add to Fledge browser restrictions on user data exactly equal to the restrictions placed on the most restricted player.

  2. We should add a method by which the industry as a whole can audit the K-anonymity functionality of browsers, to ensure no privacy leaks or monopolistic data streams exist (a sketch of what I mean follows below).
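
To make concrete what I mean by "K-anonymity functionality", here is a toy sketch of the kind of threshold check such a server performs. All names and the threshold here are illustrative on my part, not taken from any spec:

```typescript
// Toy sketch of a k-anonymity threshold check; class/method names
// and the threshold are hypothetical, not from the FLEDGE spec.

class KAnonymityServer {
  // value -> set of distinct browser ids that reported it
  private reports = new Map<string, Set<string>>();

  constructor(private readonly k: number) {}

  report(value: string, browserId: string): void {
    // In a real deployment the browser id would be protected
    // (e.g. via anonymous tokens), not sent in the clear.
    if (!this.reports.has(value)) this.reports.set(value, new Set());
    this.reports.get(value)!.add(browserId);
  }

  isKAnonymous(value: string): boolean {
    // The value may be used (e.g. the ad rendered) only once at
    // least k distinct browsers share it.
    return (this.reports.get(value)?.size ?? 0) >= this.k;
  }
}

const server = new KAnonymityServer(3); // tiny k, for demonstration only
for (let i = 0; i < 3; i++) {
  server.report("https://ads.example/creative-123", `browser-${i}`);
}
console.log(server.isKAnonymous("https://ads.example/creative-123")); // true
```

Whatever the real mechanism looks like, it is exactly this counting step, and whatever else rides along with those requests, that the industry should be able to audit.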

michaelkleber commented 3 years ago

(Hi @TheMaskMaker, could you please add your name and affiliation to your GitHub profile? To make contributions to WICG incubations, you'll need to be covered by the W3C Community Contributor License Agreement.)

The way the browser treats people's data is indeed a crucial part of privacy. Publications like the Google Chrome Privacy Whitepaper, for example, discuss in great detail what information the Chrome browser sends to any Google server.

But that isn't part of any spec, because standards describe how features of the web work, not how browsers implement them. Indeed, different browsers have a wide range of implementations, even as they expose the same standardized behavior. So statements like "Browsers must commit specifically to not send any data, in any type of request, for any reason if another player is unable to send the same data home for the same reason" don't really make sense. Consider a browser like Opera Mini, in which part of rendering happens on a server, not on the user's device: what would your statement even mean there?

For the k-anonymity parts of FLEDGE and other Privacy Sandbox APIs, please take a look at the Multi-Browser Aggregation Service Explainer. This is indeed all built on infrastructure such that everyone (browser vendors, privacy researchers, etc.) can get cryptographically solid proof that information is not leaking to unintended parties, even the parties who run the server-side components.
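
To give a feel for why such proofs are possible at all, here is a toy sketch of additive secret sharing, one basic building block behind this style of multi-helper aggregation. It is illustrative only, not the actual protocol, which involves much more (noise addition, attestation, and so on):

```typescript
// Toy sketch of additive secret sharing: each browser splits its value
// into two random shares, one per helper server. Neither helper alone
// learns anything, but the sum of both helpers' totals equals the true
// aggregate. Illustrative only; not the real aggregation protocol.

const MODULUS = 1n << 64n;

// crypto.getRandomValues is available in browsers and recent Node.
function randomShare(): bigint {
  const bytes = new Uint8Array(8);
  crypto.getRandomValues(bytes);
  let r = 0n;
  for (const b of bytes) r = (r << 8n) | BigInt(b);
  return r % MODULUS;
}

function splitIntoShares(value: bigint): [bigint, bigint] {
  const a = randomShare();                   // uniformly random
  const b = (value - a + MODULUS) % MODULUS; // completes the value
  return [a, b];
}

const browserValues = [3n, 7n, 1n, 5n]; // per-browser measurements (hypothetical)

let helperATotal = 0n;
let helperBTotal = 0n;
for (const v of browserValues) {
  const [a, b] = splitIntoShares(v);
  helperATotal = (helperATotal + a) % MODULUS;
  helperBTotal = (helperBTotal + b) % MODULUS;
}

// Each helper's running total is uniformly random on its own; only
// the combination of both totals reveals the true aggregate.
console.log((helperATotal + helperBTotal) % MODULUS); // 16n
```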

joshuakoran commented 3 years ago

This comment raises interesting points worth discussing.

If I understand the author correctly, there are (at least) three key issues being raised:

1) Whether the browser will allow for open competition or preference its parent organization's servers

2) If the browser sends a browser identifier for the various marketer and publisher use cases that require server-side processing to control, measure or optimize digital advertising, then such identifiers should be available to open market competition

3) How publishers and marketers can audit the processes occurring in the browser, to ensure that the browser neither interferes with publishers' control over the content on their own websites nor impairs marketers' control, measurement or optimization of their cross-publisher media effectiveness

Again, I am not sure I have understood the original post, but it seems to suggest the following: given the general lack of public understanding of the complex data processing involved in algorithmic matching of content online, a browser operating independently of other organizations that offer advertising solutions should not preference the sending of personal data to any particular organization's server-side processing, but should instead allow open competition to process such data.

michaelkleber commented 3 years ago

@joshuakoran The fundamental design property of the whole TURTLEDOVE family of proposals is that the browser does not send "personal data to any particular organization's server-side processing" at all!
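
For context, the explainer's own examples show how this works: interest group membership is stored inside the browser, and the auction runs on-device, executing the buyer's and seller's fetched logic locally rather than shipping the user's data to a server. A rough sketch along the lines of the explainer (all URLs and values are placeholders):

```typescript
// Rough sketch of the on-device flow from the FLEDGE explainer;
// URLs and values are placeholders. The key point: the interest
// group lives in the browser, and runAdAuction executes the fetched
// bidding/decision logic locally instead of posting user data out.

// 1. An advertiser site asks the browser to remember an interest group.
const interestGroup = {
  owner: 'https://dsp.example',
  name: 'womens-running-shoes',
  biddingLogicUrl: 'https://dsp.example/bid.js', // fetched, run on-device
  ads: [{ renderUrl: 'https://dsp.example/ad1.html' }],
};
// (Cast needed because these proposed APIs are not in standard TS types.)
(navigator as any).joinAdInterestGroup(interestGroup, 30 * 24 * 60 * 60);

// 2. Later, a publisher page runs the auction entirely in the browser.
const auctionConfig = {
  seller: 'https://ssp.example',
  decisionLogicUrl: 'https://ssp.example/decide.js', // also run on-device
  interestGroupBuyers: ['https://dsp.example'],
};
(navigator as any).runAdAuction(auctionConfig).then((winner: unknown) => {
  // The winning ad renders in an opaque frame; the page itself
  // never learns which interest group won.
  console.log('auction complete', winner);
});
```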

TheMaskMaker commented 3 years ago

Hello Michael, worry not: I am a member of WICG, and I have actually spoken to you in both meetings so far; I am the other Michael. I am happy to send you a confirmation e-mail from my registered address so there is no confusion or concern there. Since we have so little time to talk on the calls, I was wondering if I could speak with you and the community at large about this at more length over a call, though you had said to discuss via GitHub issues, so here I am.

Thank you for linking the explainers. They are very helpful, but they do not address my main point. I think Joshua above hits on some of it, but I will clarify.

In the last meeting you yourself said that certain features and calls should be restricted in Fledge because of even a small chance that some party (usually a third-party helper or DSP/SSP) could theoretically obtain a piece of private information, and that Fledge should be designed to prevent that. Forgive me, I don't recall the exact argument you made, as it was very in-depth; I think it was in relation to Nielsen's question about data companies, and before that in relation to a question about events being able to send certain event-level packets out of the sandbox. In any case, you said (forgive the heavy paraphrasing) that a sandbox leak could occur if certain event-level information were made available, and that for this reason, even if it restricted the existing use cases of various companies, it could not be allowed. Restricting event-level information is quite damaging to several companies that use it for legitimate purposes that do not violate privacy (by which I mean they comply with opt-in/opt-out requirements, or use events that have nothing to do with user information at all). I am not at the moment saying this is good or bad, merely that the philosophy of Fledge seems to be 'large change for the greater good of privacy.'

It occurred to me that applying the same logic to the browser itself reveals a much larger potential leak. I am aware that cases such as Opera Mini exist, but that is my point: they are leaks that are either being ignored or treated differently from other leaks. The critical individual user information the Fledge protocol is designed to sandbox can escape via browser calls. Even if a whitepaper goes through all of Chrome's calls and they are safe, without a standard rule preventing changes, new calls could be added later. And Chrome is just one of the browsers. Other (non-Chrome) browsers implementing the sandbox (which would naturally become the de facto standard, for integration reasons, if used in Chrome) would make different calls. The risk of a leak here is higher than from any other component we have discussed, unless I am missing something, in which case please let me know.

You said this was a discussion of the standard, not of implementation; however, I am indeed speaking of the standard here. In fact, I don't even know at the moment how this would be implemented. The standard exists to create a framework that prevents private information from leaking. How can it not address the player that has the most data and the least visible scrutiny? Is this not the very point of Fledge? (Incidentally, I rather dislike text sometimes, as tone is hard to convey; I intend this as Socratic :) )

In fact, the concept of a 'trusted server' is already slated to be part of the existing paradigm, so this is not something alien to Fledge at all. But it does not currently apply to browsers, and (as mentioned in https://github.com/WICG/turtledove/issues/114) it gives DSPs/SSPs preferential access to information.

I think this is a large flaw in the standard that we should address before origin trials if Fledge is to be successful. All players in the ecosystem should be put under the same scrutiny, or privacy leaks can occur and individual players will be shut out.

To use a rather odd example, consider the penal system. Successful jailbreaks most often occur not because of some Hollywood-style master plan, but merely because of an irregular event, such as a change in staff. Right now there is a big spotlight shining on publishers, their partners, SSPs, and DSPs, but each has access to different information, which means each could leak in different ways. Worst of all, the browser, the party with the most information, is a dark area that Fledge does not address, and information can leak there.

I hope we can discuss this further as I think this is critical to the success of fledge as a privacy measure and I don't think we have even scratched the surface yet.

TheMaskMaker commented 3 years ago

I see this issue has started to gain traction. I'm glad about that, and I also want to keep it on track. Rather than just point out a problem I'd like to suggest a solution.

  1. Browser-sent server calls should be restricted to information available to 'the most restricted player.' The exact mechanism can be fleshed out later, but I think this is a good starting point. The 'most restricted player' is likely the ad creative, or a third party on a website not authorized by the site owner. This will 100% ensure that the browser cannot be used to leak the sandbox.

However, since some server calls are needed for Fledge to work (e.g. the k-anonymity calls), we need a way to handle those as well:

  1. Fledge-necessary calls (such as the K-anonymity calls) must be open-sourced and audited to ensure honorable play. They should also be held to the 'trusted server' paradigm.

  2. I think we need a chart of the data accessible by each party. It's difficult to account for every case, but Website Owner, Third Party Authorized by the Website Owner, Third Party Not Authorized by the Website Owner, DSP, SSP, and Browser are certainly the main ones we should consider. We should take a close look at what data each player can access; a skeleton of such a chart follows below.
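
As a starting skeleton, something like the following. The column headings are only my examples of data categories we might track, and the cells are deliberately left open for the group to fill in:

| Party | Cross-site identity | First-party user data | Contextual page data | Interest-group data |
| --- | --- | --- | --- | --- |
| Website owner | ? | ? | ? | ? |
| Third party authorized by website owner | ? | ? | ? | ? |
| Third party not authorized by website owner | ? | ? | ? | ? |
| DSP | ? | ? | ? | ? |
| SSP | ? | ? | ? | ? |
| Browser | ? | ? | ? | ? |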

michaelkleber commented 3 years ago

I'm afraid I can't get behind the sort of broad, sweeping statements you're trying to make, here or in the linked PR. There is a big difference between the web platform, which is the stuff of specs and standards bodies, and the browser as an app, in which different browsers can and do have extremely different behaviors and implementations.

The browser-app does, inherently, have a larger set of capabilities than are possible on the web platform. To draw a parallel: browsers can store and auto-fill passwords. Certainly no party on the web should have the ability to see all the passwords that you enter into any site; the web platform would be completely broken if that were possible. But the browser-app can see whatever you type into the browser, can ask if you want to save the password, and so on. The site you're typing the password into could offer a related sort of service, but the browser-app here is on par with the most-privileged party, not the least-privileged.

I fully agree that browsers should offer lots of transparency into what they do; the whitepaper is part of that. I think @joshuakoran lists some interesting questions worth further discussion. But when you start with sweeping statements like "All Trusted Players should be able to access any data ANY OTHER trusted player should be allowed to access", I don't see a way to reconcile your goals with the entire history of the web.