brave / brave-browser

Brave browser for Android, iOS, Linux, macOS, Windows.
https://brave.com
Mozilla Public License 2.0

Brave Shields and fingerprintjs #20268

Closed m77e4t closed 2 years ago

m77e4t commented 2 years ago

#8787 #11770 #14031 #15825 #15875 #15853

https://fingerprintjs.com/demo/ https://github.com/fingerprintjs/fingerprintjs https://community.brave.com/t/brave-browser-fingerprinting-protection-is-useless/318318 https://odysee.com/@RobBraxmanTech:6/brave:84

cc: @pes10k cc: @ShivanKaul

1.) Users are getting the same fingerprint across different browser sessions and different IP addresses, even while blocking cache and cookies altogether during the test.

2.) The initial theory, that it was happening because of the same IP address, was later found to be wrong.

3.) The second theory was that it relied on cache and cookies. That should not be possible, as users are blocking and clearing cookies and cache for each browser session. If it is cache pollution (I don't know exactly what that is), it should still be stopped somehow so that users do not get the same fingerprint.

4.) It was said that this would be solved in May 2021 https://github.com/brave/brave-browser/issues/15875#issuecomment-842805805. The first issue was created in Feb 2021.

5.) The team clarified that it is an upstream issue. So why hasn't it been solved upstream (by the Chromium community) after 7 months? Can it be prevented without an upstream change? If so, why hasn't it been implemented yet?

6.) If the issue were limited to the site itself it would not have been that bad (though it still is). Many said that they are doing some "shady" stuff to sell their bogus product. Whatever they are doing, users are getting fingerprinted, and they are actively selling their product right now.

pes10k commented 2 years ago

Thank you for the comment here. I think there are several things going on here:

4.) It was said that will be solved in May 2021 itself #15875 (comment). The first issue was created on Feb 2021.

I am confident this (the external protocol handler approach) is not what's going on here. You can watch the demo page and see that they are not using external handlers to try to fingerprint users. It's true we expected a fix from upstream much sooner than we've received one (it seems the fingerprint.js folks expected one too), but I'm extremely skeptical that sites are using the external handler technique in the wild; it is very slow, and very obvious when it's happening (they need to open a popup window and load PDFs each time in Chromium browsers). It's something that would be good to fix, and if upstream doesn't do it, we'll get to it when we can, but I am certain it is not where the marginal benefit is greatest for Brave privacy efforts.

5.) Team clarified that it is an upstream issue. So, why hasn't it been solved upstream after 7 months (by chromium community)?? Can it be prevented without upstream change?? If so, why hasn't it been implemented till now??

You would need to ask the chromium team about why they have changed their scheduling here. In the meantime, we're blocking all scripts we know of that use this technique (and we work with Easylist, Easyprivacy and uBO to get those rules upstreamed when possible), providing an additional level of defense.

6.) If the issue was restricted/limited to the site itself it would have not been that bad (though it is), Many said that they are doing some "shady" stuff to sell their bogus product. Whatever they are doing, users are getting fingerprinted. And they are actively selling their product as of right now.

Again, I think fingerprint.js is sneaking by on misleading marketing here, not actual capabilities:

a. The approach here (as fingerprint.js says) is to use a bunch of weak heuristics, throw them together in a classifier, and make a best-effort match. This makes for impressive demos (since same UA + screen size + site being visited + time since last request is probably a good predictor of the same user on unpopular sites like fingerprintjs.com) but also leads to false positives (see the report below from the demo site; the first two are me, the third is for sure not). Put differently, what you're seeing here is not a high-confidence tracking technique, it's a bot flood prevention technique (which is why they don't care about false positives over time, just preventing false negatives in the short term).

[Screenshot of the demo site report, taken 2021-12-30 at 14:00:34]

The guts of useful fingerprinting defenses are not to make everyone look the same, or to make everyone look different; both of those are fundamentally not possible without massive breakage. What makes Brave's defenses uniquely strong is that for naive fingerprinters, we feed them enough randomization that they can't reidentify people (everyone looks different). And for sophisticated fingerprinters, the randomization forces them to ignore the random-but-high-entropy inputs and consume only a much smaller number of inputs, reducing identifiability and putting users into large anonymity sets on sites with non-trivial numbers of visitors. All that is to say, fingerprint.js is doing a crummy job on their unpopular site (again, see the false positive); if they tried to do the same from popular, real-world sites like the ones they advertise at the bottom, their success rate would be even worse.

b. Additionally, we block requests from sites when they call back to fingerprint.js's servers to try to use their identification-classifier-as-a-service. In the absence of being able to talk to fingerprint.js's servers, sites fall back on using the fingerprint.js library, which again Brave provides extremely strong protections against (as the fingerprint.js product concedes).
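The naive-versus-sophisticated distinction in (a) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Brave's implementation and not fingerprint.js's classifier: the `farbled`, `observe` and `fingerprint` helpers, the three input names, and the idea of deriving noise from a per-session seed plus the site name are all hypothetical stand-ins. Real session seeds would be random; fixed ones are used here so the run is repeatable.

```python
import hashlib
import hmac


def farbled(real_value: str, session_seed: bytes, site: str) -> str:
    """Append noise that is stable only for one (session, site) pair."""
    noise = hmac.new(session_seed, site.encode(), hashlib.sha256).hexdigest()[:8]
    return f"{real_value}:{noise}"


def observe(user: dict, session_seed: bytes, site: str) -> dict:
    """What a script on `site` reads: two stable inputs and one randomized input."""
    return {
        "user_agent": user["user_agent"],
        "language": user["language"],
        "canvas": farbled(user["canvas"], session_seed, site),
    }


def fingerprint(inputs: dict, ignore: tuple = ()) -> str:
    """Hash every input except the ones the script chooses to ignore."""
    kept = sorted((k, v) for k, v in inputs.items() if k not in ignore)
    return hashlib.sha256(repr(kept).encode()).hexdigest()[:12]


# Two made-up users who share a user agent and language but have different hardware.
alice = {"user_agent": "Chrome/96 Windows", "language": "en-US", "canvas": "gpu-a"}
bob = {"user_agent": "Chrome/96 Windows", "language": "en-US", "canvas": "gpu-b"}

site = "example.com"
alice_s1 = observe(alice, b"session-1", site)
alice_s2 = observe(alice, b"session-2", site)
bob_s1 = observe(bob, b"session-3", site)

# Naive script: hashes everything, so the same user gets a new ID in each session.
naive_ids = {fingerprint(o) for o in (alice_s1, alice_s2)}

# Sophisticated script: drops the randomized input, so IDs are stable but coarse.
coarse_ids = {fingerprint(o, ignore=("canvas",)) for o in (alice_s1, alice_s2, bob_s1)}

print(len(naive_ids), len(coarse_ids))
```

In this toy setup the naive script cannot link Alice's two sessions, while the script that ignores the randomized input puts Alice and Bob in the same bucket, which is the kind of false positive described above.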

I'm going to close this issue since there isn't a new issue to fix here, and we're already tracking this in the other issues you linked to.

That said, I appreciate you keeping our feet to the flame here. We are extremely serious about preventing online tracking (fingerprinting and otherwise). I just ask for some skepticism towards the folks trying to sell you a fingerprinting product, and that you consider the findings of folks who are more sincerely trying to evaluate fingerprinting protections in the wild, like the EFF or privacy researchers.

Finally, I'll just note that we're continuously rolling out new fingerprinting improvements (we just landed https://github.com/brave/brave-browser/issues/18062, and have work underway for accept-language and font fingerprinting). I can't promise a timeline for them, but I hope we'll start rolling these out in the next month or so. These are in addition to the wide range of protections Brave already deploys.

m77e4t commented 2 years ago

I had some final comments on the topic and wanted to clarify a few things.

1.)

I just ask some skepticism towards the folks trying to sell you a fingerprinting product, and to consider the findings of folks who are more sincerely trying to evaluate fingerprinting protections in the wild, like the EFF or privacy researchers.

I am neutral on randomizing vs. resisting (here "resisting" refers to the about:config changes made so that Gecko users all share the same fingerprint, and I will use the word that way below). Both methods are good and serve the user's privacy, just in different ways. I wasn't the original poster of the community forum thread with the inflammatory title (which I linked); I just picked it because it was the most recent ongoing thread.

Furthermore, I also follow the "what's Brave done for privacy" posts, blogs, etc., and particularly liked Sugarcoat and https://brave.com/research-paper-privacy-and-security-issues-in-web-3-0/. The problem I have seen with extension-based wallets like MetaMask or the new Brave Wallet is that it is hard to generate multiple public addresses and use them to interact with dapps easily. So, if the proof of concept actually comes to fruition, it will help users' privacy.

During this week, I studied fingerprinting (to the extent I can), experimented, tested, and finally found some ways to beat fingerprintjs. I tested on multiple OSes (Linux, Windows), multiple devices, different IPs, etc., and from a dozen browsers: Brave, Opera, Chrome, Edge, Firefox, Tor, and arkenfox.js, on Linux and Windows.

2.) Fingerprintjs is not exactly how real-world websites track and fingerprint users. But sites like fingerprintjs, coveryourtracks, amiunique, and creepjs can be a good starting point for examining browser fingerprinting, and I tested Brave with those sites too.

Fingerprintjs in particular uses: i.) User agent ii.) Probability iii.) Device timezone (most important) iv.) Browser/device language

3.) Default Firefox got fingerprinted (ID'ed) easily. But a fully hardened Firefox could pass the fingerprintjs test. To harden it easily, I used arkenfox.js and created a new hardened Firefox profile. Arkenfox.js and Tor Browser behaved similarly, as the underlying Firefox/Gecko is hardened in a similar way. Both of them beat fingerprintjs, but Tor needed to be in Safer mode rather than Standard mode to beat it. Opera and Edge have their own UAs, which seemed to make them both more unique.

a.) The arkenfox and Tor user agent is changed from Linux Firefox to Windows Firefox. b.) The anti-bot probability was affected, as everyone looked the same. c.) The device timezone is overridden and set to GMT+0 without affecting the device timezone itself. d.) The browser language was changed to English (US) by default. Other data is made the same for all users (resisting), or, afaik, Canvas and WebGL are randomized like Brave does.

4.) a.) Brave's UA is the same as Chrome's (which is a good thing). On Linux, Brave's UA by default identifies Linux, making it more unique. The Linux user base is small compared to Windows, and on top of that, the majority of Linux users seem to prefer Gecko browsers over Chromium browsers. If we consider UA data from amiunique (it may not be perfect real-world data), the similarity ratio for Brave/Chrome on Linux was around 1%, while for default Firefox on Linux it was around 8%. Arkenfox/Tor on Linux uses a Windows UA, making it around 15%. If the UA of Brave/Chrome on Linux is changed to Brave/Chrome on Windows via a web store extension, it is around 7%. When I changed the UA from Linux to Windows, fingerprintjs had a hard time ID'ing me: it could correctly ID me only half of the time. Even Chrome with uBO could evade it to some extent.

b.) With the extension, I was changing the UA per session. The UA was changed to more recent versions of Chromium rather than old ones. Because of this, the anti-bot probability was also affected, but to a smaller extent.

c.) Device timezone was the most notorious of them all (in relation to fingerprintjs; other fingerprinting data collectors like coveryourtracks or creepjs, or the real world, may be different). If the device timezone was changed repeatedly per session or changed to GMT+0, even without changing the UA, fingerprintjs had a hard time ID'ing me.

d.) I checked my browser languages, and they were English (Regional), English (UK) and English (US). I removed the other two and made English (US) my main language in the browser and on the OS itself, as it is the most used browser/device language. Naturally, you cannot randomize language, as a normal English-speaking user is not going to understand Japanese and vice versa. It seemed to have affected fingerprinting and reduced my uniqueness during individual trials.

5.) After combining all of these things, fingerprintjs could not ID me in any way. I have been quite happy with Brave's current anti-fingerprinting in strict mode, mainly because it provides protection by default to a lot of users who may be using Brave for crypto or other reasons. Brave is already looking into adding more to its anti-fingerprinting by randomizing/farbling data #20096 #816 #11770 #8574. These may be seen in a stable release in Q1 or by 2022.

But I have a concern. Some of the things, like timezone randomizing/resisting, might cause problems for some users. Brave VPN with Guardian may also be affected. I fear some randomizations, like timezone, UA, or language changed to English (UK or US, via opt-in), may be dropped even in Strict mode so as not to impact normal Brave users and mess up their browsing experience. Tor Browser currently has 3 modes: Standard, Safer and Safest. So Brave could also have 3 modes: Standard, Strict, Extreme. Extreme would be "Tor level" (or a more appropriate term might be "Snowden level"), with randomization taken as far as it can go even if it breaks the web to some extent.

6.) Unrelated to this issue: a.) Currently, as far as I know, randomization is done on a per-session basis. Can it be made more granular? Can randomization be done per window, or even per tab?

b.) Can we get more customization of Brave Shields? I believe custom lists are available on Brave Nightly and will soon be available on stable. Previously, this was hidden in a flag (brave://adblock). Along with it, can we get privacy report data in Shields itself, showing exactly which ads, trackers, and scripts are blocked on the site the user is on? Currently, we only get a number attached to the icon. On Android, it shows all trackers blocked by tracker name across all sites, but not per site, and it is only a report, not customizable per-site settings. (This might not be under your department, but that of other Brave devs.)

(I am just an end user with privacy enthusiasm and not a dev, so some of the things written above might be slightly or completely wrong. Consider it a random user's rambling from a brainstorming session, as fingerprintjs just irritated me.)
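For readers who want to compare the user-agent similarity ratios quoted in 4.) a.), one common way is to convert each share into "bits of identifying information" (surprisal, the negative base-2 logarithm of the share). The sketch below uses only the approximate percentages given above; the labels and the `surprisal_bits` helper are illustrative names, and the numbers are as rough as the amiunique figures they come from.

```python
import math


def surprisal_bits(share: float) -> float:
    """Bits of identifying information revealed by a value that `share` of users have."""
    return -math.log2(share)


# Approximate shares quoted in the comment above (amiunique data, not exact).
ua_shares = {
    "Brave/Chrome UA on Linux": 0.01,
    "Default Firefox UA on Linux": 0.08,
    "Arkenfox/Tor on Linux (Windows UA)": 0.15,
    "Brave/Chrome on Linux spoofing a Windows UA": 0.07,
}

for label, share in ua_shares.items():
    print(f"{label}: {surprisal_bits(share):.2f} bits")
```

A rarer value reveals more bits, so by this measure the roughly 1% Linux Chromium user agent is the most identifying of the four and the roughly 15% Windows Firefox one the least.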

@pes10k

Thorin-Oakenpants commented 2 years ago

Currently, as far as I know, randomization is done on session basis

Peter can correct me here if something is not quite right

For Brave it is all of these

It does this to protect the seed


since arkenfox was mentioned: arkenfox is aimed at reducing non-stateless tracking. It has never claimed to defeat [all] fingerprinting. But what it does is use the browser's built-in RFP, which randomizes canvas, and it uses ETP's fingerprinter block lists (and we advise all our users to use uBlock Origin)


Please don't read too much into test sites or try to self-diagnose. The path to defeating FPing is well known (see next post), and the only way to prove anything is to do large-scale real-world tests that collect fingerprints (per set, e.g. Brave strict, Tor Browser), with one test per browser, to get the entropy per fingerprint

I am of neutral opinion on randomizing vs resisting

Both methods are "resisting" - the terms I use are "raising/randomizing vs lowering"

Both methods' end goal is to protect the metric's real value being obtained

As Peter said. A randomized metric will help fool naive scripts - the more randomized metrics the better the chances a script is naive (i.e it swallows at least one poisoned pill). You do not need to "hide in a crowd" to defeat a naive script. That is the benefit of randomizing.

As Peter said, to defeat fingerprinting, you take each metric, and render it as useless as possible to fingerprinters (protect or reduce the real value, randomize it if you can or want to, take equivalency into account if applicable, and take the threat model and audience into account). And eventually, when you end up with enough metrics covered, fingerprinting as a tracker becomes too costly (upkeep, computational/perf costs for the website etc) or impossible for advanced scripts (which require a crowd)

Advanced scripts (advanced meaning it doesn't swallow any poison randomized pills) - ultimately, all randomizing can be detected, which means it is now lowered. You don't need to use third party sites either - you can use mathematical proofs (e.g. known pixel tests), or bypass it (e.g. knowing which items are poisoned - both per metric (e.g. fonts in a font list) or overall (e.g. Brave's shield level would tell if hardwareConcurrency is unusable or not))

But the more you protect, either method, the more you are likely to succeed
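The "known pixel test" idea can be sketched as follows. This is a toy model under loud assumptions: pixels are plain Python tuples rather than a real canvas, `farbled_readback` is a hypothetical stand-in for a browser that perturbs canvas readback, and no claim is made that any real script or browser works exactly this way. The point is only that a script which draws pixels it already knows can notice when the readback disagrees, and then drop that metric instead of swallowing it.

```python
import random

# A solid red 4x4 block that the script draws itself, so it knows the expected RGBA values.
KNOWN_PIXELS = [(255, 0, 0, 255)] * 16


def honest_readback(pixels):
    """Browser without randomization: returns exactly what was drawn."""
    return list(pixels)


def farbled_readback(pixels, seed=1234):
    """Hypothetical randomizing browser: flips the low bit of red in three pixels."""
    rng = random.Random(seed)
    noisy = list(pixels)
    for i in rng.sample(range(len(noisy)), k=3):
        r, g, b, a = noisy[i]
        noisy[i] = (r ^ 1, g, b, a)
    return noisy


def canvas_is_randomized(readback) -> bool:
    """Compare the readback of known pixels against what was drawn."""
    return readback(KNOWN_PIXELS) != KNOWN_PIXELS


def usable_metrics(metrics: dict, readback) -> dict:
    """An 'advanced' script drops the canvas metric once it detects poisoning."""
    if canvas_is_randomized(readback):
        return {k: v for k, v in metrics.items() if k != "canvas"}
    return dict(metrics)


metrics = {"canvas": "hash-of-canvas", "timezone": "UTC", "language": "en-US"}
print(sorted(usable_metrics(metrics, honest_readback)))
print(sorted(usable_metrics(metrics, farbled_readback)))
```

Once the randomized metric is detected and ignored, the script is left with fewer inputs, which is the sense in which a detected randomization ends up equivalent to lowering.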

Note that lowering entropy does not mean everyone looks the same. Many items are equivalency of other factors which cannot be protected, or they are required. For example, the possible fingerprints for Tor Browser (up to date and bugs/leaks aside) would be

The same applies to any browser - there are limits to how low you can lower. So that's the "buckets" and we can always work on getting that as low as possible from doing tests and using math :)

The actual entropy will depend on the spread of the set of users (such as Tor browser) in each fingerprint etc - that is there will be metrics with long thin tails - and to get those figures you need to do real-world large scale studies, one result per browser
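To make the point about buckets and long thin tails concrete, here is a small sketch that computes Shannon entropy and the smallest bucket size for two invented populations. The counts are made up purely for illustration (they are not measurements of Tor Browser, Brave, or any real crowd), and `shannon_entropy_bits` and `smallest_bucket` are hypothetical helper names.

```python
import math
from collections import Counter


def shannon_entropy_bits(fingerprints) -> float:
    """Average bits of information carried by the fingerprint across the population."""
    counts = Counter(fingerprints)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def smallest_bucket(fingerprints) -> int:
    """Size of the smallest anonymity set; 1 means someone is unique."""
    return min(Counter(fingerprints).values())


# Invented populations: one fingerprint per browser, 1000 browsers each.
two_even_buckets = ["A"] * 500 + ["B"] * 500
long_thin_tail = ["A"] * 990 + [f"rare-{i}" for i in range(10)]

for name, population in [("even", two_even_buckets), ("tail", long_thin_tail)]:
    print(name, round(shannon_entropy_bits(population), 3), smallest_bucket(population))
```

The long-tail population has lower average entropy than the evenly split one, yet ten of its members are unique, which is why the spread of real users across fingerprints matters and why it can only be measured with large-scale studies.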

While we can't make everyone the same, that doesn't mean you can't make fingerprinting useless - but it takes a crowd, the larger the crowd, the better. And of course it depends on your audience and threat model. Take Brave for example, where the timezone won't be 1 bucket - it doesn't meet Brave's threat model or audience - which aims to defeat naive scripts - but hopefully with more and more metrics being covered, and a very large and active crowd of Brave users, it becomes harder and harder for advanced scripts

I agree with Peter that the FPJS demo is misleading. I can easily fool it into thinking I am unique and have never visited before, and that's without changing IP or blocking their servers or changing anything to do with the browser chrome/position/sizes. However, don't be fooled - a fingerprint is just a snapshot of data that can be manipulated after the fact and can still be used to linkify

so tl;dr: brave randomizes for a reason, and RFP/Tor Browser lowers for a reason. Both methods are effective, with randomizing having some extra benefits (easy to handle compat). But ultimately both require a large crowd and enough metrics covered to beat advanced scripts - and I wouldn't call FPJS that advanced


Peter said

And for sophisticated fingerprinters, the randomization forces those fingerprinters to ignore the random-but-high-entropy inputs, and only consume a much smaller number of inputs, reducing identifiability and putting users into large anonymity sets for sites with non-trivial numbers of visitors.

The same holds for lowering entropy :)

pes10k commented 2 years ago

Peter can correct me here if something is not quite right

This is exactly right :)

While we can't make everyone the same, that doesn't mean you can't make fingerprinting useless - but it takes a crowd, the larger the crowd, the better. And of course it depends on your audience and threat model. Take Brave for example, where the timezone won't be 1 bucket - it doesn't meet Brave's threat model or audience - which aims to defeat naive scripts - but hopefully with more and more metrics being covered, and a very large and active crowd of Brave users, it becomes harder and harder for advanced scripts

Said better than I could have

Peter said… The same holds for lowering entropy :)

Yep, I didn't mean to disagree, or suggest that randomization is better in this regard

Thorin-Oakenpants commented 2 years ago

emphasis mine

And for sophisticated fingerprinters... into large anonymity sets for sites with non-trivial numbers of visitors

@pes10k sorry to nitpick, not! :) I missed this bit

First, to be more exact, I think you mean non-trivial numbers of visits per fingerprint

IMO, the number of these per site has nothing to do with stateless tracking. If the fingerprint is common, then there is no way for the site to know who is visiting (OpSec aside). They can make assumptions, but only for that 1st party, and it's still useless to linkify traffic across parties

example

So sure, the more traffic with your FP the less certainty - is that what you meant? But again, without solving the IP issue, I think this is moot

If you then add in user behavior factors (which IMO is not the purview of browser anti-FPing except to create large crowds by default), e.g. if fingerprintB always visited at roughly the same times and/or days of the week and/or the same site pages, then that's true: the larger the crowd with your FP, the less correlation

[1] If the IP problem is not solved, then your complete FP is ultimately too hard to render useless, IMO

drain commented 2 years ago

[Screenshot: brave_wC2TkZwz9s]

I've found a big factor in how they're fingerprinting, but I think this is also intentional. In case you're unaware, Brave sends unusually unique strings as plugin data that only change per profile.

Thorin-Oakenpants commented 2 years ago

@drain that's randomized data that is meant to be unique

m77e4t commented 2 years ago

Yo Thorin, could you check out this issue https://github.com/brave/brave-browser/issues/20924, is it doable? Was unable to tag you.

m77e4t commented 2 years ago

A new privacy related update just dropped https://brave.com/privacy-updates/17-language-fingerprinting/.

d.) I checked my browser language, and it was English (Regional), English (UK) and English (US). I removed the other two and made English (US) as my main language on browser and on OS itself as it most used browser/device language. Naturally, you cannot randomize language as a normal English speaker user is not gonna understand Japanese and vice versa. It seemed to have affected fingerprinting and reduced my uniqueness during individual trials.

It will likely resolve this point. In aggressive mode, it will only send data for English (US) plus other randomized data, rather than the way I had configured it before. Aggressive mode had no such protection before, so I was unknowingly getting uniquely fingerprinted by having English (Regional) enabled (before I knew how to stop it).

The only thing remaining will be timezone.

I will see how it works with fingerprintjs just for fun.

useranon350 commented 4 months ago

I wonder if we could get timezone randomization as well. I was experimenting with the site, and changing the TZ variable to different values within the same offset confused the tracking code substantially. e.g. TZ='America/Kentucky/Louisville' vs TZ='America/Indiana/Vincennes' both give me equivalent results in terms of the displayed time, but appear as different timezones. Adding in additional countries and continents would likely be even more effective, although daylight savings time would make that somewhat more complicated.

It seems to me that this would remove timezone from the list of data points which can be used to fingerprint.
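A rough sketch of the idea, with stated assumptions: the table below is a small hand-written sample mapping IANA zone names to their standard-time UTC offsets (it is not read from the tz database and ignores daylight saving time), and `name_signal` and `offset_signal` are hypothetical stand-ins for what a page might read via `Intl.DateTimeFormat().resolvedOptions().timeZone` and `Date.prototype.getTimezoneOffset()`.

```python
from collections import defaultdict

# Hand-written sample: IANA zone name -> standard-time UTC offset in hours (DST ignored).
STANDARD_OFFSET_HOURS = {
    "America/Kentucky/Louisville": -5,
    "America/Indiana/Vincennes": -5,
    "America/New_York": -5,
    "America/Chicago": -6,
    "Europe/London": 0,
}


def zones_by_offset(table: dict) -> dict:
    """Group zone names that show the same wall-clock offset."""
    buckets = defaultdict(list)
    for name, offset in table.items():
        buckets[offset].append(name)
    return dict(buckets)


def name_signal(zone_name: str) -> str:
    """What a script keyed on the zone name would record."""
    return zone_name


def offset_signal(zone_name: str) -> int:
    """What a script keyed on the UTC offset would record."""
    return STANDARD_OFFSET_HOURS[zone_name]


a, b = "America/Kentucky/Louisville", "America/Indiana/Vincennes"
print(name_signal(a) == name_signal(b))      # names differ
print(offset_signal(a) == offset_signal(b))  # offsets match
print(zones_by_offset(STANDARD_OFFSET_HOURS)[-5])
```

Under these assumptions, rotating between names in the same offset bucket changes what a name-keyed script records while the displayed time stays the same; a script keyed on the offset alone would still see one stable value, so this would presumably help most against scripts that hash the zone name.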