abrahamjuliot / creepjs

Creepy device and browser fingerprinting
MIT License

just opening one for my research on bot detection and stuff #190

Open vis2021t opened 2 years ago

vis2021t commented 2 years ago

I looked over the TLS fingerprinting u talked about, but I read an Akamai research post where they state that bots are able to tamper with TLS to get on the good side: https://www.akamai.com/blog/security/bots-tampering-with-tls-to-avoid-detection

I came across a 2-step TLS fingerprinting approach, but I lost that PDF 🥲🥲 dammit

Will try to find it, but do u know about it?

vis2021t commented 2 years ago

Do u have something in mind for Brave? I think it's something we might focus on

vis2021t commented 2 years ago

Something unexpected I'm seeing: my default Chrome trust score has decreased.

On Linux x86_64 (Parrot OS), the voices section says it's unsupported even though it was supported a few days ago. I haven't even touched Chrome (I was working with Firefox), and Fonts is showing red.

I even installed a fresh Chromium and got the same result. No software update has been done in 2 months.

Everything looks normal for Fonts in the panel below, but things are off with speech.

I looked in window and speechSynthesis is there; the check comes back as boolean true.

Images: 20220818_180942, 20220818_181001, 20220818_181019

abrahamjuliot commented 2 years ago

Fonts

I refactored fonts. So, some devices will have a low font score until the fingerprint establishes a good trust score. It should get bumped to a higher score after a few days of traffic reporting the same fingerprint.

Fonts now use document.fonts.check in addition to loading local fonts. This has the effect of capturing more fonts on Linux and Chrome OS, and is a bypass to Brave's new font randomization, which recently began covering local font queries.

On the fonts test page, near the top, the check & load methods typically return a distinct set of fonts. I'm not sure why that is.
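As a rough illustration of the check-based approach described above, here is a minimal sketch that probes a candidate list with `FontFaceSet.check()`. The helper name `checkFonts` and the candidate font names are assumptions for illustration, not CreepJS code, and what `check()` returns for a family that is not installed may differ between engines, so the result needs per-engine interpretation.

```javascript
// Sketch: probe candidate font families with FontFaceSet.check().
// `fontFaceSet` is document.fonts in a page; `checkFonts` is a hypothetical helper.
function checkFonts(fontFaceSet, candidates) {
  return candidates.filter((family) => {
    try {
      // check() takes a CSS font shorthand: a size followed by the family.
      return fontFaceSet.check(`16px "${family}"`);
    } catch (err) {
      // A malformed font string or a missing API counts as "not detected".
      return false;
    }
  });
}

// In a browser console (font names are only examples):
// checkFonts(document.fonts, ["Arial", "Ubuntu", "Noto Color Emoji"]);
```

Passing the font set in as a parameter keeps the helper usable with a stub, which makes it easier to compare the check and load methods side by side.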

Speech

Voices can be hit or miss on some devices, but these should load on refresh. The issue might be due to the voice load event not firing within the 3000 ms cutoff I have set.

If that is not the issue, then it might be due to the absence of local voices on the platform. On Blink, we return blocked if no local voice engines are detected. I believe this is common on Linux devices. Need to change that to unsupported.

I see it affects the crowd score, since it computes voices as being blocked. I'm planning to remove voices from impacting the score, since the timeout situation and the absence of local voices are not necessarily blocking.
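The timeout behaviour described above can be sketched roughly like this. It is a minimal sketch, not the CreepJS implementation: `loadVoices` and the shape of its result are assumptions, and it assumes the synthesis object supports `addEventListener` for `voiceschanged`, which may not hold in every engine.

```javascript
// Sketch: wait for voices, but give up after a cutoff (3000 ms as in the comment above).
// `synth` is window.speechSynthesis in a page; `loadVoices` is a hypothetical helper.
function loadVoices(synth, timeoutMs = 3000) {
  return new Promise((resolve) => {
    const initial = synth.getVoices();
    if (initial.length > 0) {
      resolve({ voices: initial, timedOut: false });
      return;
    }
    let timer;
    const onChange = () => {
      clearTimeout(timer);
      synth.removeEventListener("voiceschanged", onChange);
      resolve({ voices: synth.getVoices(), timedOut: false });
    };
    // Start the cutoff first, then listen for the load event.
    timer = setTimeout(() => {
      synth.removeEventListener("voiceschanged", onChange);
      resolve({ voices: synth.getVoices(), timedOut: true });
    }, timeoutMs);
    synth.addEventListener("voiceschanged", onChange);
  });
}

// In a browser console:
// loadVoices(speechSynthesis).then(console.log);
```

Reporting `timedOut` separately from an empty voice list lets the caller tell "the event never fired" apart from "the platform has no local voices", which is the distinction between the two cases above.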

abrahamjuliot commented 2 years ago

Brave

Yes, here's the tracker.

I'm planning to work on screen once it is available in Nightly.

vis2021t commented 2 years ago

Any plans over improving Brave trust score?

vis2021t commented 2 years ago

But I suppose I had a 100% trust score a few days ago

also with respect to voices, hmm, that's weird

it's probably not firing. Is there no better approach for speech?

vis2021t commented 2 years ago

Hi, I found something. I would like u to test it on headless Chrome and tell me if there is anything interesting in it:

navigator.userActivation. I want to see what other headless browsers say

++ I also could not find anything on MDN

++ I found something interesting about worker scopes

So apparently navigator.mimeTypes is undefined in any worker scope, as we expected, which makes it clear that the worker navigator is different. But when I try navigator.connection, it is present, which means MDN is incomplete on the worker scope navigator

Hope it sounds great to u
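One way to compare the window and worker navigators, as a minimal sketch: probe the same list of property names in both scopes and diff the reports. The helper name `probeNavigator` is an assumption, and the property list is only an example.

```javascript
// Sketch: report which properties exist on a navigator-like object.
// Call it with window.navigator in a page and self.navigator inside a worker.
function probeNavigator(nav, keys) {
  const report = {};
  for (const key of keys) {
    report[key] = key in nav; // true when the property exists, even if its value is null
  }
  return report;
}

// Inside a worker script (example property list):
// self.postMessage(probeNavigator(self.navigator, ["mimeTypes", "connection", "userActivation"]));
```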

vis2021t commented 2 years ago

The following Web APIs are available to workers: Barcode Detection API, Broadcast Channel API, Cache API, Channel Messaging API, Console API, Web Crypto API (Crypto), CustomEvent, Encoding API (TextEncoder, TextDecoder, etc.), Fetch API, FileReader, FileReaderSync (only works in workers!), FormData, ImageData, IndexedDB, Network Information API, Notifications API, Performance API (including: Performance, PerformanceEntry, PerformanceMeasure, PerformanceMark, PerformanceObserver, PerformanceResourceTiming), Promise, Server-sent events, ServiceWorkerRegistration, URL API (e.g. URL), WebGL with OffscreenCanvas (enabled behind a feature preference setting gfx.offscreencanvas.enabled), WebSocket, XMLHttpRequest.

abrahamjuliot commented 2 years ago

navigator.userActivation

I get true for both isActive and hasBeenActive in headless
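For anyone repeating the test in other headless browsers, a defensive read of the two flags could look like the sketch below. `readUserActivation` is a hypothetical helper; it only assumes that `navigator.userActivation`, when present, exposes `isActive` and `hasBeenActive`.

```javascript
// Sketch: read navigator.userActivation without throwing where it is unsupported.
function readUserActivation(nav) {
  const activation = nav.userActivation;
  if (!activation) {
    return { supported: false, isActive: null, hasBeenActive: null };
  }
  return {
    supported: true,
    isActive: activation.isActive,
    hasBeenActive: activation.hasBeenActive,
  };
}

// In a browser console:
// readUserActivation(navigator);
```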

workers

There's more too, here

trust score

No plans to change the scoring. Recently, I began factoring in the crowd-blending score to trust score. Getting and maintaining an A trust score should be slightly more difficult.

Regarding Brave, I'm looking for ways to directly detect randomization or restore the values.

vis2021t commented 2 years ago

see, I told u the docs are incomplete on MDN

looks like we are working at a stage where even the documentation feels incomplete haha 😂

well anyway, I don't have my laptop rn at the hospital. My health isn't good, but it's boring here ngl

so sorry if I disturb u by asking u to test things for me

yea hmm, I will check in a custom way whether the speech voices are working on my laptop

trust score

yea dude it's good I think that's better

brave

hmm I will explore myself too don't worry

if a webdriver spoofed its worker scope user agent so it no longer says headless,

what else can we do to get sus on it?

any thoughts?

abrahamjuliot commented 2 years ago

It's all good. I enjoy testing and research. Hope you get well soon there.

We might be able to estimate the GPU brand on Brave based on the unprotected WebGL parameters. I would just need to begin tracking GPUs with the reduced parameters in the prediction section, and maybe call it gpu params 2. We can do this for Tor Browser standard mode, too.

Another thing I'm looking into is a human confidence score. For example, I would imagine automated browsers are not likely to have popular writing tool extensions like Grammarly or LanguageTool. So, we can fingerprint these extensions and more with little effort using CSS selectors. The existence of tools like this can increase the human confidence score. But, I can see this getting exploited or mocked. I wonder if bots will actually start installing these extensions. It would be funny to catch them in the act and then use it as a fingerprint 😁.
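The CSS selector idea above could be sketched as follows. Everything named here is an assumption for illustration: the selectors in `EXTENSION_SIGNATURES` are guesses at markup these extensions may inject and are not verified against current builds, and `detectExtensions` and `humanConfidenceBoost` are hypothetical helpers.

```javascript
// Hypothetical signatures: selectors are assumptions, not verified extension markup.
const EXTENSION_SIGNATURES = {
  grammarly: ["grammarly-desktop-integration", "[data-gr-ext-installed]"],
  languageTool: ["[data-lt-installed]", "lt-toolbar"],
};

// Sketch: list the extensions whose selectors match anything under `root`.
// `root` is document in a page.
function detectExtensions(root, signatures = EXTENSION_SIGNATURES) {
  return Object.keys(signatures).filter((name) =>
    signatures[name].some((selector) => root.querySelector(selector) !== null)
  );
}

// Sketch: each detected extension adds a fixed amount to a human confidence score.
function humanConfidenceBoost(found, perExtension = 1) {
  return found.length * perExtension;
}

// In a browser console:
// humanConfidenceBoost(detectExtensions(document));
```

As noted above, a signal like this is easy to mock, so it probably only makes sense as a small positive weight and never as proof on its own.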

abrahamjuliot commented 2 years ago

This is not displayed yet, but I added gpuBrand tracking now on a handful of the prediction fingerprints

// available in the console
JSON.parse(sessionStorage.decryptionData).canvasPaintSystem.gpuBrand // INTEL for me

vis2021t commented 2 years ago

Hmm, GPU detection like that would actually be more reliable, as it would be quite hard to fake

I wonder what logic u use on the back end

I mean, I have looked at the discussion about the back end myself and even been a small part of it, but I wish to know more

Headless detection has a problem combining with the bot score

I think when headless is detected, it should push the bot score toward more likely a bot, which is not happening: I tested with the Google mobile-friendly bot and, yea, the worker scope detects headless

but I think we can explore more in the worker section, as there are things there which might be interesting

abrahamjuliot commented 2 years ago

logics on the back end

The server gives the canvas fingerprint a data profile that contains 3 GPU related arrays. This logic is used for systems and devices too.

gpuBrands: [
  "INTEL"
],
gpus: [
  "INTEL:ANGLE (Intel(R) UHD Graphics Direct3D11 vs_5_0 ps_5_0)",
  "INTEL:ANGLE (Intel, Intel(R) UHD Graphics 620 Direct3D11 vs_5_0 ps_5_0, D3D11)"
],
gpuWatch: [
  "INTEL:460191600000:8/2/1984:703722......:18"
],

Restrictions are in place before a GPU is accepted in the above arrays. It must have no damage or signs of JavaScript tampering, it must have a moderate or high confidence score (with known parts), and if WebGL in worker is available, the GPU strings must match.
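Read as a predicate, the three restrictions above might look like the sketch below. The field names (`tampered`, `confidence`, `gpu`, `workerGpu`) and the helper `isGpuAcceptable` are assumptions made for illustration; the actual server code is not shown in this thread.

```javascript
// Sketch: gate a reported GPU before it is added to the profile arrays.
// All field names are hypothetical.
function isGpuAcceptable(report) {
  // 1. No damage or signs of JavaScript tampering.
  if (report.tampered) return false;
  // 2. Moderate or high confidence score.
  if (!["moderate", "high"].includes(report.confidence)) return false;
  // 3. If WebGL in a worker is available, both GPU strings must match.
  if (report.workerGpu != null && report.workerGpu !== report.gpu) return false;
  return true;
}
```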

Deceptions can still occur through engine level tampering, so we have this final gpuWatch array that tracks each reported brand and makes the brand self-destruct from all 3 arrays if it fails to maintain trust.

To establish trust, it just needs to satisfy client side checks and show up on the server as a valid GPU string. To maintain trust, the brand needs to just not self-destruct under these conditions:

If the brand only has

Self-destruction on that brand is triggered by any counter brand whenever one shows up on the server.

// gpuWatch (the watcher)
[BRAND]:[timeLastSeen]:[dateStringOfLastSeen]:[hashOfDistinctTimezone][...][...][...]:[brandSeenTotal]

This design aims to auto distrust reporting if it is not supported by current web traffic.
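To make the watcher format easier to follow, here is a minimal sketch that splits a gpuWatch entry into its five colon-separated fields. `parseGpuWatch` and the property names are assumptions; the timezone hash field is kept as a raw string because its contents are elided in the example above.

```javascript
// Sketch: split a gpuWatch entry into named fields.
// Format: [BRAND]:[timeLastSeen]:[dateStringOfLastSeen]:[timezone hashes]:[brandSeenTotal]
function parseGpuWatch(entry) {
  const parts = entry.split(":");
  if (parts.length !== 5) {
    throw new Error(`expected 5 fields, got ${parts.length}`);
  }
  const [brand, timeLastSeen, dateLastSeen, timezoneHashes, seenTotal] = parts;
  return {
    brand,
    timeLastSeen: Number(timeLastSeen), // millisecond timestamp
    dateLastSeen,                       // date string, kept as-is
    timezoneHashes,                     // raw, unparsed hash field
    seenTotal: Number(seenTotal),
  };
}

// parseGpuWatch("INTEL:460191600000:8/2/1984:703722......:18").brand is "INTEL"
```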

abrahamjuliot commented 2 years ago

headless is detected should increase the bot score

I like this idea. I plan to change the bot pattern to include more headless signals. Right now, everyone gets the stranger bot level on the first visit, and then on the second visit bot patterns are computed, but we can start differentiating stranger from headless on the first visit and boost the score.

vis2021t commented 2 years ago

gpu detailed explanation

hi buddy thanks for sharing details over what happens, I will be back from hospital today

bot patterns need a lot of change, I think.

vis2021t commented 2 years ago

Hi buddy, something I want u to look at for a tiny bit:

  1. https://abrahamjuliot.github.io/creepjs/

  2. https://codepen.io/matt-west/full/DpmMgE

I feel a bit confused

abrahamjuliot commented 2 years ago

Hmmm... how about this demo? Do any voices load?

https://mdn.github.io/dom-examples/web-speech-api/speak-easy-synthesis/

Missing voices on Linux might be related to this Chromium issue.

https://bugs.chromium.org/p/chromium/issues/detail?id=586819

vis2021t commented 2 years ago

Sure I will see and let u know,

Hmm, I was wondering how the sample data is created and updated?

Of course it holds an important role

++ I am also curious about the GPU sign near DOMRect etc.

in the crowd blending section

I saw there are many places with a GPU sign (I guess that's what it means), and I am curious because I wasn't aware those could be used for it

DOMRect and device of timezone grabbed my attention

Hope u can explain something to me on that

vis2021t commented 2 years ago

I found a resource on guessing random numbers with high accuracy. Does it grab some interest from u?

I mean, if someone is generating an entire random number to bypass detection in some way

https://v8.dev/blog/math-random It can be very useful, and I also found a trusted YouTube video on it

https://www.youtube.com/watch?v=-h_rj2-HP2E

Please take a look at his research

Well, I am unaware of how privacy plugins work, or how bot detection bypasses work for companies with google_bot, but so far I have seen everyone use random somewhere for output, which makes me even more sure that this is something big

I think we should stick to understanding the web engine code directly. It will give us more info than any docs can

I was doing that on V8 and came across the random number implementation, so I assume my approach is quite simple yet one of the fastest

I suppose if anyone creates a randomly generated value when we call a specific thingy

...it might be a good place to easily decrease the trust score

I will research recovering previously generated random numbers by reversing the random number implementation, in case they generated random values before we called it

I think it's something really interesting which should catch our eye

vis2021t commented 2 years ago

Domrect , Device of timezone grabbed my attention how are those samples data created and updated?

only 2 tiny question from a kid hehe Hope u could tell me something on it

hope u found my research on random number detection somewhat useful

vis2021t commented 2 years ago

I hope u could take out time for me.....

abrahamjuliot commented 2 years ago

I'm terribly slow, but I've been looking forward to this. These are great questions.

gpu sign (domrect, device of time zone, etc)

As far as I know, DOM rect and other pixel precision fingerprints (font pixels, SVG, and text metrics) are not actually affected by graphics hardware, but I might be wrong about that. We collect the GPU brand anyhow to see what we can find. The pixel rendering uses CSS transforms, but this only impacts the frame-rate and not the precision of the dimensions.

It's possible that the prediction is accurate due to the low GPU count in the fingerprint or the hardware signature. For example, the DOM rect can be very different on certain virtual machines with unusual display resolutions (VMware, VirtualBox, etc.). The GPU brand does not impact the pixels, but this fingerprint only includes GPUs from a single brand. A clearer example is WebKit pixel rendering on iOS, which will only have the Apple GPU brand.

It will be interesting to see if we can accurately predict NVIDIA, AMD, or INTEL in pixel precision rendering after a week or so of gathering fingerprints. I can check the logs and see what good crowd score fingerprints are reporting.

how are those samples data created and updated?

It's mostly through Google Firebase/Firestore. When the page initially loads, an encrypted request is sent to the API containing the prediction fingerprints. The request is deciphered on the server before returning the results. Subsequent page loads will only resend the post if the fingerprints change. To ensure the accuracy of the samples, I set up some restrictions: certain forms of bad bot behavior (as identified by the bot hash) are not allowed to participate. Finally, I use a Google Apps Script server to automatically request the full JSON dump we have here on GitHub.

I currently am manually importing it from Apps Script to GitHub every 2 weeks or so, whenever browser features need an update. It would be great to find a way to automate this step. The file is 2 megabytes, which would make an API request from the client too costly.
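The "only resend if the fingerprints change" step can be sketched as a digest comparison. This is an assumption about one way to do it, not the CreepJS code: `fingerprintDigest`, `shouldResend`, and the storage key are hypothetical, the hash is a small FNV-1a style checksum rather than anything cryptographic, and the request encryption mentioned above is not modeled.

```javascript
// Sketch: stable digest of a flat fingerprint object (keys sorted so order does not matter).
function fingerprintDigest(fingerprints) {
  const keys = Object.keys(fingerprints).sort();
  const text = JSON.stringify(keys.map((key) => [key, fingerprints[key]]));
  let hash = 0x811c9dc5; // FNV-1a style 32-bit offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

// Sketch: decide whether to post again, remembering the last digest in `storage`.
// `storage` is sessionStorage or localStorage in a page; the key name is hypothetical.
function shouldResend(storage, fingerprints, key = "lastFingerprintDigest") {
  const digest = fingerprintDigest(fingerprints);
  if (storage.getItem(key) === digest) return false;
  storage.setItem(key, digest);
  return true;
}
```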

to be continued...

vis2021t commented 2 years ago

I found that Google bot has abnormal hardware concurrency

It is between 112 and 128 on average as far as I have seen, and that is for something claiming to be Android

of course the worker found out it is headless Chrome on Linux

but this part is also interesting: it claims at the start to be an Android of a specific model (in the Android user agent and other places), yet it has abnormal specs. I think we can make a chart

of what each JS engine and device declares etc. There are a few places it can be useful

  1. I have been trying to make an SMT model which uses the same algorithm, but reversed, to recover the previously generated random number from a sample

I will test it today and let u know

  2. to automate a small file like that, mm, maybe a simple bot connected to the Apps Script using a WebSocket, to have a bidirectional connection and immediate updates

and to update the client with less of a performance hit

hmm, have to think

vis2021t commented 2 years ago

the random number prediction method works only on Chrome and Node.js

The method is different for SpiderMonkey

vis2021t commented 2 years ago

we could make a self-feeding SMT solver, up to a limit, and use that to determine the random number. But my curiosity is: if we can go ahead, I think we might also go back

like, first we need to understand how the script is working, then we can work in reverse. Did u see any similar pattern anywhere?

on future prediction or past prediction

abrahamjuliot commented 2 years ago

guessing random number with high accuracy

This is incredible. Watched the video too. I wonder if we could use JavaScript or WASM to predict Math.random. The difficulty is determining what to target. I can think of some counter-attacks that could handle randomization using Web Crypto, WASM, or by changing the engine (like this browser-- it's advanced but has a handful of leaks).
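For context on why prediction is possible at all: the V8 blog post linked earlier describes Math.random as being backed by xorshift128+, which is fully determined by two 64-bit state words. The sketch below reimplements that update step with BigInt. It is only an illustration of the determinism; how the engine turns the state into a double and how it batches values are not modeled here, and `createXorshift128plus` is a hypothetical name.

```javascript
// Sketch: xorshift128+ state update, following the routine shown in the V8 blog post.
const MASK64 = (1n << 64n) - 1n;

function createXorshift128plus(seed0, seed1) {
  let state0 = BigInt.asUintN(64, seed0);
  let state1 = BigInt.asUintN(64, seed1);
  return function next() {
    let s1 = state0;
    const s0 = state1;
    state0 = s0;
    s1 ^= (s1 << 23n) & MASK64; // keep the shift inside 64 bits
    s1 ^= s1 >> 17n;
    s1 ^= s0;
    s1 ^= s0 >> 26n;
    state1 = s1;
    return (state0 + state1) & MASK64;
  };
}

// Two generators seeded with the same state produce the same sequence,
// which is what makes recovering the state from observed outputs useful.
```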

vis2021t commented 2 years ago

I was thinking about a headless browser that pretends to have a gyroscope and reports some fixed values, with the ending digits filled in by Math.random, which is a neat trick I have seen. We could detect that, and much more, by using Math.random prediction to catch lies

this can be a whole other level thing

vis2021t commented 2 years ago

Check this out :- https://github.com/chromium/chromium

vis2021t commented 2 years ago

https://incolumitas.com/2021/01/10/browser-based-port-scanning/

Why don't we test it out and see what bots do?

I saw the puppeteer-extra stealth plugin has some control over service workers or shared workers

one of them

so with BiDi as the new browser automation method, I think it might be interesting to check for ports

self localhost port scanning

abrahamjuliot commented 2 years ago

Check this out :- https://github.com/chromium/chromium

Nice. I sometimes use these...

https://source.chromium.org/

https://searchfox.org/

https://chromestatus.com/roadmap

https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Releases

vis2021t commented 2 years ago

Nice. I sometimes use these

I will be looking at the headless and normal one really deeply including v8 engine source

I feel like I am becoming the browser 😂😂 lol

vis2021t commented 2 years ago

Had any thoughts over random number predictions?

I think we can use it for many things but I wanna know if u have anything in your mind

abrahamjuliot commented 2 years ago

Hmmm... thinking on this, but nothing yet. I'm guessing we would need a target script. For example, if the script is getting a random number within a random range, we would need to know that both the range and the final number need to be decoded.

vis2021t commented 2 years ago

Hmmm... thinking on this, but nothing yet. I'm guessing we would need a target script. For example, if the script is getting a random number within a random range, we would need to know that both the range and the final number need to be decoded.

I think we should check out privacy extensions and how they use random

abrahamjuliot commented 2 years ago

Here are a few examples

https://github.com/kkoooqq/fakebrowser/blob/main/src/core/DeviceDescriptor.ts#L463

https://github.com/unblocked-web/unblocked/blob/main/plugins/default-browser-emulator/lib/Viewports.ts#L69

https://github.com/duckduckgo/content-scope-scripts/blob/main/build/chrome/inject.js#L19

https://github.com/jake-cryptic/AbsoluteDoubleTrace/blob/master/MyTrace/js/contentscript/page.js#L94

We use these to generate random traps in canvas and audio

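// color channel: random integer from 0 to 255
~~(Math.random() * 256)
// frequency: random integer in [min, max], both ends inclusive
const getRandFromRange = (min, max) => Math.floor(Math.random() * (max - min + 1)) + min
const max = 20
const length = 2000 // assumed; implied by the 1979 upper bound in the comment below
const start = getRandFromRange(275, length - (max + 1)) // random index between 275 and 1979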

Used for encryption in post requests

crypto.getRandomValues(new Uint8Array(12))

vis2021t commented 2 years ago

Hmm, I see Math.random is used quite a lot

If we could carefully understand how they are trying to keep random from looking random

and use our method to look for a pattern match

it can be interesting,

lol who knew Math.random could be an issue 😂😂

vis2021t commented 2 years ago

I am doing self-bugs finding maybe will find a bug, I am a browser now

(beep beep - boop boop) sandbox

What are u doing these days?

abrahamjuliot commented 2 years ago

Right now, I'm working on a new analysis API. The current analysis response is displayed in the console, but the goal is to show this in a section and tag suspicious traffic.

vis2021t commented 2 years ago

new analysis API.

I was thinking that, like the way u are doing it for GPUs,

we can also do it for smartphones. U wouldn't imagine a phone with 127 hardware concurrency claiming to be an Android Pixel with SwiftShader lol

I mean, I think even when it comes to things

the specs JS gives us won't be higher than what is stated for the device model, u know. It won't go higher than its original specs

if they don't report a model, that's a different condition, but if they do

this can be a simple technique 👌
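A minimal sketch of the spec ceiling idea above: look up the declared model in a table of known limits and flag reports that exceed it. `checkDeclaredSpecs`, the field names, and the table shape are assumptions; the table itself would have to be supplied and maintained by whoever runs the check.

```javascript
// Sketch: compare reported specs against a caller-supplied table of per-model limits.
// `report` and `knownSpecs` shapes are hypothetical.
function checkDeclaredSpecs(report, knownSpecs) {
  const spec = knownSpecs[report.model];
  if (!spec) {
    // No declared model, or a model we have no data for: a different condition.
    return { known: false, plausible: null };
  }
  return {
    known: true,
    plausible: report.hardwareConcurrency <= spec.maxCores,
  };
}

// Example with an illustrative table entry:
// checkDeclaredSpecs({ model: "Pixel 6", hardwareConcurrency: 127 }, { "Pixel 6": { maxCores: 8 } });
```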

abrahamjuliot commented 2 years ago

smartphones

I like this idea. iPhone would be easy to verify. We just need to validate the engine is WebKit. Android on Firefox and Chrome has dozens of features not on desktop. We could run a mobile test based on this.

vis2021t commented 2 years ago

My buddy liked my idea

Hehe true, we had a first glance at this thingy with navigator.connection.type

which shows there are many things that are not available elsewhere, so JavaScript really is different from place to place

vis2021t commented 2 years ago

There is one thing I have not tested which I wish to test in Chrome

there was a code execution vulnerability, but browsers are sandbox based

I was thinking not about escaping the sandbox, but is it possible to access resources of chrome:// URLs, which are restricted for JS by default?

I mean, those URLs are really, really great for many things. They hold a lot of info

abrahamjuliot commented 2 years ago

I don't think so. Pretty sure those are locked down, but maybe there's a bug to get around it.

vis2021t commented 2 years ago

I don't think so. Pretty sure those are locked down, but maybe there's a bug to get around it.

True, will look into it, especially the version page, chrome://version

It has path information, for example

data/user/0/com.kiwibrowser.browser/app_chrome/Default

I think these places can also be useful if we could sum up bugs around them

that's why I'm studying the V8 JS engine etc.

pretty sure many of the things under chrome:// can't be faked

Image: Screenshot_20220907-150928_Kiwi Browser

vis2021t commented 2 years ago

Was this your choice or were u inspired by something?

Image: Screenshot_20220908-151245_Termux

😆😆

abrahamjuliot commented 2 years ago

Lol. It's an arbitrary selection based on unique patterns here.

I need to create a better test page and render by platform font and then create sections by versions. I believe the older versions have much faster rendering performance.

https://emojipedia.org/emoji-versions/ and there's some overlap at https://emojipedia.org/unicode-versions/

vis2021t commented 2 years ago

lmao can't believe we are gonna cross reach comments

I cloned the page as rendered by Googlebot, and almost every single icon in crowd blend was detected as Linux

except the screen, which showed as Android

I think bots only hide the basic user agent, screen, etc. Even their dedicated worker user agent had headless in it lol

Have u seen any smart bots in your server side reviews

or maybe something u felt could be a bot?

abrahamjuliot commented 2 years ago

I see 3 types of traffic on CreepJS

I'm not certain any of these are automated until I examine the request timing and delay pattern. For example, there are spikes that generate ~500 requests in less than 10 minutes (with less than organic delay) and produce 100s of DOMRect fingerprints but only 1 SVGRect fingerprint. SVGRect uses the same DOMRect interface and should produce the same amount of fingerprints. CSS pixels should also yield a similar distinct count.

Other spikes can look natural and produce a reasonable 500 requests in under an hour, but then it contains 300 GPU strings and the timing looks more like a hit-and-run operation. Perhaps the developer is putting in some exercise tests on the crowd blend API. I'm often able to single it out as a unique fingerprint by just looking at the stack value, which is not easy to fake.
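The DOMRect versus SVGRect mismatch described above can be checked mechanically. The sketch below counts distinct fingerprint values per field across a batch of requests and flags a lopsided pair. The field names (`domRect`, `svgRect`), both helper names, and the ratio threshold are assumptions for illustration, not values taken from the CreepJS server.

```javascript
// Sketch: count distinct values per fingerprint field across a batch of requests.
function distinctCounts(requests, fields) {
  const sets = Object.fromEntries(fields.map((field) => [field, new Set()]));
  for (const request of requests) {
    for (const field of fields) {
      if (request[field] !== undefined) sets[field].add(request[field]);
    }
  }
  return Object.fromEntries(fields.map((field) => [field, sets[field].size]));
}

// Sketch: two fields that should vary together are lopsided when one has
// at least `ratio` times more distinct values than the other.
function looksLopsided(counts, fieldA, fieldB, ratio = 10) {
  const low = Math.min(counts[fieldA], counts[fieldB]);
  const high = Math.max(counts[fieldA], counts[fieldB]);
  return high >= ratio * Math.max(low, 1);
}
```

The same pair of helpers could be pointed at GPU strings or CSS pixel fingerprints, since the comment above expects those to show similar distinct counts in organic traffic.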

vis2021t commented 2 years ago

Types of Traffic Creepjs gets

I understood,

I am curious what do u use browser sources for?

for example chromium source etc?