Open Kikobeats opened 4 months ago
updated with an example!
Hello @Kikobeats - and thank you for your interest in this project!
All our generated data is based on data collected from real web traffic. Without going into too much detail, we have a (constantly updating) dataset of user fingerprints. These contain the `user-agent` string as well as more intricate details (screen resolution, total amount of memory installed in the system, etc.).
During the training phase, we take all these attributes and train a Bayesian network on them. Every possible value of any attribute is then expressed as a conditional probability given the values of its "parent" attributes.
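As a rough illustration of that training step, here is a minimal sketch that estimates a conditional probability table by counting. It assumes a two-node network (one parent attribute, one child attribute). The helper name `buildCpt` and the sample records are hypothetical and are not taken from fingerprint-suite, whose real network covers many more fields.

```javascript
// Hypothetical helper, not part of fingerprint-suite: estimates P(child | parent)
// by counting how often each child value occurs alongside each parent value.
function buildCpt(records, parentKey, childKey) {
  const counts = new Map(); // parent value -> Map(child value -> count)
  for (const record of records) {
    const parent = record[parentKey];
    const child = record[childKey];
    if (!counts.has(parent)) counts.set(parent, new Map());
    const row = counts.get(parent);
    row.set(child, (row.get(child) ?? 0) + 1);
  }

  // Normalise each row so the probabilities for one parent value sum to 1.
  const cpt = {};
  for (const [parent, row] of counts) {
    let total = 0;
    for (const n of row.values()) total += n;
    cpt[parent] = {};
    for (const [child, n] of row) cpt[parent][child] = n / total;
  }
  return cpt;
}

// Made-up training records, for illustration only.
const records = [
  { userAgent: 'desktop', screenResolution: '1920x1080' },
  { userAgent: 'desktop', screenResolution: '1366x768' },
  { userAgent: 'mobile', screenResolution: '390x844' },
  { userAgent: 'mobile', screenResolution: '390x844' },
];

const cpt = buildCpt(records, 'userAgent', 'screenResolution');
console.log(cpt);
```

Combinations that never occurred in the records (a mobile resolution under `desktop`, say) simply have no entry in the table, which is the same as a probability of zero.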
Now, this is where the `user-agent` comes into play. In our Bayesian network, all the fingerprint fields are based on the `user-agent` field. For example, let's say our training dataset had 5 records in total, 2 with `user-agent: 'desktop'`, 3 with `user-agent: 'mobile'`. The other fields are based on those - e.g. for `screenResolution`, the probability distribution of screen sizes will be skewed towards smaller screens with `user-agent:mobile`. Every fingerprint combination with non-zero conditional probability must have existed in the training data - this way, we ensure we're generating convincing fingerprints all the time.
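The five-record example can be sketched roughly as follows. This is a simplified stand-in, not the library's implementation: the screen resolutions are made up, the function names are hypothetical, and the network has only two fields. It draws the `userAgent` first, then draws `screenResolution` only from the records that share that `userAgent`.

```javascript
// Made-up dataset matching the example: 2 desktop records, 3 mobile records.
const trainingData = [
  { userAgent: 'desktop', screenResolution: '1920x1080' },
  { userAgent: 'desktop', screenResolution: '2560x1440' },
  { userAgent: 'mobile', screenResolution: '390x844' },
  { userAgent: 'mobile', screenResolution: '390x844' },
  { userAgent: 'mobile', screenResolution: '412x915' },
];

// Picks a value of `key` with probability proportional to how often it occurs in `rows`.
function sampleByFrequency(rows, key, random = Math.random) {
  const index = Math.floor(random() * rows.length);
  return rows[index][key];
}

// Ancestral sampling: draw the parent first, then the child given the parent.
function sampleFingerprint(rows, random = Math.random) {
  const userAgent = sampleByFrequency(rows, 'userAgent', random);
  const matching = rows.filter((row) => row.userAgent === userAgent);
  const screenResolution = sampleByFrequency(matching, 'screenResolution', random);
  return { userAgent, screenResolution };
}

// P(userAgent = 'mobile') is 3 out of 5 in this dataset.
const pMobile =
  trainingData.filter((row) => row.userAgent === 'mobile').length / trainingData.length;

// Every sampled combination should be one that exists in the training data.
const seen = new Set(trainingData.map((row) => `${row.userAgent}|${row.screenResolution}`));
const samples = Array.from({ length: 1000 }, () => sampleFingerprint(trainingData));
const allSeen = samples.every((s) => seen.has(`${s.userAgent}|${s.screenResolution}`));

console.log(pMobile, allSeen);
```

Because the child is only ever drawn from rows that share the sampled parent, a pairing such as `desktop` with `390x844` can never be produced here.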
Because of this, the `user-agent` strings need to be sampled from our collection of known user-agents. If you were to submit your own free-form user-agent string, it might not be in the conditional probability tables for the other fingerprint fields and the `header-generator` would not be able to generate the fingerprint.
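To make the failure mode concrete, here is a small sketch of a lookup against a conditional probability table. The table contents, the function name `sampleScreenResolution`, and the error message are all hypothetical. The sketch only shows the general idea that a user agent with no row in the table leaves nothing to sample from, and it does not reproduce how `header-generator` actually behaves.

```javascript
// Hypothetical conditional probability table: P(screenResolution | userAgent).
// The values are made up; only user agents seen during training have a row.
const screenResolutionGivenUserAgent = {
  desktop: { '1920x1080': 0.5, '2560x1440': 0.5 },
  mobile: { '390x844': 2 / 3, '412x915': 1 / 3 },
};

function sampleScreenResolution(userAgent, random = Math.random) {
  const known = Object.prototype.hasOwnProperty.call(screenResolutionGivenUserAgent, userAgent);
  if (!known) {
    // No row means no distribution to draw from for this user agent.
    throw new Error(`No conditional probabilities for user agent: ${userAgent}`);
  }

  const row = screenResolutionGivenUserAgent[userAgent];
  const values = Object.keys(row);
  let threshold = random();
  for (const value of values) {
    threshold -= row[value];
    if (threshold < 0) return value;
  }
  // Guard against floating-point rounding: fall back to the last value.
  return values[values.length - 1];
}

console.log(sampleScreenResolution('mobile'));

try {
  sampleScreenResolution('MyCustomAgent/1.0');
} catch (error) {
  console.log(error.message);
}
```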
Unfortunately, this makes this feature a `wontfix` for me... But we're still curious! Is there a use case you have for this? We'd love to hear it! Hopefully, we'll be able to find another way around the problem you're trying to solve.
Cheers!
No worries and thanks for the explanation, it's really helpful to understand how the library works.
I asked because I already have a collection of the most used user agents that is updated periodically: https://github.com/microlinkhq/top-user-agents/blob/master/src/mobile.json
This data is collected from more than 100M requests performed every month, so the sample is large enough.
In order to simulate real traffic, I want to generate realistic headers based on the user agent as input. I already did some tuning with https-tls for the TLS fingerprint, but I thought that maybe I could use fingerprint-suite to get realistic browser headers (`sec-*`, etc).
I noticed that the library outputs the headers at the end of the process, which is the thing I need, so I played a bit with the code to see if I could get similar headers as output using a user agent as input.
I still think it's possible if I found a way to turn the user agent into a unique browserlist match, or any other way to connect it before going into the Bayesian network 😆 but I totally understand it's not the point of the project.
Hello,
I love the library, I have been playing with it. It's very complete with lots of data 👏.
I was wondering if it would be possible to get headers from an input user agent instead of relying on browserlist.
So this is supported today:
and that is what I'm suggesting:
This would be extremely helpful for having more granular control when debugging which cases can be detected and which cannot.