apify / fingerprint-suite

Browser fingerprinting tools for anonymizing your scrapers. Developed by Apify.
Apache License 2.0

header-generator fails luminati botcheck #152

Open corford opened 1 year ago

corford commented 1 year ago

Describe the bug

Multiple header and TLS tests fail when visiting: https://botcheck.luminati.io/

To Reproduce

Headers present on request (when emulating Chrome 110 Windows):

Host: headers.cf
Connection: close
Content-Length: 0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
Sec-Fetch-Mode: navigate
Sec-Fetch-Dest: document
Sec-Fetch-Site: same-site
Sec-Fetch-User: ?1
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Sec-Ch-Ua: "Chromium";v="110", "Not A(Brand";v="24", "Google Chrome";v="110"
Sec-Ch-Ua-Mobile: ?0
Sec-Ch-Ua-Platform: "Windows"

Response from botcheck:

Type navigate
PASS User agent
FAIL Header values: sent headers do not match what is expected
  sec-fetch-site
    + same-site
    - none
PASS Header case
FAIL Header order: header order is incorrect
  :path
    + 1
    - 3
  :authority
    + 2
    - 1
  :scheme
    + 3
    - 2
PASS HTTP version
PASS TLS version
FAIL TLS cipher
  + 130113021303c02bc02fc02cc030cca9cca8c013c014009c009d002f003500ff
FAIL Http2 settings
  headerTableSize
    + 4096
    - 65536
  initialWindowSize
    + 33554432
    - 6291456
  maxConcurrentStreams
    + 4294967295
    - 1000
  maxHeaderListSize
    + 4294967295
    - 262144
  maxHeaderSize
    + 4294967295
    - 262144

Headers present on request (when emulating Firefox 110 Windows):

Host: headers.cf
Connection: close
Content-Length: 0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/110.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip, deflate, br
Upgrade-Insecure-Requests: 1
Sec-Fetch-Mode: navigate
Sec-Fetch-Dest: document
Sec-Fetch-Site: same-site
Sec-Fetch-User: ?1

Response from botcheck:

Type navigate
PASS User agent
PASS Header values
WARN Tricky headers: using these headers incorrectly may impact success rate
  - sec-fetch-site
  - sec-fetch-mode
  - sec-fetch-user
  - sec-fetch-dest
PASS Header case
PASS Header order
PASS HTTP version
PASS TLS version
FAIL TLS cipher
  + 130113031302c02bc02fcca9cca8c02cc030c00ac009c013c014009c009d002f003500ff
FAIL Http2 settings
  headerTableSize
    + 4096
    - 65536
  enablePush
    + false
    - true
  initialWindowSize
    + 33554432
    - 131072

Expected behaviour

All tests should pass (which is the case if you visit https://botcheck.luminati.io/ using a real Chrome 110 or Firefox 110 browser on Windows).

mnmkng commented 1 year ago

@Equidem could you please take a look at the header orders and incorrect headers present together? That points to some issue in the generation code or the way we process the fingerprints we collect. Since we work with correct fingerprints, we should not get incorrect results.

The TLS is a bit tricky because Node.js does not allow the same level of configuration, but we can try to look again at this as well.
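
For reference, the "TLS cipher" value in the reports looks like a concatenation of 16-bit cipher suite IDs in the order the client offered them. That reading is an assumption based on the string's shape, not something botcheck documents in this thread. A small sketch of a decoder (the function name `decodeCipherList` and the short name table are illustrative only) makes the list easier to compare against a capture from a real browser:

```typescript
// Names for a few well-known suite IDs; everything else stays as raw hex.
const KNOWN_SUITES = new Map<string, string>([
    ["1301", "TLS_AES_128_GCM_SHA256"],
    ["1302", "TLS_AES_256_GCM_SHA384"],
    ["1303", "TLS_CHACHA20_POLY1305_SHA256"],
    ["00ff", "TLS_EMPTY_RENEGOTIATION_INFO_SCSV"],
]);

// Split the hex string into 4-character (16-bit) IDs, preserving order.
function decodeCipherList(hex: string): string[] {
    if (hex.length % 4 !== 0) {
        throw new Error("cipher list must be a multiple of 4 hex characters");
    }
    const ids = hex.toLowerCase().match(/.{4}/g) ?? [];
    return ids.map((id) => KNOWN_SUITES.get(id) ?? `0x${id}`);
}

// Value reported for the emulated Chrome 110 run.
const chromeRun =
    "130113021303c02bc02fc02cc030cca9cca8c013c014009c009d002f003500ff";

console.log(decodeCipherList(chromeRun));
```

Decoding both reported strings shows that the Chrome and Firefox runs offered different suite orders, which matters because the order is part of what is being compared.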

barjin commented 1 year ago

I guess @Equidem won't recognize his own code; I have basically rewritten the whole thing since it was forged in the ancient flames 😄

This is likely related to apify/got-scraping#65 (mentioning the same problems) and will be partially solved by #149, which introduces an automatic way of updating the header orders. As far as I can see, the only incorrect header is sec-fetch-site, which does not identify the user; it describes the context in which the request was made (think CORS: same-site, cross-site, ...). Since got-scraping cannot execute client-side JS, the only valid value here is none (a user-initiated request). This is not a problem with the collected data but with our methodology, which is easy to fix in got-scraping.
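
A minimal sketch of what such a fix could look like as a post-processing step on generated headers. This is not the actual got-scraping or header-generator change; `HeaderMap` and `forceTopLevelNavigation` are hypothetical names, and the step assumes every request is a top-level, user-initiated navigation.

```typescript
type HeaderMap = Record<string, string>;

// Hypothetical post-processing step: rewrite Sec-Fetch-Site to "none"
// while keeping every other header, its casing, and its position.
function forceTopLevelNavigation(headers: HeaderMap): HeaderMap {
    const result: HeaderMap = {};
    for (const [name, value] of Object.entries(headers)) {
        // Header names are case-insensitive, so compare in lower case.
        result[name] = name.toLowerCase() === "sec-fetch-site" ? "none" : value;
    }
    return result;
}

// Example using a few of the headers from the Chrome 110 report.
const generated: HeaderMap = {
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Site": "same-site",
    "Sec-Fetch-User": "?1",
};

console.log(forceTopLevelNavigation(generated));
```

Rebuilding the object entry by entry, rather than assigning the key afterwards, keeps the original header order intact, which matters given the separate header-order check.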

yovanoc commented 1 year ago

Even Arc or Opera don't pass this check.

tenkuken commented 1 year ago

I found some antibot detection based on HTTP header order: https://my.f5.com/manage/s/article/K13527565
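
As an illustration of why order matters for this kind of detection, here is a rough sketch that reorders a header map to follow a reference order. The helper name `sortHeadersByReference` is hypothetical. The Chrome pseudo-header order used in the example is inferred from the report at the top of this issue, assuming its positions are zero-based with `:method` first.

```typescript
type OrderedHeaders = Record<string, string>;

// Hypothetical helper: return a copy of `headers` whose keys follow
// `referenceOrder`. Headers missing from the reference go last and keep
// their original relative order.
function sortHeadersByReference(
    headers: OrderedHeaders,
    referenceOrder: string[],
): OrderedHeaders {
    const rank = new Map<string, number>();
    referenceOrder.forEach((name, index) => rank.set(name.toLowerCase(), index));

    const entries = Object.entries(headers).map(([name, value], position) => ({
        name,
        value,
        position,
        rank: rank.get(name.toLowerCase()) ?? Number.MAX_SAFE_INTEGER,
    }));
    // Sort by reference rank first, then by original position.
    entries.sort((a, b) => a.rank - b.rank || a.position - b.position);

    const result: OrderedHeaders = {};
    for (const entry of entries) {
        result[entry.name] = entry.value;
    }
    return result;
}

// Order inferred from the "-" lines of the Chrome 110 report (an assumption).
const chromePseudoOrder = [":method", ":authority", ":scheme", ":path"];

const asSent: OrderedHeaders = {
    ":method": "GET",
    ":path": "/",
    ":authority": "botcheck.luminati.io",
    ":scheme": "https",
};

console.log(Object.keys(sortHeadersByReference(asSent, chromePseudoOrder)));
```

Whether a given HTTP client preserves this object order on the wire depends on the client and is not covered by this sketch.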

Suniron commented 1 year ago

Edge Chromium doesn't pass the check either... 🙄