apify / got-scraping

HTTP client made for scraping based on got.

feat: add Cloudflare blocking benchmarking #114

Closed barjin closed 8 months ago

barjin commented 8 months ago

Port of the fingerprint-suite benchmark for comparing the anti-blocking capabilities of different got versions (got vs got-scraping).

A quick run shows that the ESM port didn't break the anti-blocking features in any way. Here are the results:

got-scraping@latest (3.2.15)
{ passed: 103, blocked: 21, failed: 15 }
---
got-scraping@4.0.0
{ passed: 103, blocked: 20, failed: 16 }
---
got@13.0.0
{ passed: 69, blocked: 0, failed: 70 }
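
For readers unfamiliar with the three counters, here is a minimal sketch of how probe results could be tallied into `passed` / `blocked` / `failed`. The benchmark's real classification rules are not shown in this thread, so `ProbeResult`, `classify`, and the status codes treated as "blocked" are all assumptions made for illustration.

```typescript
type Outcome = 'passed' | 'blocked' | 'failed';
type Tally = Record<Outcome, number>;

// Hypothetical shape of one probe: an HTTP status code, or an error message
// when the request never completed (timeout, DNS failure, ...).
interface ProbeResult {
    statusCode?: number;
    error?: string;
}

// ASSUMED heuristic, not the benchmark's actual rule:
// - no response at all          -> failed
// - 403 / 429 / 503             -> blocked (typical anti-bot responses)
// - any 2xx                     -> passed
// - everything else             -> failed
function classify(result: ProbeResult): Outcome {
    if (result.error !== undefined || result.statusCode === undefined) return 'failed';
    if ([403, 429, 503].includes(result.statusCode)) return 'blocked';
    return result.statusCode >= 200 && result.statusCode < 300 ? 'passed' : 'failed';
}

// Count how many probes fall into each outcome.
function tally(results: ProbeResult[]): Tally {
    const counts: Tally = { passed: 0, blocked: 0, failed: 0 };
    for (const r of results) counts[classify(r)] += 1;
    return counts;
}

const sample = tally([
    { statusCode: 200 },
    { statusCode: 403 },
    { error: 'timeout' },
    { statusCode: 500 },
]);
console.log(sample); // { passed: 1, blocked: 1, failed: 2 }
```

Under this reading, `failed` covers anything that never produced a usable response, which is why it can shift between runs for reasons unrelated to blocking.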

@B4nan, @vladfrangu I suppose there is nothing else blocking the release of got-scraping@4.0.0 then? :)

B4nan commented 8 months ago

Nice!

@vladfrangu I would very much like to ship this (including the crawlee version which will use it) before the Monday sprint demo. We don't need to wait for the two issues if fixing them won't be breaking (which I don't think it will). So if we don't get a response in the next few hours, I'd just ship it tomorrow and improve those two things later.

B4nan commented 8 months ago

I see one difference in the tests: one test was blocked before and now it fails completely. What exactly does that mean?

got-scraping@latest (3.2.15)
{ passed: 103, blocked: 21, failed: 15 } // blocked 21
---
got-scraping@4.0.0
{ passed: 103, blocked: 20, failed: 16 } // blocked 20

vladfrangu commented 8 months ago

> Nice!
>
> @vladfrangu I would very much like to ship this (including the crawlee version which will use it) before the Monday sprint demo. We don't need to wait for the two issues if fixing them won't be breaking (which I don't think it will). So if we don't get a response in the next few hours, I'd just ship it tomorrow and improve those two things later.

Agreed! I also can't exactly see why it would happen, and without some repro samples, I'd say let's get this out! I'll rebase my got-scraping PRs for crawlee after the release 🎉

barjin commented 8 months ago

> I see one difference in the tests: one test was blocked before and now it fails completely. What exactly does that mean?
>
> got-scraping@latest (3.2.15)
> { passed: 103, blocked: 21, failed: 15 } // blocked 21
> ---
> got-scraping@4.0.0
> { passed: 103, blocked: 20, failed: 16 } // blocked 20

Bear in mind that these are the results of three consecutive runs (no reruns etc.), so any discrepancy of this size is just statistical noise (an unresponsive server etc.). I've run it again just now and the results are different, but you can still see the benefit of the header injection.

got-scraping@4.0.0
{ passed: 104, blocked: 20, failed: 15 }
---
got
{ passed: 72, blocked: 0, failed: 67 }
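
To put the numbers in this thread side by side, the following sketch computes pass rates and the per-category difference between two runs. The tallies are copied from the first set of results above; the helper names (`RunTally`, `total`, `passRate`, `diff`) are made up for this example and are not part of the benchmark.

```typescript
interface RunTally {
    passed: number;
    blocked: number;
    failed: number;
}

// Tallies as reported in the first benchmark run in this thread.
const runs: Record<string, RunTally> = {
    'got-scraping@3.2.15': { passed: 103, blocked: 21, failed: 15 },
    'got-scraping@4.0.0': { passed: 103, blocked: 20, failed: 16 },
    'got@13.0.0': { passed: 69, blocked: 0, failed: 70 },
};

// Total number of tests in a run.
const total = (t: RunTally): number => t.passed + t.blocked + t.failed;

// Pass rate as a percentage, rounded to one decimal place.
const passRate = (t: RunTally): number => Math.round((1000 * t.passed) / total(t)) / 10;

// Per-category change going from run `a` to run `b`.
const diff = (a: RunTally, b: RunTally): RunTally => ({
    passed: b.passed - a.passed,
    blocked: b.blocked - a.blocked,
    failed: b.failed - a.failed,
});

for (const [name, t] of Object.entries(runs)) {
    console.log(`${name}: ${passRate(t)}% of ${total(t)} passed`);
}
// got-scraping@3.2.15: 74.1% of 139 passed
// got-scraping@4.0.0: 74.1% of 139 passed
// got@13.0.0: 49.6% of 139 passed

console.log(diff(runs['got-scraping@3.2.15'], runs['got-scraping@4.0.0']));
// { passed: 0, blocked: -1, failed: 1 }
```

The two got-scraping versions differ by a single test moving from `blocked` to `failed` out of 139, while plain got passes roughly 25 percentage points fewer tests than either.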