fabienvauchelles / scrapoxy

Scrapoxy is a super proxy aggregator, allowing you to manage all proxies in one place 🎯, rather than spreading them across multiple scrapers 🕸️. It also smartly handles traffic routing 🔀 to minimize bans and increase success rates 🚀.
http://scrapoxy.io
MIT License
2.05k stars · 237 forks

Zyte API Requests consumption when reaching FINGERPRINT_URL #256

Closed IvanSaldikov closed 1 month ago

IvanSaldikov commented 1 month ago

Current Behavior

When Scrapoxy reaches the FINGERPRINT_URL, it consumes the Zyte API request quota, which is wasteful.

Expected Behavior

When Scrapoxy reaches the FINGERPRINT_URL, it SHOULD NOT consume the Zyte API request quota. Maybe use a direct connection instead, so no quota is consumed?

Steps to Reproduce

  1. `docker compose up -d`
  2. Add the Zyte API connector with its credentials and enable it.
  3. Go to the Zyte website dashboard, choose Stats -> Requests on the left, select Today, and see the requests to scrapoxy.io (see the screenshot).

(Screenshot: Zyte dashboard, Stats -> Requests, showing requests to scrapoxy.io)

Failure Logs

No response

Scrapoxy Version

4.16.0

Custom Version

Deployment

Operating System

Storage

Additional Information

No response

fabienvauchelles commented 1 month ago

Hi @IvanSaldikov,

Thank you for using Scrapoxy!

I understand your concern, but Scrapoxy does need to connect to the fingerprint URL to establish the link and check if the proxy is working properly. You can find more details about this here: https://scrapoxy.io/intro/qna#how-much-bandwith-does-the-fingerprint-use.

To help reduce bandwidth, you can set the proxy ping interval to the maximum (30 seconds). It might also help to use the warm/hot status mechanism, which is often a good practice with Scrapoxy.

Best regards, Fabien

pbrns commented 1 month ago

I noticed the same thing: I had 10,000+ requests in one day. The default proxy timeout setting was 5 seconds, and I also have proxy auto-rotate set to 2-5 minutes.

Even the 30-second interval is, I think, still very heavy: many requests and a lot of bandwidth will be consumed.

But for @IvanSaldikov, as I am also new to this project, I think the solution is as mentioned here: "Scrapoxy requires a minimum number of proxies to maintain a stable connection; otherwise, all requests will fail. This remaining connection is essential for detecting whether Scrapoxy is receiving any activity. If traffic is detected and Auto Scale Up is enabled, Scrapoxy will change the project's status from CALM to HOT.

If you prefer not to keep at least one proxy active, please disable Auto Scale Up and use the API to manually change the project's status."

My idea on that: turn Auto Scale off and set the project status to OFF instead of CALM. Then, through the API, set the project status to CALM or HOT, wait until the connector is up (poll api/scraper/project/connectors until the proxies array has a length above zero), and only then make requests through Scrapoxy. This way there are no fingerprint requests while the project is idle.
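The workflow above can be sketched roughly as follows. This is a minimal sketch, not a tested integration: the base URL, credentials, the `/scraper/project/status` endpoint and its `{"status": ...}` body are assumptions based on this comment; only `api/scraper/project/connectors` is named in the thread, so check the Scrapoxy API documentation for the exact routes and payloads.

```python
import time

import requests  # third-party HTTP client

# Hypothetical values: adjust to your deployment and project credentials.
SCRAPOXY_API = "http://localhost:8890/api"
AUTH = ("project-username", "project-password")


def has_active_proxies(connectors) -> bool:
    """Return True if any connector reports at least one proxy online."""
    return any(len(c.get("proxies", [])) > 0 for c in connectors)


def wake_and_wait(timeout: float = 120.0) -> None:
    # 1. Switch the project from OFF/CALM to HOT via the API
    #    (endpoint and body are assumptions; see the Scrapoxy docs).
    requests.post(
        f"{SCRAPOXY_API}/scraper/project/status",
        json={"status": "HOT"},
        auth=AUTH,
        timeout=10,
    )
    # 2. Poll the connectors endpoint until a proxy is up.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        resp = requests.get(
            f"{SCRAPOXY_API}/scraper/project/connectors",
            auth=AUTH,
            timeout=10,
        )
        if has_active_proxies(resp.json()):
            return  # 3. Safe to start sending requests through Scrapoxy
        time.sleep(5)
    raise TimeoutError("no proxy came online before the deadline")
```

Once `wake_and_wait()` returns, run the scraping job, then set the status back to OFF (or CALM) the same way so no fingerprint traffic accrues between runs.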

fabienvauchelles commented 1 month ago

@pbrns Nicely put, that’s exactly the point.

Scrapoxy is indeed a proxy manager designed specifically for web scraping. In most web scraping cases, maintaining a connection (or VPS) without usage isn’t necessary, unless you’re planning to resell the connection like proxy vendors do, which isn't what Scrapoxy is built for.

While it's possible to extend the timeout to 60 seconds or more, doing so would compromise the circuit breaker functionality, which is something I’d prefer to avoid.

I highly recommend using the API to optimise fingerprint request consumption.