claffin / cloudproxy

Hide your scrapers IP behind the cloud. Provision proxy servers across different cloud providers to improve your scraping success.
https://cloudproxy.io/
MIT License

Requests to AWS start throwing [Errno 113] No route to host #69

Closed: secretobserve closed this issue 1 year ago

secretobserve commented 1 year ago

I've run into an issue that I can't seem to pinpoint so I'm not sure if it's due to CloudProxy (TinyProxy).

I've set up CloudProxy to run in Docker with 15 AWS Spot instances. Then I've written a Python Flask script that fetches the IPs from CloudProxy once every minute, accepts a URL (via GET request), and returns the HTML page fetched through one of the AWS proxies. I'm doing it this way because my original application, which consumes the HTML data, doesn't let me set the user agent, so I need to go through a proxy that does.

This is the fetch line in the Flask application (proxy):

proxies = {"http": proxy, "https": proxy}
resp = requests.get(url, headers=headers, proxies=proxies, timeout=5, allow_redirects=True, stream=True)
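For context, the middleman described above might be wired up roughly like this. This is a minimal sketch, not the actual script: the CloudProxy API address, its response shape, the `/fetch` route, and the header values are all assumptions.

```python
import random

import requests
from flask import Flask, Response, request

app = Flask(__name__)
proxies_pool = []  # refreshed once a minute in the setup described above


def refresh_proxies():
    """Fetch the list of live proxy IPs from the CloudProxy API.

    The endpoint URL and JSON shape here are assumptions for illustration.
    """
    resp = requests.get("http://localhost:8000/", timeout=5)
    resp.raise_for_status()
    proxies_pool[:] = resp.json().get("ips", [])


@app.route("/fetch")
def fetch():
    """Fetch the requested URL through a random proxy and relay the HTML."""
    url = request.args["url"]
    proxy = random.choice(proxies_pool)
    proxy_url = f"http://{proxy}"
    resp = requests.get(
        url,
        headers={"User-Agent": "Mozilla/5.0"},  # UA the original app can't set
        proxies={"http": proxy_url, "https": proxy_url},
        timeout=5,
        allow_redirects=True,
    )
    return Response(resp.content, status=resp.status_code)
```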

It can run fine for hours until suddenly all my AWS instances start dying. I went through the CloudProxy code and identified that the restarts were due to the ALIVE checks failing. So I disabled that code and also added some exception handling in my own application. That stopped the instances from dying, but didn't solve the original issue.
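The exception handling added on the application side could look something like the sketch below: catch the proxy failure and retry through a different instance instead of letting one dead proxy fail the call. The function name and retry policy are illustrative, not the poster's actual code.

```python
import requests


def fetch_via_proxies(url, proxy_ips, headers=None, attempts=3):
    """Try up to `attempts` proxies in turn, skipping ones that refuse us."""
    if not proxy_ips:
        raise ValueError("no proxies supplied")
    last_error = None
    for proxy in proxy_ips[:attempts]:
        proxy_url = f"http://{proxy}"
        try:
            return requests.get(
                url,
                headers=headers,
                proxies={"http": proxy_url, "https": proxy_url},
                timeout=5,
                allow_redirects=True,
            )
        except (requests.exceptions.ProxyError,
                requests.exceptions.ConnectTimeout) as exc:
            # The "[Errno 113] No route to host" case surfaces here as a
            # ProxyError wrapping a NewConnectionError.
            last_error = exc
    raise last_error
```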

It turned out that the code line above (requests.get) suddenly starts throwing the following error:

HTTPConnectionPool(host='X.X.X.X', port=8899): Max retries exceeded with url: http://www.url.com?page=1 (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f39cd7d7550>: Failed to establish a new connection: [Errno 113] No route to host')))

I've masked the IP and url for privacy reasons.

So basically, my scripts run for hours with 1-2 requests every second until all the requests suddenly start spitting out the exception above, to the point that it bogs down my entire WiFi. The internet on all of my computers almost stops responding. The only solution is to stop the requests, wait a few minutes, and then resume like nothing happened.

After fixing the Spot instances dying, my second theory was that I was hitting some kind of TCP limit in AWS, so I upgraded my instances from Nano to Micro, with no apparent improvement. I also considered a Docker issue, but I only fetch the ALIVE IPs once every minute, so I can't see how that would be limited in any way. And I don't see it being a TinyProxy limit, since my 1-2 requests per second are spread across 15 different AWS instances.

Do you know if there is any AWS limit I'm hitting or have you experienced anything similar with CloudProxy?

claffin commented 1 year ago

The ALIVE check is very basic: CloudProxy makes a request through the deployed proxy, in this case your AWS proxy, to https://api.ipify.org/ and checks that the response matches the IP of the proxy you've deployed. Ipify has no rate limits, so it shouldn't be causing this.
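The check described above can be sketched as follows. This is a simplified illustration of the idea, not CloudProxy's actual implementation; the function name, port, and timeout are assumptions.

```python
import requests


def check_alive(proxy_ip, port=8899, timeout=10):
    """Return True if the proxy's exit IP (per ipify) matches its own IP."""
    proxy_url = f"http://{proxy_ip}:{port}"
    try:
        resp = requests.get(
            "https://api.ipify.org/",
            proxies={"http": proxy_url, "https": proxy_url},
            timeout=timeout,
        )
        # Healthy only if the request actually exited through this instance.
        return resp.text.strip() == proxy_ip
    except requests.RequestException:
        # Connection refused, no route to host, timeout, etc.
        return False
```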

Is there a chance a local firewall or ISP issue is causing dropouts? When this issue occurs, are you able to use the proxies as normal? Though I assume CloudProxy may start replacing them all before you can test this.

secretobserve commented 1 year ago

Thank you for the reply.

I agree that the ALIVE check is an unlikely culprit, other than perhaps contributing to network congestion (45 requests per minute). However, I believe commit 76581297f87e2d52c0032cd855187f9d92f01776 bypassed the ipify check, so fetch_ip() isn't used anymore? Or the GitHub search might be tricking me. Regardless, CloudProxy killing instances is more likely a symptom of the instances refusing connections (and thereby failing the ALIVE check) than CloudProxy actually causing the issue.

My main theories were that there is some kind of limit on AWS, or that TinyProxy isn't shutting down open connections, causing the EC2 instance to overload and refuse new connections. This is difficult to verify, though, so I was primarily hoping this might be a known issue. If I'm the only one having the problem, it's more likely to be on my end.
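One way to probe the "refusing new connections" theory is a raw TCP connect to the proxy port: an overloaded TinyProxy with a full backlog shows up differently (refused or timed out) than the route failure in the traceback (errno 113, EHOSTUNREACH). A hedged sketch, with the host and port as placeholders:

```python
import errno
import socket


def probe(host, port=8899, timeout=3):
    """Classify a TCP connect to host:port as open/refused/no-route/timeout."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return "open"
    except ConnectionRefusedError:
        return "refused"       # port reachable but nothing accepting
    except OSError as exc:
        if exc.errno == errno.EHOSTUNREACH:
            return "no-route"  # the [Errno 113] case from the traceback
        if isinstance(exc, socket.timeout):
            return "timeout"   # silently dropped, e.g. full accept backlog
        return "error"
    finally:
        sock.close()
```

Running this from the scraper host while the exceptions are firing would show whether the failure is at the routing layer or at the proxy process itself.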

I finally figured out a way to make the AWS requests directly from my main application, so I no longer need the Python Flask script as a middleman. This should halve the number of local requests, although it doesn't change the number of outgoing requests to AWS. It seems to be working fine for now, so this issue can be marked as solved. If the problem continues, I'll just have to dial down the number of requests.

claffin commented 1 year ago

You're right, fetch_ip() isn't used anymore.

Let me know if the issue arises again. I currently run weekly automated integration tests against all cloud providers; I'll look to extend one of the tests and leave it running for a few days to see if I can reproduce the error.