Skallwar / suckit

Suck the InTernet
Apache License 2.0

Proxy support #176

Closed raphCode closed 2 years ago

raphCode commented 2 years ago

I tried using a proxy to download via another IP, but couldn't get it to work via the http_proxy or HTTP_PROXY environment variables:

First, check the "normal" public IP. Then set a proxy (I tried some from this list until one worked: https://freeproxylists.net/) and check that it works with the curl command, which should return the proxy IP. Lastly, run suckit with the proxy and observe that the IP in the downloaded webpage is still the "normal" IP, as if no proxy were in use.

curl ifconfig.me
prox=175.144.112.239:80
http_proxy=$prox curl ifconfig.me
http_proxy=$prox suckit -v http://ifconfig.me

Still, something is done with the proxy, since an invalid IP leads to a timeout or connection failure, and the latency is increased compared to a non-proxy run.


Besides this bug, a feature idea: suckit could accept multiple proxies and split the requests across them. This could further speed up downloading, since less traffic is issued from a single IP. A rough sketch of the idea is below.
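
For illustration only, here is a minimal sketch of such a pool on top of reqwest. Nothing here exists in suckit; ProxyPool and its methods are hypothetical, and the proxies are assumed to be passed as full URLs like "http://host:port".

use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical multi-proxy pool: one blocking reqwest client per proxy,
// handed out round-robin so requests spread across several exit IPs.
struct ProxyPool {
    clients: Vec<reqwest::blocking::Client>,
    next: AtomicUsize,
}

impl ProxyPool {
    fn new(proxies: &[&str]) -> Result<Self, reqwest::Error> {
        let clients = proxies
            .iter()
            .map(|p| {
                reqwest::blocking::Client::builder()
                    .proxy(reqwest::Proxy::all(*p)?)
                    .build()
            })
            .collect::<Result<Vec<_>, _>>()?;
        Ok(Self { clients, next: AtomicUsize::new(0) })
    }

    // Pick the next client in round-robin order.
    fn client(&self) -> &reqwest::blocking::Client {
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.clients.len();
        &self.clients[i]
    }
}

Whether this would be exposed as a repeated command-line flag or a comma-separated environment variable is an open design question.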

Skallwar commented 2 years ago

Reqwest seems to support this. https://docs.rs/reqwest/latest/reqwest/struct.Proxy.html

We could read the environment variable before creating the Downloader
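
Something like this minimal sketch, assuming the blocking reqwest client; build_client is a made-up name, not suckit's actual Downloader code:

use std::env;

// Minimal sketch: read the conventional proxy variable (either casing) and
// hand it to a blocking reqwest client before the Downloader is created.
fn build_client() -> Result<reqwest::blocking::Client, reqwest::Error> {
    let mut builder = reqwest::blocking::Client::builder();
    if let Ok(proxy_url) = env::var("http_proxy").or_else(|_| env::var("HTTP_PROXY")) {
        builder = builder.proxy(reqwest::Proxy::all(proxy_url.as_str())?);
    }
    builder.build()
}

(If I read the reqwest docs right, the builder already picks up system proxy variables by default, which might explain why the variable seems to have some effect even without explicit handling.)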

raphCode commented 2 years ago

What I find strange is that the environment variable is already read and processed somehow:

But the downloaded page does not show the proxy IP for the ifconfig.me website.

Skallwar commented 2 years ago

My knowledge of proxies is quite limited, but I agree with you, something is not right

Skallwar commented 2 years ago

From what I read here, it should "just work"™

Skallwar commented 2 years ago

This works for me (with some warnings and retries for the proxy connection):

https_proxy=147.135.134.57:9300 suckit https://ifconfig.me
https_proxy=147.135.134.57:9300 suckit http://ifconfig.me

The IP I get is not my real IP.

Note that I'm using https_proxy and not http_proxy. If I use your http_proxy together with a random https_proxy, it does not work. I think that for some reason we are doing HTTPS requests even when specifying http.

raphCode commented 2 years ago

Nice catch, I can confirm it works with https_proxy and http_proxy set. It seems suckit retrieves the content over HTTPS and makes additional HTTP requests for something else. If I had to guess, there is some code that resolves URLs which is responsible for the HTTP requests. (I remember a unit test that tries to resolve an invalid lwn.net URL and looks for a redirect.)

My public server IP got blocked from scraping a particular website, so I can tell it needs both kinds of proxies to circumvent the block.


For future readers: for multithreaded downloading via proxies to work, the constants MAX_EMPTY_RECEIVES and SLEEP_MILLIS may need to be raised, otherwise all worker threads exit prematurely: they receive no work within their polling window because of the increased proxy latency.
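
To illustrate what that means, a rough sketch of the polling pattern those two constants control. This is simplified and not suckit's actual worker code; the constant values and the String channel are placeholders.

use std::sync::mpsc::Receiver;
use std::thread::sleep;
use std::time::Duration;

const MAX_EMPTY_RECEIVES: usize = 10; // illustrative value, not suckit's
const SLEEP_MILLIS: u64 = 100;        // illustrative value, not suckit's

// Each worker polls the shared queue, sleeping SLEEP_MILLIS between empty
// polls, and exits after MAX_EMPTY_RECEIVES consecutive empty polls.
fn worker(rx: &Receiver<String>) {
    let mut empty_receives = 0;
    loop {
        match rx.try_recv() {
            Ok(url) => {
                empty_receives = 0;
                println!("downloading {}", url); // real code would fetch and save the page
            }
            Err(_) => {
                empty_receives += 1;
                if empty_receives >= MAX_EMPTY_RECEIVES {
                    break; // assume the queue has drained for good
                }
                sleep(Duration::from_millis(SLEEP_MILLIS));
            }
        }
    }
}

With a slow proxy, every worker can burn through its empty polls before the first responses come back and new URLs are queued, so all threads exit while work is still pending; raising either constant gives the queue more time to fill.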

Skallwar commented 2 years ago

My public server IP got blocked from scraping a particular website

Typical SuckIT

Should we close this?

raphCode commented 2 years ago

As far as I am concerned, yes. Unless you want to keep it open for the multiple-proxy feature. That was just an idea, not something I would contribute personally.