TeamHG-Memex / scrapy-rotating-proxies

use multiple proxies with Scrapy
MIT License
738 stars 158 forks

Read proxy list from an URL #62

Open datawookie opened 3 years ago

datawookie commented 3 years ago

Hi!

We build a lot of web scrapers using Scrapy and I've been using your package for a while now. It's great for managing our multi-proxy setup.

We have been developing a proxy system that shares the proxy list via a URL. Until now I have been dumping the contents of that URL to a file so that I can read it in via ROTATING_PROXY_LIST_PATH, but this has become a bit of a pain. It occurred to me that it should be possible to read the proxy list directly from a URL.

The merge request includes a simple change to the RotatingProxyMiddleware.from_crawler() method to make that possible.

Example: Sharing proxy list at http://127.0.0.1:8800.


In settings.py I then have:

ROTATING_PROXY_LIST_PATH = 'http://127.0.0.1:8800'
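
For readers who want the idea without opening the merge request, here is a minimal sketch of loading a proxy list from either a local path or a URL. It is not the code from the merge request: the helper names `parse_proxy_lines` and `load_proxy_list` are hypothetical, and it assumes the endpoint serves plain text with one proxy per line, the same layout as a proxy list file.

```python
from pathlib import Path
from typing import List
from urllib.parse import urlparse
from urllib.request import urlopen


def parse_proxy_lines(text: str) -> List[str]:
    """Return the non-empty, stripped lines of a proxy list document."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_proxy_list(source: str, timeout: float = 10.0) -> List[str]:
    """Load proxies from an http(s) URL, otherwise from a local file path.

    Hypothetical helper: assumes one proxy per line in either case.
    """
    if urlparse(source).scheme in ("http", "https"):
        # Remote source: download the body and decode it as text.
        with urlopen(source, timeout=timeout) as response:
            text = response.read().decode("utf-8")
    else:
        # Anything else is treated as a path on the local filesystem.
        text = Path(source).read_text(encoding="utf-8")
    return parse_proxy_lines(text)
```

With a helper like this, the value of ROTATING_PROXY_LIST_PATH could be passed straight through, whether it is a file path or something like 'http://127.0.0.1:8800'.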

For context, here's a blog post about the proxy system that we are using in conjunction with scrapy-rotating-proxies.

Best regards, Andrew.

kaybeudeker commented 2 years ago

The link to your blog post should be: https://datawookie.dev/blog/2021/10/medusa-multi-headed-tor-proxy/ (instead of pointing to localhost) ;) Great work btw!

datawookie commented 2 years ago

Thanks, @kaybeudeker, I've updated the URL. Appreciate you bringing that to my attention.

Have you tried this out? I'd really appreciate any feedback.

SashiDareddy commented 2 years ago

I had a similar use case: reading proxies from a URL (specifically, an API call to a third party that returns a list of proxies, exactly like you have). I created a small utility function that uses requests.get to fetch the proxies, and I assign the result to ROTATING_PROXY_LIST in settings.py.

utility function:

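import random
from typing import List

import requests


def get_proxies(proxy_json_end_point: str) -> List[str]:
    # Fetch the proxy entries (a JSON list of strings) from the endpoint.
    r = requests.get(proxy_json_end_point)
    proxies = r.json()

    # Each entry is expected to look like "host:port;user;pwd".
    proxy_urls = [
        f"http://{user}:{pwd}@{host_port}"
        for (host_port, user, pwd) in [p.split(";") for p in proxies]
    ]
    random.shuffle(proxy_urls)
    print("Proxies:", proxy_urls)
    return proxy_urls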

settings.py

ROTATING_PROXY_LIST = get_proxies(os.getenv("PROXY_JSON_ENDPOINT"))

Note: the PROXY_JSON_ENDPOINT environment variable points to the third party's API endpoint, which returns the proxies. I used a similar approach to fetch proxies listed in a text file hosted in S3.
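
The utility function above assumes that every entry splits into exactly three `;`-separated fields and that the credentials are safe to embed in a URL as they are. As a rough sketch only, the conversion step could be made a little more defensive. The name `build_proxy_urls` is hypothetical, and the "host:port;user;pwd" entry format is inferred from the list comprehension in the function above.

```python
from typing import Iterable, List
from urllib.parse import quote


def build_proxy_urls(entries: Iterable[str]) -> List[str]:
    """Turn "host:port;user;pwd" entries into proxy URLs.

    Hypothetical helper: malformed entries are skipped and the
    credentials are percent-encoded before being placed in the URL.
    """
    urls = []
    for entry in entries:
        parts = entry.strip().split(";")
        if len(parts) != 3:
            continue  # not host:port;user;pwd, so skip it
        host_port, user, pwd = parts
        # Percent-encode characters such as "@" or ":" in the credentials.
        urls.append(f"http://{quote(user, safe='')}:{quote(pwd, safe='')}@{host_port}")
    return urls
```

Keeping this step separate from the network call also makes it easy to check without a live endpoint.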

datawookie commented 2 years ago

Hi @TeamHG-Memex, any progress on this? This PR has been languishing for a few months now. Thanks, Andrew.