constverum / ProxyBroker

Proxy [Finder | Checker | Server]. HTTP(S) & SOCKS 🎭
http://proxybroker.readthedocs.io
Apache License 2.0
3.83k stars, 1.08k forks

ANYONE???? How do I insert proxybroker into python script #133

Open adiaz509 opened 5 years ago

adiaz509 commented 5 years ago

Simply put, how do I add ProxyBroker to my scraper for proxy rotation? I want ProxyBroker to use only US proxies, e.g. countries = ['US'].

I have looked at the examples in the docs (https://proxybroker.readthedocs.io/en/latest/examples.html). I've been told: "Just do export https_proxy=http://host:port before you run the script where host:port is your proxy rotator". I understand the idea, but I don't know how to do it.
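
For reference, the quoted advice can be sketched in Python roughly as follows. This is only an illustration: it assumes a rotating proxy is already listening on 127.0.0.1:8888 (a made-up address, not something from this thread), and it uses only the standard library to show what the environment variables do. requests picks up the same variables by default, or accepts the mapping explicitly through its `proxies` argument.

```python
import os
import urllib.request

# Assumption: a rotating proxy is already listening here (made-up address).
ROTATOR_URL = "http://127.0.0.1:8888"

# Python-side equivalent of `export https_proxy=http://host:port`,
# set for the current process only.
os.environ["http_proxy"] = ROTATOR_URL
os.environ["https_proxy"] = ROTATOR_URL

# The standard library reads these variables back as a scheme -> proxy mapping.
env_proxies = urllib.request.getproxies()
print(env_proxies["https"])

# The same mapping can be passed explicitly instead of relying on the
# environment, e.g. requests.get(url, proxies=explicit_proxies).
explicit_proxies = {"http": ROTATOR_URL, "https": ROTATOR_URL}
```

With the variables set before the first request, the scraper itself should not need any other change, since every request would go through whatever is listening at that address.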

Here is my setup without ProxyBroker:

import os
import requests
import re
import backoff
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

for i in range(10):
    asin_list = ['B07K853G1K']
#------------------------------------------------------------------
    urls = []

    def fatal_code(e):
        # Give up on 4xx responses; e.response is None when no response arrived.
        return e.response is not None and 400 <= e.response.status_code < 500

    @backoff.on_exception(backoff.expo,
                  requests.exceptions.RequestException,
                  max_tries=50,
                  jitter=backoff.full_jitter,
                  giveup=fatal_code)
    @backoff.on_exception(backoff.expo,
                  requests.exceptions.HTTPError,
                  max_time=60)
    @backoff.on_exception(backoff.expo,
                  (requests.exceptions.Timeout,
                   requests.exceptions.ConnectionError),
                   max_time=60)
    def get_url(url, headers):
        # Use the headers passed in, not a global variable.
        return requests.get(url, headers=headers)

    print('Scrape Started')

    for asin in asin_list:
        product_url = f'https://www.amazon.com/dp/{asin}'
        urls.append(product_url)
        base_search_url = 'https://www.amazon.com'

        ua = UserAgent()
        print(ua.random)
        header = {'User-Agent': str(ua.random)}
        print(header)

        while len(urls) > 0:
            url = urls.pop(0)
            r = get_url(url, header)
            print("we got a {} response code from {}".format(r.status_code, url))

            soup = BeautifulSoup(r.text, 'lxml')

            #### scrape code below ####

Lomanic commented 5 years ago

I had the same question, since it's not explained on the ProxyBroker examples page. Here is a simple way to do it, adapted from one of the examples and tested interactively in the Python REPL. It may not be the best approach, as I'm not familiar with the asyncio API.

import requests
from proxybroker import Broker
import asyncio
import random

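# get n proxies from proxybroker
def getProxies(n):
    async def show(proxies):
        # Drain the queue until the broker signals the end with None.
        p = []
        while True:
            proxy = await proxies.get()
            if proxy is None:
                break
            # Build a "scheme://host:port" URL that requests understands.
            p.append("{}://{}:{}".format(proxy.schemes[0].lower(), proxy.host, proxy.port))
        return p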

    proxies = asyncio.Queue()
    broker = Broker(proxies)
    tasks = asyncio.gather(broker.find(types=['HTTP', 'HTTPS'], limit=n), show(proxies))
    loop = asyncio.get_event_loop()
    return loop.run_until_complete(tasks)[1]

def main():
    proxyPool = getProxies(5)
    random.shuffle(proxyPool)
    for proxy in proxyPool:
        try:
            print(proxy, requests.get("https://v4.ident.me/", proxies={"http": proxy, "https": proxy}).text.strip())
        except requests.exceptions.ProxyError:
            pass

if __name__ == '__main__':
    main()
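
Building on the list returned by getProxies, one way to plug rotation into a scraper is to retry each request through the next proxy when the current one fails. The following is a minimal sketch, not part of ProxyBroker: `to_requests_proxies` and `fetch_with_rotation` are hypothetical helper names, and the `fetch` callable is injected so the rotation logic does not depend on any particular HTTP library.

```python
import itertools
from typing import Any, Callable, Dict, Sequence


def to_requests_proxies(proxy_url: str) -> Dict[str, str]:
    """Build the mapping expected by requests.get(..., proxies=...)."""
    return {"http": proxy_url, "https": proxy_url}


def fetch_with_rotation(
    url: str,
    proxy_pool: Sequence[str],
    fetch: Callable[[str, Dict[str, str]], Any],
    max_attempts: int = 5,
) -> Any:
    """Try proxies in round-robin order until one request succeeds.

    `fetch` is any callable taking (url, proxies) and raising on failure.
    """
    last_error = None
    # cycle() repeats the pool; islice() caps the total number of attempts.
    for proxy_url in itertools.islice(itertools.cycle(proxy_pool), max_attempts):
        try:
            return fetch(url, to_requests_proxies(proxy_url))
        except Exception as exc:  # broad on purpose: any failure moves to the next proxy
            last_error = exc
    raise RuntimeError("no proxy succeeded for {}".format(url)) from last_error
```

With requests, `fetch` could be something like `lambda url, proxies: requests.get(url, headers=header, proxies=proxies, timeout=10)`, and `proxy_pool` the result of `getProxies(5)`. To restrict the pool to US proxies as asked in the original question, the ProxyBroker documentation describes a `countries` argument to `broker.find`, e.g. `countries=['US']`; I have not tested that here.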