EricJMarti / inventory-hunter

⚡️ Get notified as soon as your next CPU, GPU, or game console is in stock
MIT License
1.12k stars, 264 forks

proxy implementation #179

Open itsmylife44 opened 3 years ago

itsmylife44 commented 3 years ago

Hey mate, thank you for the work. Is there any chance of implementing proxy support?

After a few hours I noticed that the product HTML saved in the data folder contains a captcha ("title missing").
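To tell these cases apart when looking through the saved pages, a small check like the following might help. It is only a sketch using the standard library: the `classify_page` function, the `TitleParser` class, and the rule "the page mentions the word captcha" are assumptions made for illustration, not how inventory-hunter itself detects anything.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title' and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'title' and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def classify_page(html):
    """Return 'captcha', 'missing title', or 'ok' for a saved page (hypothetical helper)."""
    parser = TitleParser()
    parser.feed(html)
    # Assumption: a captcha page mentions the word "captcha" somewhere in its markup.
    if 'captcha' in html.lower():
        return 'captcha'
    if not parser.title.strip():
        return 'missing title'
    return 'ok'
```

Running `classify_page` over the files in the data folder would show how often the site answers with a captcha page rather than the product page.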

Maybe a proxy implementation would make it possible to avoid this.

Thank you!

EduardoSaverin commented 3 years ago

Hi @itsmylife44, I have seen multiple people here logging issues about "getting detected as a bot". A proxy should help both with being blocked by the website and with being detected as a bot. I'm thinking of adding a proxy-based layer in front so that the user's IP remains hidden.
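A rough idea of what such a layer could start from is a round-robin pool that hands out the next proxy URL and skips the ones that stopped working. This is just a sketch: `ProxyPool`, `next_proxy`, and `mark_bad` are hypothetical names, and nothing like this exists in the repository yet.

```python
import itertools


class ProxyPool:
    """Round-robin pool of proxy URLs that skips proxies marked as bad."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError('at least one proxy URL is required')
        self._proxies = list(proxies)
        self._bad = set()
        self._cycle = itertools.cycle(self._proxies)

    def next_proxy(self):
        # Look at each proxy at most once per call so we never loop forever.
        for _ in range(len(self._proxies)):
            proxy = next(self._cycle)
            if proxy not in self._bad:
                return proxy
        return None  # every proxy has been marked bad

    def mark_bad(self, proxy):
        # Call this after a proxy gets blocked or times out.
        self._bad.add(proxy)
```

The URL returned by `next_proxy()` could then be passed as the `proxy=` argument of aiohttp's `session.get(...)`, so each request leaves through a different address.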

thegraylackey commented 3 years ago

The program went from running great for me to almost not at all yesterday. I keep changing which server my VPN goes through, and it somewhat affects the behavior. Docker on Windows.

    I2021-03-21 10:48:14,014 [mzn_1] scraper initialized for https://www.amazon.com/dp/B08L8L71SM
    I2021-03-21 10:48:14,015 [bstby_1] scraper initialized for https://www.bestbuy.com/site/evga-geforce-rtx-3070-xc3-ultra-gaming-8gb-gddr6x-pci-express-4-0-graphics-card/6439299.p?skuId=6439299
    I2021-03-21 10:48:14,015 [bstby_2] scraper initialized for https://www.bestbuy.com/site/nvidia-geforce-rtx-3070-8gb-gddr6-pci-express-4-0-graphics-card-dark-platinum-and-black/6429442.p?skuId=6429442
    I2021-03-21 10:48:14,015 [bstby_3] scraper initialized for https://www.bestbuy.com/site/nvidia-geforce-rtx-3080-10gb-gddr6x-pci-express-4-0-graphics-card-titanium-and-black/6429440.p?skuId=6429440
    W2021-03-21 10:48:15,331 [mzn_1] missing title: https://www.amazon.com/dp/B08L8L71SM
    I2021-03-21 10:48:15,333 [mzn_1] not in stock
    E2021-03-21 10:48:16,546 [lean_and_mean] something went wrong during request: Server disconnected
    E2021-03-21 10:48:16,548 [bstby_1] caught exception during request: 'NoneType' object has no attribute 'text'
    E2021-03-21 10:48:16,548 [bstby_1] scrape failed
    I2021-03-21 10:48:18,293 [bstby_2] not in stock
    E2021-03-21 10:48:33,930 [lean_and_mean] something went wrong during request:
    E2021-03-21 10:48:33,935 [bstby_3] caught exception during request: 'NoneType' object has no attribute 'text'
    E2021-03-21 10:48:33,935 [bstby_3] scrape failed
    W2021-03-21 10:48:34,481 [mzn_1] missing title: https://www.amazon.com/dp/B08L8L71SM
    I2021-03-21 10:48:34,483 [mzn_1] not in stock
    E2021-03-21 10:48:34,866 [lean_and_mean] something went wrong during request: Server disconnected
    E2021-03-21 10:48:34,869 [bstby_1] caught exception during request: 'NoneType' object has no attribute 'text'
    E2021-03-21 10:48:34,869 [bstby_1] scrape failed
    I2021-03-21 10:48:36,091 [bstby_2] not in stock
    I2021-03-21 10:48:37,374 [bstby_3] not in stock
    W2021-03-21 10:48:37,733 [mzn_1] missing title: https://www.amazon.com/dp/B08L8L71SM
    I2021-03-21 10:48:37,734 [mzn_1] not in stock

...and so on

itsmylife44 commented 3 years ago

Same here.

I tried to implement a proxy in src/worker/lean_and_mean

The edit looks like this:

    headers = {
        'accept': 'text/html',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'de-DE,de;q=0.9',
        'cache-control': 'no-cache',
        'dnt': '1',
        'pragma': 'no-cache',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:86.0) Gecko/20100101 Firefox/86.0',
    }

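    timeout = request.timeout if request.timeout else 30
    # Placeholder proxy URL: replace username, password, url and port with real values.
    proxy = 'http://username:password@url:port'
    async with aiohttp.ClientSession(headers=headers) as session:
        # aiohttp sends this request through the proxy passed via `proxy=`.
        async with session.get(request.url, timeout=timeout, proxy=proxy) as r: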
            response = Response()
            response.id = request.id
            response.status_code = r.status
            response.data = await r.text()
            writer.write(response.SerializeToString())
            writer.write_eof()
            logging.info(
                f'sent response: id: {response.id}, status_code: {response.status_code}, data: <{len(response.data)} bytes>'
            )

It looks like it is working, but the worker makes too many requests and some requests never finish.
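One way to keep that under control might be to cap how many requests run at once and to put a hard timeout on each of them. The sketch below uses only asyncio. `fetch_all`, `max_concurrent`, and the injected `fetch` coroutine are hypothetical names, and `fetch` stands in for whatever actually performs the request (for example the `session.get(...)` call above). It is not the worker's real code.

```python
import asyncio


async def fetch_all(urls, fetch, max_concurrent=2, timeout=30):
    """Run fetch(url) for every URL with bounded concurrency and a hard timeout.

    Returns a dict mapping each URL to its result, or to None if it timed out.
    """
    semaphore = asyncio.Semaphore(max_concurrent)
    results = {}

    async def worker(url):
        # At most `max_concurrent` workers get past this point at a time.
        async with semaphore:
            try:
                # Cancel the request if it takes longer than `timeout` seconds.
                results[url] = await asyncio.wait_for(fetch(url), timeout)
            except asyncio.TimeoutError:
                results[url] = None

    await asyncio.gather(*(worker(url) for url in urls))
    return results
```

With a slow proxy, a stuck request is then cancelled after `timeout` seconds instead of holding a slot forever, and the semaphore keeps the number of parallel requests per container small.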

With more containers and a rotating proxy:

[image]

dinamoedm commented 3 years ago

I think that if the Amazon scraper had an account login option, it wouldn't trigger the captcha.