Open itsmylife44 opened 3 years ago
Hi @itsmylife44, I have seen that multiple people here are logging issues about "Getting detected as bot". Implementing a proxy should help with both getting blocked by the website and getting detected as a bot. I'm thinking of adding a proxy-based layer in front so that the user's IP remains hidden.
The program went from running great for me to almost not at all yesterday. I keep changing which server my VPN goes through, and that somewhat affects the behavior. Docker on Windows.
I2021-03-21 10:48:14,014 [mzn_1] scraper initialized for https://www.amazon.com/dp/B08L8L71SM
I2021-03-21 10:48:14,015 [bstby_1] scraper initialized for https://www.bestbuy.com/site/evga-geforce-rtx-3070-xc3-ultra-gaming-8gb-gddr6x-pci-express-4-0-graphics-card/6439299.p?skuId=6439299
I2021-03-21 10:48:14,015 [bstby_2] scraper initialized for https://www.bestbuy.com/site/nvidia-geforce-rtx-3070-8gb-gddr6-pci-express-4-0-graphics-card-dark-platinum-and-black/6429442.p?skuId=6429442
I2021-03-21 10:48:14,015 [bstby_3] scraper initialized for https://www.bestbuy.com/site/nvidia-geforce-rtx-3080-10gb-gddr6x-pci-express-4-0-graphics-card-titanium-and-black/6429440.p?skuId=6429440
W2021-03-21 10:48:15,331 [mzn_1] missing title: https://www.amazon.com/dp/B08L8L71SM
I2021-03-21 10:48:15,333 [mzn_1] not in stock
E2021-03-21 10:48:16,546 [lean_and_mean] something went wrong during request: Server disconnected
E2021-03-21 10:48:16,548 [bstby_1] caught exception during request: 'NoneType' object has no attribute 'text'
E2021-03-21 10:48:16,548 [bstby_1] scrape failed
I2021-03-21 10:48:18,293 [bstby_2] not in stock
E2021-03-21 10:48:33,930 [lean_and_mean] something went wrong during request:
E2021-03-21 10:48:33,935 [bstby_3] caught exception during request: 'NoneType' object has no attribute 'text'
E2021-03-21 10:48:33,935 [bstby_3] scrape failed
W2021-03-21 10:48:34,481 [mzn_1] missing title: https://www.amazon.com/dp/B08L8L71SM
I2021-03-21 10:48:34,483 [mzn_1] not in stock
E2021-03-21 10:48:34,866 [lean_and_mean] something went wrong during request: Server disconnected
E2021-03-21 10:48:34,869 [bstby_1] caught exception during request: 'NoneType' object has no attribute 'text'
E2021-03-21 10:48:34,869 [bstby_1] scrape failed
I2021-03-21 10:48:36,091 [bstby_2] not in stock
I2021-03-21 10:48:37,374 [bstby_3] not in stock
W2021-03-21 10:48:37,733 [mzn_1] missing title: https://www.amazon.com/dp/B08L8L71SM
I2021-03-21 10:48:37,734 [mzn_1] not in stock
...and so on
Same here.
I tried to implement a proxy in src/worker/lean_and_mean.
The edit looks like this:
headers = {
    'accept': 'text/html',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'de-DE,de;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:86.0) Gecko/20100101 Firefox/86.0',
}
timeout = request.timeout if request.timeout else 30
proxy = 'http://username:password@url:port'
async with aiohttp.ClientSession(headers=headers) as session:
    async with session.get(request.url, timeout=timeout, proxy=proxy) as r:
        response = Response()
        response.id = request.id
        response.status_code = r.status
        response.data = await r.text()
        writer.write(response.SerializeToString())
        writer.write_eof()
        logging.info(
            f'sent response: id: {response.id}, status_code: {response.status_code}, data: <{len(response.data)} bytes>'
        )
It looks like it is working, but the worker makes too many requests and some requests never finish.
with more containers and rotating proxy
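For the "too many requests" problem together with the rotating-proxy idea, something along these lines might work. This is only a rough sketch, not the project's code: `ProxyRotator`, `fetch_via_proxies`, the `limiter` semaphore, and the `retry_on` parameter are names I made up, and `session` is assumed to behave like the `aiohttp.ClientSession` in the snippet above.

```python
import asyncio
import itertools


class ProxyRotator:
    """Hands out proxy URLs from a fixed list in round-robin order (hypothetical helper)."""

    def __init__(self, proxies):
        proxies = list(proxies)
        if not proxies:
            raise ValueError("at least one proxy URL is required")
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self):
        return next(self._cycle)


async def fetch_via_proxies(session, url, rotator, limiter, timeout=30,
                            attempts=3, retry_on=(OSError, asyncio.TimeoutError)):
    """Fetch url, switching to the next proxy after each failed attempt.

    limiter is an asyncio.Semaphore that caps how many requests run at once.
    Returns (status, body) or raises RuntimeError once all attempts failed.
    """
    last_error = None
    for _ in range(attempts):
        proxy = rotator.next_proxy()
        try:
            # Hold a semaphore slot only for the duration of the request.
            async with limiter:
                async with session.get(url, timeout=timeout, proxy=proxy) as r:
                    return r.status, await r.text()
        except retry_on as exc:
            # Remember the failure and try again through the next proxy.
            last_error = exc
    raise RuntimeError(f"all {attempts} attempts failed for {url}") from last_error
```

With aiohttp you would probably pass `retry_on=(aiohttp.ClientError, asyncio.TimeoutError)` so that "Server disconnected" errors like the ones in the log also move on to the next proxy, and create the semaphore inside the running event loop, e.g. `limiter = asyncio.Semaphore(2)`.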
I think if the Amazon scraper had an account login option, it wouldn't trigger the captcha.
Hey mate, thank you for the work. Is there any chance of implementing a proxy?
Because after some hours I noticed that in the data folder, the saved HTML of the product contains a captcha, and the log says "title missing".
Maybe a proxy implementation would make it possible to avoid that.
Thank you!
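If it helps with debugging, the saved HTML could be checked for a captcha page before it gets reported as "missing title". A minimal sketch, assuming the captcha page contains one of the marker strings below; the markers and the `looks_like_captcha` name are my guesses, not something taken from the project:

```python
# Marker strings are assumptions; adjust them to what the saved captcha page actually contains.
CAPTCHA_MARKERS = ("captcha", "enter the characters you see below")


def looks_like_captcha(html: str) -> bool:
    """Return True if the page text contains any of the assumed captcha markers."""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)
```

A scraper could then log "captcha page received" instead of "missing title" and switch proxy or back off for a while.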