Closed. NorkzYT closed this issue 10 months ago
I used D4rkwat3r/aiohttp-ip-rotator and I now have it working with Playwright.
Code:
import asyncio
import os

from aiohttp_ip_rotator import RotatingClientSession
from dotenv import load_dotenv
from playwright.async_api import async_playwright

load_dotenv()


async def main():
    aws_access_key_id = os.getenv('GOOGLE_SEARCH_AWS_ACCESS_KEY_ID')
    aws_access_key_secret = os.getenv('GOOGLE_SEARCH_AWS_SECRET_ACCESS_KEY')

    async with RotatingClientSession(
        "https://ipchicken.com",
        aws_access_key_id,
        aws_access_key_secret
    ) as session:
        response = await session.get("https://ipchicken.com")
        # Get the endpoint URL, assuming this is how the lib returns it
        proxy_endpoint = response.url
        print("Proxy endpoint for Playwright:", proxy_endpoint)

        # Use the asynchronous Playwright API while the session is still open
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=False)
            page = await browser.new_page()
            # Make Playwright visit the endpoint URL
            await page.goto(str(proxy_endpoint))
            # Do other tasks...
            await page.screenshot(path='./example.png')
            await browser.close()


if __name__ == "__main__":
    asyncio.run(main())
Screenshot output 1:
Screenshot output 2:
Will this project and https://github.com/D4rkwat3r/aiohttp-ip-rotator be kept separate, or might they one day merge? @Ge0rg3 @D4rkwat3r
For scraping Google Search results, here is an example:
import asyncio
import os
import urllib.parse

from aiohttp_ip_rotator import RotatingClientSession
from dotenv import load_dotenv
from playwright.async_api import async_playwright

load_dotenv()

query = "USA Oakland CA Skyline High"
encoded_query = urllib.parse.quote(query)  # URL encode the query
# Now, insert the encoded query into the URL:
url = f"https://www.google.com/search?q={encoded_query}"


async def scrape_google_results(page):
    # Define a list to store the search results
    search_results = []
    # Wait for the search results container to be loaded
    await page.wait_for_selector('div#search')
    # Extract the search results
    # Google uses `.tF2Cxc` for individual search results
    results = await page.query_selector_all('.tF2Cxc')
    # Loop through each search result and extract the required information
    for result in results:
        title_element = await result.query_selector('h3')
        link_element = await result.query_selector('a')
        title = await title_element.inner_text() if title_element else None
        link = await link_element.get_attribute('href') if link_element else None
        # Add more extraction logic as needed
        # Append the result to the list
        search_results.append({
            'title': title,
            'link': link,
        })
    return search_results


async def scroll_to_end(page):
    """Scroll to the end of the page multiple times to load all content."""
    for _ in range(3):  # Adjust the range as needed
        await page.eval_on_selector("body", "body => window.scrollTo(0, body.scrollHeight)")
        # Wait for a second to allow content to load
        await page.wait_for_timeout(1000)


async def main():
    aws_access_key_id = os.getenv('GOOGLE_SEARCH_AWS_ACCESS_KEY_ID')
    aws_access_key_secret = os.getenv('GOOGLE_SEARCH_AWS_SECRET_ACCESS_KEY')

    async with RotatingClientSession(
        url,
        aws_access_key_id,
        aws_access_key_secret
    ) as session:
        response = await session.get(url)
        # Get the endpoint URL, assuming this is how the lib returns it
        proxy_endpoint = response.url
        print("Proxy endpoint for Playwright:", proxy_endpoint)

        # Use the asynchronous Playwright API while the session is still open
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=False)
            page = await browser.new_page()
            # Make Playwright visit the endpoint URL
            await page.goto(str(proxy_endpoint))
            # Wait until the page has finished loading
            await page.wait_for_load_state('networkidle')
            # Scroll to load all search results
            await scroll_to_end(page)
            # Do other tasks...
            await page.screenshot(path='./example.png', full_page=True)
            # Scrape Google results
            results = await scrape_google_results(page)
            for r in results:
                print(r)
            await browser.close()


if __name__ == "__main__":
    asyncio.run(main())
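As a side note on the URL-building step in the example above, here is a minimal sketch of an alternative that lets the standard library assemble the whole query string. The helper name build_search_url and the extra hl parameter in the usage line are illustrative assumptions, not part of the original script.

```python
from urllib.parse import urlencode


def build_search_url(query: str, **extra_params: str) -> str:
    """Build a Google Search URL; urlencode escapes every parameter value."""
    # "q" carries the search terms; any extra keyword arguments are appended
    # as additional query-string parameters in the order given.
    params = {"q": query, **extra_params}
    return "https://www.google.com/search?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("USA Oakland CA Skyline High"))
    print(build_search_url("USA Oakland CA Skyline High", hl="en"))
```

urlencode encodes spaces as + rather than %20, which is the usual form for query strings; either encoding should be accepted in the q parameter.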
Hey @NorkzYT, sorry for the very late reply! I don't plan on merging this with aiohttp, as for running things async I just use concurrent.futures and have not faced any issues.
For browser tools, the best approach would be to run FireProx and set the browser proxy URL to your FireProx proxy.
Hope this helps, feel free to reopen the issue if you have any concerns 😊
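To illustrate the FireProx suggestion, here is a rough sketch of how a target URL could be mapped onto a gateway endpoint before handing it to the browser (for example via page.goto). It assumes the FireProx endpoint is an API Gateway URL that forwards the remaining path and query string to a single target site. The gateway URL below is a made-up placeholder and gateway_url_for is a hypothetical helper, not part of FireProx.

```python
from urllib.parse import urlsplit


def gateway_url_for(gateway_base: str, target_url: str) -> str:
    """Map a target URL onto a gateway base URL, keeping path and query."""
    parts = urlsplit(target_url)
    # Drop the target's scheme and host; the gateway already knows its target.
    path = parts.path.lstrip("/")
    rewritten = gateway_base.rstrip("/") + "/" + path
    if parts.query:
        rewritten += "?" + parts.query
    return rewritten


# Placeholder gateway URL (assumption); use the one your own gateway reports.
GATEWAY = "https://abc123.execute-api.us-east-1.amazonaws.com/fireprox/"

if __name__ == "__main__":
    print(gateway_url_for(GATEWAY, "https://www.google.com/search?q=a+b"))
```

Links inside the loaded page still point at the original host, so any follow-up navigation would need the same rewrite to keep going through the gateway.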
@Ge0rg3
No problem. I have not yet tested running FireProx, I appreciate the information—thank you. Have a great day.
Hello, hope you are doing well.
Is there a potential way to integrate this with tools such as Playwright and Selenium?
I tried the following with Playwright, but it fails and returns net::ERR_TIMED_OUT. Do you have any ideas on how to go about this, or a way around the issue?
I do see that this needs an http proxy instead of the current REST API, and according to https://github.com/Ge0rg3/requests-ip-rotator/issues/16, an http proxy is not possible with AWS API Gateway.
I will try using https://github.com/D4rkwat3r/aiohttp-ip-rotator to see if I am able to achieve a different result.