Ge0rg3 / requests-ip-rotator

A Python library to utilize AWS API Gateway's large IP pool as a proxy to generate pseudo-infinite IPs for web scraping and brute forcing.
https://pypi.org/project/requests-ip-rotator/
GNU General Public License v3.0

🔧 Integrate with Browser Automation tools #61

Closed NorkzYT closed 10 months ago

NorkzYT commented 1 year ago

Hello, hope you are doing well.

Is there a potential way to integrate with tools such as Playwright and Selenium?

I tried the following with Playwright, but it fails with net::ERR_TIMED_OUT. Do you have any ideas on how to fix this, or a way around the issue?

from requests_ip_rotator import ApiGateway, DEFAULT_REGIONS
import os
from playwright.sync_api import sync_playwright
from dotenv import load_dotenv
load_dotenv()

# Set up and start API Gateway
aws_access_key_id = os.getenv('GOOGLE_SEARCH_AWS_ACCESS_KEY_ID')
aws_access_key_secret = os.getenv('GOOGLE_SEARCH_AWS_SECRET_ACCESS_KEY')

# Ensure that the regions you want to use are available and enabled in your AWS account.
gateway = ApiGateway("https://www.google.com", regions=DEFAULT_REGIONS,
                     access_key_id=aws_access_key_id, access_key_secret=aws_access_key_secret)
endpoints = gateway.start(force=True)

# Ensure endpoints are generated.
if not endpoints:
    raise Exception("No endpoints were created. Check your AWS configuration.")

proxy_endpoint = endpoints[0]
print("Proxy endpoint for Playwright:", proxy_endpoint)

# Use Playwright with the API Gateway as a proxy
with sync_playwright() as p:
    # Specify the correct protocol, probably HTTP
    browser = p.chromium.launch(headless=False, proxy={
                                "server": f'http://{proxy_endpoint}/ProxyStage/'})
    page = browser.new_page()
    page.goto('https://www.google.com/')
    # Do other tasks...
    page.screenshot(path='./example.png')
    browser.close()

# Shut down the API Gateway
gateway.shutdown()

I see that this needs an HTTP proxy rather than the REST API the library currently creates, and based on https://github.com/Ge0rg3/requests-ip-rotator/issues/16, an HTTP proxy is not possible with AWS API Gateway.
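For context on why a browser proxy setting cannot work here: requests-ip-rotator plugs in at the `requests` transport-adapter layer, not as a network-level proxy. The toy adapter below (hypothetical, not part of the library) shows that layer — `session.mount()` routes matching URLs through the adapter's `send()`, which is the same mechanism `ApiGateway` uses, and which a browser has no way to hook into.

```python
import requests
from requests.adapters import BaseAdapter
from requests.models import Response

class EchoAdapter(BaseAdapter):
    """Toy transport adapter: intercepts requests at the same layer
    requests-ip-rotator's ApiGateway plugs into, and returns a canned
    response instead of touching the network."""

    def send(self, request, **kwargs):
        resp = Response()
        resp.status_code = 200
        resp.url = request.url
        # Echo the requested URL back as the body.
        resp._content = request.url.encode()
        resp.request = request
        return resp

    def close(self):
        pass

session = requests.Session()
# Every request whose URL starts with this prefix goes through the adapter.
session.mount("https://example.com", EchoAdapter())

r = session.get("https://example.com/page")
print(r.status_code, r.content.decode())  # 200 https://example.com/page
```

Because the rewriting happens inside the Python `requests` session, Chromium launched by Playwright never sees it, which is why pointing the browser's `proxy.server` at a gateway endpoint times out.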

I will try using https://github.com/D4rkwat3r/aiohttp-ip-rotator to see if I am able to achieve a different result.

NorkzYT commented 1 year ago

I used D4rkwat3r/aiohttp-ip-rotator and I now have it working with Playwright.

Code:

import os
from aiohttp_ip_rotator import RotatingClientSession
from playwright.async_api import async_playwright
from asyncio import get_event_loop
from dotenv import load_dotenv
load_dotenv()

async def main():
    aws_access_key_id = os.getenv('GOOGLE_SEARCH_AWS_ACCESS_KEY_ID')
    aws_access_key_secret = os.getenv('GOOGLE_SEARCH_AWS_SECRET_ACCESS_KEY')

    async with RotatingClientSession(
        "https://ipchicken.com",
        aws_access_key_id,
        aws_access_key_secret
    ) as session:
        response = await session.get("https://ipchicken.com")
        # Get the endpoint URL, assuming this is how the lib returns it
        proxy_endpoint = response.url

        print("Proxy endpoint for Playwright:", proxy_endpoint)

        # Use asynchronous Playwright API
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=False)
            page = await browser.new_page()

            # Make Playwright visit the endpoint URL
            await page.goto(str(proxy_endpoint))

            # Do other tasks...
            await page.screenshot(path='./example.png')
            await browser.close()

if __name__ == "__main__":
    loop = get_event_loop()
    loop.run_until_complete(main())

Screenshot output 1: (image attached)

Screenshot output 2: (image attached)

NorkzYT commented 1 year ago

Will this project and https://github.com/D4rkwat3r/aiohttp-ip-rotator be kept separate or one day merge? @Ge0rg3 @D4rkwat3r

NorkzYT commented 1 year ago

For Scraping Google Search results, here is an example:

import os
from aiohttp_ip_rotator import RotatingClientSession
from playwright.async_api import async_playwright
from asyncio import get_event_loop
import urllib.parse
from dotenv import load_dotenv

load_dotenv()

query = "USA Oakland CA Skyline High"
encoded_query = urllib.parse.quote(query)  # URL encode the query

# Now, insert the encoded query into the URL:
url = f"https://www.google.com/search?q={encoded_query}"

async def scrape_google_results(page):
    # Define a list to store the search results
    search_results = []

    # Wait for the search results container to be loaded
    await page.wait_for_selector('div#search')

    # Extract the search results
    # Google uses `.tF2Cxc` for individual search results
    results = await page.query_selector_all('.tF2Cxc')

    # Loop through each search result and extract the required information
    for result in results:
        title_element = await result.query_selector('h3')
        link_element = await result.query_selector('a')

        title = await title_element.inner_text() if title_element else None
        link = await link_element.get_attribute('href') if link_element else None

        # Add more extraction logic as needed

        # Append the result to the list
        search_results.append({
            'title': title,
            'link': link,
        })

    return search_results

async def scroll_to_end(page):
    """Scroll to the end of the page multiple times to load all content."""
    for _ in range(3):  # Adjust the range as needed
        await page.eval_on_selector("body", "body => window.scrollTo(0, body.scrollHeight)")
        # Wait for a second to allow content to load
        await page.wait_for_timeout(1000)

async def main():
    aws_access_key_id = os.getenv('GOOGLE_SEARCH_AWS_ACCESS_KEY_ID')
    aws_access_key_secret = os.getenv('GOOGLE_SEARCH_AWS_SECRET_ACCESS_KEY')

    async with RotatingClientSession(
        url,
        aws_access_key_id,
        aws_access_key_secret
    ) as session:
        response = await session.get(url)
        # Get the endpoint URL, assuming this is how the lib returns it
        proxy_endpoint = response.url

        print("Proxy endpoint for Playwright:", proxy_endpoint)

        # Use asynchronous Playwright API
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=False)
            page = await browser.new_page()

            # Make Playwright visit the endpoint URL
            await page.goto(str(proxy_endpoint))

            # Wait until the page has finished loading
            await page.wait_for_load_state('networkidle')

            # Scroll to load all search results
            await scroll_to_end(page)

            # Do other tasks...
            await page.screenshot(path='./example.png', full_page=True)

            # Scrape Google results
            results = await scrape_google_results(page)
            for r in results:
                print(r)

            await browser.close()

if __name__ == "__main__":
    loop = get_event_loop()
    loop.run_until_complete(main())

Ge0rg3 commented 10 months ago

Hey @NorkzYT, sorry for the very late reply! I don't plan on merging this with aiohttp; for running things asynchronously I just use concurrent.futures and have not faced any issues.
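The concurrent.futures pattern mentioned above can be sketched as follows. The `fetch` function here is a stand-in (hypothetical); in real use it would call `session.get(url)` on a `requests.Session` that has an `ApiGateway` mounted, as shown in the requests-ip-rotator README.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for a real request; with requests-ip-rotator this would be
    # session.get(url) on a session mounted with an ApiGateway instance.
    return f"fetched {url}"

urls = [f"https://site.com/page/{i}" for i in range(5)]

# Threads work well here because the work is I/O-bound: each worker
# blocks on the network while others proceed.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, urls))

print(results)
```

Since each request through the gateway can leave from a different IP, running them concurrently does not concentrate traffic on a single address.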

For browser tools, the best approach would be to run FireProx and set the browser proxy URL to your FireProx proxy.
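One note on the FireProx approach: FireProx creates API Gateway pass-through URLs rather than a true HTTP proxy, so an alternative to a browser proxy setting (a sketch, under that assumption) is to rewrite each target URL onto the FireProx stage URL before handing it to the browser. The endpoint value below is a made-up example.

```python
from urllib.parse import urlsplit, urlunsplit

def to_fireprox(target_url, fireprox_base):
    """Rewrite a target URL onto a FireProx pass-through endpoint.

    fireprox_base is the stage URL FireProx prints after `create`, e.g.
    https://<api_id>.execute-api.<region>.amazonaws.com/fireprox/
    (hypothetical values).
    """
    target = urlsplit(target_url)
    base = urlsplit(fireprox_base)
    # Append the target's path to the stage path, keeping the query string.
    path = base.path.rstrip("/") + target.path
    return urlunsplit((base.scheme, base.netloc, path, target.query, ""))

print(to_fireprox(
    "https://www.google.com/search?q=test",
    "https://abc123.execute-api.us-east-1.amazonaws.com/fireprox/",
))
# https://abc123.execute-api.us-east-1.amazonaws.com/fireprox/search?q=test
```

The rewritten URL can then be passed to `page.goto()` directly, which is essentially what the aiohttp-ip-rotator workaround above does with `response.url`.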

Hope this helps, feel free to reopen the issue if you have any concerns 😊

NorkzYT commented 10 months ago

@Ge0rg3

No problem. I have not yet tested FireProx; I appreciate the information, thank you. Have a great day.