All-Hands-AI / OpenHands

🙌 OpenHands: Code Less, Make More
https://all-hands.dev
MIT License
33.15k stars 3.8k forks

Improve browser agent's scraping/processing of web content #4486

Open tobitege opened 3 days ago

tobitege commented 3 days ago

Summary

Currently, the axtree content generated for retrieved websites incurs a huge number of tokens and significant cost. Maybe the combination of Playwright with BeautifulSoup below can save tokens, cost, and runtime?

Here's what Gemini Pro 1.5 gave as a suggestion (general example, not within OpenHands):


You're facing a common challenge when using LLMs with web scraping: the sheer volume of HTML can overwhelm the tokenizer and lead to unnecessary costs and slower processing. Extracting the main content while preserving interactive elements is the right approach. While a single perfect package might not exist, combining existing tools and techniques can achieve this.

Here's a breakdown of a strategy using Playwright and Beautiful Soup, along with explanations and code examples:

1. Extract Content with Playwright and Beautiful Soup:

Playwright is excellent for interacting with dynamic content, while Beautiful Soup excels at parsing HTML. We'll use them together.

from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
import re

async def extract_content(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)

        # Wait for dynamic content to load (adjust as needed)
        await page.wait_for_timeout(2000)  # crude; consider page.wait_for_load_state("networkidle") instead

        html = await page.content()
        await browser.close()

        soup = BeautifulSoup(html, "html.parser")

        # Extract main content (strategies below)
        main_content = extract_main_content(soup)

        # Preserve links, adding identifiers
        modified_content = process_links(main_content)

        return modified_content

def extract_main_content(soup):
    # Strategy 1:  Using common article tags
    article_content = soup.find("article")
    if article_content:
        return article_content

    # Strategy 2:  Largest text block (heuristic)
    text_blocks = [
        block for block in soup.find_all(string=True)  # `text=True` is deprecated in bs4
        if block.parent.name not in ("style", "script", "head", "[document]")
    ]
    largest_block = max(text_blocks, key=lambda block: len(block.strip()), default=None)
    if largest_block:
        return largest_block.parent  # Return the parent element

    # Strategy 3:  CSS Selectors (requires knowledge of the page structure)
    main_content = soup.select_one(".main-content") # Example selector
    if main_content:
        return main_content

    return soup.body # Fallback: return the whole body

def process_links(content):
    link_count = 0
    for link in content.find_all("a"):
        link_count += 1
        original_href = link.get("href")
        if original_href:
            link["data-original-href"] = original_href  # keep the real URL so the agent can still navigate
            link["href"] = f"[{link_count}]({original_href})"  # compact Markdown-style label for the LLM
            link["id"] = f"link-{link_count}"
    return str(content)

# Example usage (async)
import asyncio
async def main():
    content = await extract_content("https://www.example.com")
    print(content)

asyncio.run(main())


2. Pass to LLM:

Now you can pass the modified_content (which is a string of HTML with identifiable links) to your LLM. The reduced size and structured links will make processing significantly more efficient.
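As a rough sanity check before sending anything to the LLM, you can estimate the savings with a crude chars-per-token heuristic (this is an illustrative sketch, not the model's real tokenizer; use something like tiktoken for exact counts):

```python
def rough_token_count(text: str, chars_per_token: int = 4) -> int:
    """Crude token estimate: ~4 characters per English token on average."""
    return max(1, len(text) // chars_per_token)

# Hypothetical before/after payloads
raw_html = "<div class='nav'>menu</div>" * 500 + "<article>story</article>"
main_only = "<article>story</article>"

savings = 1 - rough_token_count(main_only) / rough_token_count(raw_html)
print(f"estimated token reduction: {savings:.0%}")
```

Even a rough estimate like this is enough to decide whether the extraction strategy is paying off on a given site.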

3. Agent Interaction:

When the LLM decides to "click" a link, it can refer to the link ID (e.g., "Click link 3"). Your agent can then parse this instruction, extract the corresponding URL from the original HTML (using the ID), and navigate to that URL using Playwright.
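That round trip can be sketched with a stdlib-only toy (the helper names are hypothetical, and a real implementation should use an HTML parser rather than a regex):

```python
import re

def annotate_links(html):
    """Tag each <a href="..."> with a sequential id and record an
    id -> URL map so an instruction like "Click link 2" can be resolved."""
    link_map = {}

    def repl(match):
        n = len(link_map) + 1
        link_map[n] = match.group(1)
        return f'<a id="link-{n}" href="{match.group(1)}">'

    annotated = re.sub(r'<a\s+href="([^"]+)">', repl, html)
    return annotated, link_map

def resolve_click(instruction, link_map):
    """Parse the LLM's 'Click link N' instruction back into the original URL."""
    m = re.search(r"link\s*#?\s*(\d+)", instruction, re.IGNORECASE)
    return link_map.get(int(m.group(1))) if m else None

html = '<p><a href="https://a.example/">A</a> and <a href="https://b.example/">B</a></p>'
annotated, links = annotate_links(html)
print(resolve_click("Click link 2", links))  # https://b.example/
```

The agent would then feed the resolved URL back into Playwright's `page.goto()` to perform the actual navigation.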

This combined approach addresses your token-limit issues while retaining the interactive elements your agent needs. Remember to tailor the content-extraction strategy to the specific websites you're working with for optimal results. Consider Readability-style libraries (e.g. readability-lxml) for more advanced content extraction, and experiment with different waiting strategies in Playwright to ensure all dynamic content has loaded before parsing.

mamoodi commented 2 days ago

@neubig there have been a few issues opened for the BrowserAgent. I know there's work being done on replacing it with a better agent. Just want to confirm that my understanding is correct and that it is being worked on?

neubig commented 2 days ago

That is correct! We're hoping to release our paper and code this week, and then we'll need to do some testing to make sure that we can get it working without regressions to other scores like SWE-bench. I'll update soon.