All-Hands-AI / OpenHands

🙌 OpenHands: Code Less, Make More
https://all-hands.dev
MIT License
33.15k stars 3.8k forks

Improve browser agent's scraping/processing of web content #4486

Open tobitege opened 3 days ago

tobitege commented 3 days ago

Summary

Currently, the axtree content generated for retrieved websites incurs a huge number of tokens and significant cost. Maybe the combination of Playwright with BeautifulSoup below can save tokens, cost, and runtime?

Here's what Gemini Pro 1.5 gave as a suggestion (general example, not within OpenHands):


You're facing a common challenge when using LLMs with web scraping: the sheer volume of HTML can overwhelm the tokenizer and lead to unnecessary costs and slower processing. Extracting the main content while preserving interactive elements is the right approach. While a single perfect package might not exist, combining existing tools and techniques can achieve this.

Here's a breakdown of a strategy using Playwright and Beautiful Soup, along with explanations and code examples:

1. Extract Content with Playwright and Beautiful Soup:

Playwright is excellent for interacting with dynamic content, while Beautiful Soup excels at parsing HTML. We'll use them together.

from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
import re

async def extract_content(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)

        # Wait for dynamic content to load (adjust as needed)
        await page.wait_for_timeout(2000)  # crude; consider page.wait_for_load_state("networkidle") instead

        html = await page.content()
        await browser.close()

        soup = BeautifulSoup(html, "html.parser")

        # Extract main content (strategies below)
        main_content = extract_main_content(soup)

        # Preserve links, adding identifiers
        modified_content = process_links(main_content)

        return modified_content

def extract_main_content(soup):
    # Strategy 1:  Using common article tags
    article_content = soup.find("article")
    if article_content:
        return article_content

    # Strategy 2:  Largest text block (heuristic)
    text_blocks = [
        block for block in soup.find_all(string=True)  # `text=True` is deprecated in bs4
        if block.parent.name not in ("style", "script", "head", "[document]")
    ]
    largest_block = max(text_blocks, key=lambda block: len(block.strip()), default=None)
    if largest_block:
        return largest_block.parent  # Return the parent element

    # Strategy 3:  CSS Selectors (requires knowledge of the page structure)
    main_content = soup.select_one(".main-content") # Example selector
    if main_content:
        return main_content

    return soup.body # Fallback: return the whole body

def process_links(content):
    link_count = 0
    for link in content.find_all("a"):
        link_count += 1
        original_href = link.get("href")
        if original_href:
            link["data-original-href"] = original_href  # keep the real URL so the agent can still navigate
            link["href"] = f"[{link_count}]({original_href})"  # compact Markdown-style label for the LLM
            link["id"] = f"link-{link_count}"
    return str(content)

# Example usage (async)
import asyncio
async def main():
    content = await extract_content("https://www.example.com")
    print(content)

asyncio.run(main())


2. Pass to LLM:

Now you can pass the modified_content (which is a string of HTML with identifiable links) to your LLM. The reduced size and structured links will make processing significantly more efficient.
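As a rough sanity check before sending anything to the LLM, you can estimate the savings with a crude chars-per-token heuristic (this is an illustrative sketch, not the model's real tokenizer; use something like tiktoken for exact counts):

```python
def rough_token_count(text: str, chars_per_token: int = 4) -> int:
    """Crude token estimate: ~4 characters per English token on average."""
    return max(1, len(text) // chars_per_token)

# Hypothetical before/after payloads
raw_html = "<div class='nav'>menu</div>" * 500 + "<article>story</article>"
main_only = "<article>story</article>"

savings = 1 - rough_token_count(main_only) / rough_token_count(raw_html)
print(f"estimated token reduction: {savings:.0%}")
```

Even a rough estimate like this is enough to decide whether the extraction strategy is paying off on a given site.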

3. Agent Interaction:

When the LLM decides to "click" a link, it can refer to the link ID (e.g., "Click link 3"). Your agent can then parse this instruction, extract the corresponding URL from the original HTML (using the ID), and navigate to that URL using Playwright.
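That round trip can be sketched with a stdlib-only toy (the helper names are hypothetical, and a real implementation should use an HTML parser rather than a regex):

```python
import re

def annotate_links(html):
    """Tag each <a href="..."> with a sequential id and record an
    id -> URL map so an instruction like "Click link 2" can be resolved."""
    link_map = {}

    def repl(match):
        n = len(link_map) + 1
        link_map[n] = match.group(1)
        return f'<a id="link-{n}" href="{match.group(1)}">'

    annotated = re.sub(r'<a\s+href="([^"]+)">', repl, html)
    return annotated, link_map

def resolve_click(instruction, link_map):
    """Parse the LLM's 'Click link N' instruction back into the original URL."""
    m = re.search(r"link\s*#?\s*(\d+)", instruction, re.IGNORECASE)
    return link_map.get(int(m.group(1))) if m else None

html = '<p><a href="https://a.example/">A</a> and <a href="https://b.example/">B</a></p>'
annotated, links = annotate_links(html)
print(resolve_click("Click link 2", links))  # https://b.example/
```

The agent would then feed the resolved URL back into Playwright's `page.goto()` to perform the actual navigation.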

This combined approach addresses your token-limit issues while retaining the interactive elements your agent needs. Remember to tailor the content-extraction strategy to the specific websites you're working with for optimal results. Consider Readability-style libraries (e.g. readability-lxml) for more advanced content extraction, and experiment with different waiting strategies in Playwright to ensure all dynamic content has loaded before parsing.

mamoodi commented 2 days ago

@neubig there have been a few issues opened for the BrowserAgent. I know there's work being done on replacing it with a better agent. Just want to confirm that my understanding is correct and that it is being worked on?

neubig commented 2 days ago

That is correct! We're hoping to release our paper and code this week, and then we'll need to do some testing to make sure that we can get it working without regressions to other scores like SWE-bench. I'll update soon.