PR-Pilot-AI / pr-pilot

An AI agent for your development workflow that can search and manipulate the code base, browse the internet, and interact with GitHub issues and pull requests
https://www.pr-pilot.ai
GNU General Public License v3.0

🐛 Fix IndexError in scrape_website function #201

Closed pr-pilot-ai[bot] closed 2 months ago

pr-pilot-ai[bot] commented 2 months ago

I analyzed the IndexError: list index out of range raised in the scrape_website function and applied a fix. The function accessed docs_transformed[0].page_content without first checking that docs_transformed contained a document or that its content was non-empty. I added a check that docs_transformed is non-empty and that docs_transformed[0].page_content is non-empty before slicing it.

The updated scrape_website function now looks like this:

@tool
def scrape_website(url: str):
    """Scrape a website."""
    TaskEvent.add(actor="Darwin", action="scrape", target=url, transaction="begin")
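    # Load the raw HTML for the single URL and convert it to plain text.
    urls = [url]
    loader = AsyncHtmlLoader(urls)
    docs = loader.load()
    html2text = Html2TextTransformer()
    docs_transformed = html2text.transform_documents(docs)

    # Guard against an empty result list or empty content before slicing.
    if docs_transformed and docs_transformed[0].page_content:
        # Return at most the first 700 characters of the scraped text.
        return docs_transformed[0].page_content[0:700]
    else:
        logger.error(f"Failed to scrape content from {url}")
        return "Failed to scrape content"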

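With this guard, the function slices docs_transformed[0].page_content[0:700] only when docs_transformed contains a document with non-empty content. Otherwise it logs the failure and returns an error message instead of raising the IndexError.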
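To make it concrete which expression the guard protects, here is a minimal sketch that runs without LangChain. The `Doc` class and the `first_chunk` and `unguarded_raises` helpers are hypothetical stand-ins for illustration, not code from the repository.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a document with a page_content string."""
    page_content: str


def first_chunk(docs, limit=700):
    """Return up to `limit` characters of the first document, or None."""
    # An empty list is falsy, so docs[0] is evaluated only when a document exists.
    if docs and docs[0].page_content:
        return docs[0].page_content[:limit]
    return None


def unguarded_raises(docs):
    """Report whether indexing docs[0] without a guard raises IndexError."""
    try:
        docs[0]
    except IndexError:
        return True
    return False
```

In Python, slicing an empty string (`""[0:700]`) returns an empty string rather than raising, so in this sketch the IndexError comes from indexing an empty list, which is the case the `docs and ...` part of the condition short-circuits.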

The changes have been committed to the codebase.

https://github.com/PR-Pilot-AI/pr-pilot/issues/198