[Bug]: rag_google_documentation.ipynb has issues in execution

rafiqhasan commented 2 months ago

File Name

/search/retrieval-augmented-generation/examples/rag_google_documentation.ipynb

What happened?

# Given a Google documentation URL, retrieve a list of all text chunks within h2 sections
def get_sections(url: str) -> list[str]:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")

    sections = []
    paragraphs = []

    body_div = soup.find("div", class_="devsite-article-body")
    for child in body_div.findChildren():
        if child.name == "p":
            paragraphs.append(child.get_text().strip())
        if child.name == "h2":
            sections.append(" ".join(paragraphs))
            break

    for header in soup.find_all("h2"):
        paragraphs = []
        nextNode = header.nextSibling
        while nextNode:
            if isinstance(nextNode, Tag):
                if nextNode.name in {"p", "ul"}:
                    paragraphs.append(nextNode.get_text().strip())
                elif nextNode.name == "h2":
                    sections.append(" ".join(paragraphs))
                    break
            nextNode = nextNode.nextSibling
    return sections

This function needs to handle pages that have no h2 element or no div with the devsite-article-body class. Currently, soup.find() returns None when that div is missing from the page source, so the line for child in body_div.findChildren(): raises an AttributeError.
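One possible shape for a fix, as a minimal sketch rather than the notebook author's intended change. The helper name sections_from_html is hypothetical, and falling back to the whole document when the devsite-article-body div is missing is an assumption about the desired behavior. The sketch also keeps the text after the last h2, which the loop in the current function drops because it only appends a section when it reaches a following h2.

```python
from bs4 import BeautifulSoup, Tag


def sections_from_html(html: str) -> list[str]:
    """Return the intro text and the text of each h2 section; [] if nothing is found.

    Hypothetical helper: takes HTML directly so it can be tried without a network call.
    """
    soup = BeautifulSoup(html, "html.parser")

    # soup.find() returns None when the div is absent, so guard before using it.
    # Assumption: fall back to the whole document in that case.
    body = soup.find("div", class_="devsite-article-body")
    if body is None:
        body = soup

    sections = []

    # Intro: paragraphs before the first h2 (all paragraphs when there is no h2).
    intro = []
    for child in body.find_all(["p", "h2"]):
        if child.name == "h2":
            break
        intro.append(child.get_text().strip())
    sections.append(" ".join(intro))

    # One section per h2: collect sibling p/ul text until the next h2 or the end.
    for header in body.find_all("h2"):
        parts = []
        for node in header.next_siblings:
            if not isinstance(node, Tag):
                continue
            if node.name == "h2":
                break
            if node.name in {"p", "ul"}:
                parts.append(node.get_text().strip())
        sections.append(" ".join(parts))

    # Drop empty sections so callers do not need to filter them.
    return [s for s in sections if s]
```

In the notebook, get_sections(url) could then pass requests.get(url).text to this helper, so a page without the expected div or without any h2 yields a shorter (possibly empty) list instead of raising.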

Relevant log output

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-6-440b9131ebc9> in <cell line: 1>()
----> 1 all_text = [t for url in URLS for t in get_sections(url) if t]

1 frames
<ipython-input-6-440b9131ebc9> in <listcomp>(.0)
----> 1 all_text = [t for url in URLS for t in get_sections(url) if t]

<ipython-input-5-73e0f3cdcce1> in get_sections(url)
      8 
      9     body_div = soup.find("div", class_="devsite-article-body")
---> 10     for child in body_div.findChildren():
     11         if child.name == "p":
     12             paragraphs.append(child.get_text().strip())

AttributeError: 'NoneType' object has no attribute 'findChildren'

CC: @holtskinner

holtskinner commented 1 month ago

@grivescorbett is the creator of this notebook.

holtskinner commented 1 month ago

Possible improvement to be made to this notebook:

The Document AI Layout Parser can handle HTML pages. This could be a way to extract paragraphs, titles, and similar structure without manual HTML parsing.

GoogleCloudPlatform / generative-ai