AlaskanTuna / Webscraping-Experiment


Heading to Paragraph Scraping Issue #1

Closed AlaskanTuna closed 1 month ago

AlaskanTuna commented 1 month ago

In certain situations where there are multiple paragraphs after a highlighted heading, only the first paragraph gets scraped.

Example: (two images, not preserved)

When scraping this website, the paragraph elements after the initial paragraph ("(Avoiding Spoilers)") can be assumed to be the actual useful data. However, 0.0.1's 102 code is only programmed to scrape the first paragraph element, so it overlooks the rest.

Fixing this will require extending the code (e.g. a wider detection range or explicit paragraph selection).

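        for header in chosen_headers:
            header_text = header.text.strip()
            content = []
            # Walk the siblings that follow the header until the next
            # header of the same tag (or the end of the parent element).
            sibling = header.find_next_sibling()
            while sibling and sibling.name != header.name:
                # Keep every paragraph, not just the first one.
                if sibling.name == 'p':
                    content.append(sibling.text.strip())
                sibling = sibling.find_next_sibling()
            # Join all collected paragraphs into one string per header.
            article_data[header_text] = ' '.join(content)

        articles_data.append(article_data)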
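
The loop above can be tried on its own with a small, made-up page. This is only a sketch: the `SAMPLE_HTML` markup and the `paragraphs_after` helper are hypothetical stand-ins, not part of the repository, and the real site's structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the page shown in the issue.
SAMPLE_HTML = """
<h2>Plot</h2>
<p>(Avoiding Spoilers)</p>
<p>First real paragraph.</p>
<div><p>Nested paragraph, not a direct sibling.</p></div>
<p>Second real paragraph.</p>
<h2>Cast</h2>
<p>Cast paragraph.</p>
"""


def paragraphs_after(header):
    """Return the text of every <p> sibling up to the next header of the same tag."""
    content = []
    sibling = header.find_next_sibling()
    while sibling is not None and sibling.name != header.name:
        if sibling.name == "p":
            content.append(sibling.get_text(strip=True))
        sibling = sibling.find_next_sibling()
    return content


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
article_data = {
    header.get_text(strip=True): " ".join(paragraphs_after(header))
    for header in soup.find_all("h2")
}
print(article_data)
```

Note that only direct `<p>` siblings are collected, so the paragraph nested inside the `<div>` is skipped. If the placeholder paragraph ("(Avoiding Spoilers)") should be dropped, slicing the returned list with `[1:]` would be one way to do it.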

AlaskanTuna commented 1 month ago

Scrape by <div class>, and only scrape the text.
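
A minimal sketch of what scraping by `<div class>` might look like with BeautifulSoup. The class name `article-body`, the sample markup, and the `text_from_divs` helper are assumptions made for illustration; the actual class used on the target site is not given in this issue.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page's class names are not known here.
SAMPLE_HTML = """
<div class="article-body">
  <h2>Plot</h2>
  <p>(Avoiding Spoilers)</p>
  <p>First <b>real</b> paragraph.</p>
  <script>var tracking = 1;</script>
</div>
<div class="sidebar"><p>Unrelated link list.</p></div>
"""


def text_from_divs(html, class_name):
    """Return the visible text of every <div> with the given class."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for div in soup.find_all("div", class_=class_name):
        # Drop tags whose contents are not readable text.
        for tag in div.find_all(["script", "style"]):
            tag.decompose()
        # Join the remaining text fragments with single spaces.
        texts.append(div.get_text(separator=" ", strip=True))
    return texts


print(text_from_divs(SAMPLE_HTML, "article-body"))
```

Selecting by container class avoids depending on how many paragraphs follow a heading, at the cost of also picking up the heading text and anything else inside the container.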