flairNLP / fundus

A very simple news crawler with a funny name
MIT License

Scraping "Occupy Democrats" over Sitemap #178

Closed dobbersc closed 11 months ago

dobbersc commented 1 year ago

I've encountered a problem with the sitemap of Occupy Democrats.

They support a sitemap, but a very large portion of the sub-sitemaps, e.g. sitemap-tax-post_tag-1227.xml, lead to non-XML article hub pages, e.g. https://occupydemocrats.com/tag/zoe-lofgren/.

At the bottom of the sitemap there are also standard article sub-sitemaps that we should scrape, e.g. https://occupydemocrats.com/sitemap-pt-post-p1-2023-04.xml. Is the "solution" again to only return articles that are fully extracted? I feel this is a much deeper-rooted problem with our scraper, and "only return fully extracted articles" is a tiny band-aid that requires intervention from the user.

dobbersc commented 1 year ago

Are we able to filter article hubs with #184 now?

Weyaaron commented 1 year ago

I will investigate this, expect to hear back from me soon.

Weyaaron commented 1 year ago

Yeah, you can do this now:

    OccupyDemocrats = PublisherSpec(
        domain="https://occupydemocrats.com/",
        sitemaps=["https://occupydemocrats.com/sitemap.xml"],
        parser=OccupyDemocratsParser,
        article_classifier=lambda url, html: regex_classifier("tag|sitemap")(url),
    )

Notice the missing not before the classifier; this is not handled consistently at the moment. This code skips all URLs matching the rule. Unfortunately, it seems like the URLs with tags come first, so they are only skipped after their HTML has been downloaded each time the parser runs. You may decide for yourself whether this bothers you. I will leave the issue open for the moment.
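For context, a minimal sketch of what a regex-based URL classifier like regex_classifier could look like — this is an illustrative stand-in, not the actual fundus implementation:

```python
import re
from typing import Callable


def regex_classifier(pattern: str) -> Callable[[str], bool]:
    """Return a predicate that is True when the pattern matches the URL."""
    compiled = re.compile(pattern)
    return lambda url: bool(compiled.search(url))


# Matches the usage in the PublisherSpec above: skip tag pages and sitemaps.
skip = regex_classifier("tag|sitemap")
```

A URL like https://occupydemocrats.com/tag/zoe-lofgren/ would match and be skipped, while a regular article URL would pass through.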

Weyaaron commented 1 year ago

Skipping all these URLs really is a pain: the code runs for several minutes just skipping them. This is probably something @MaxDall has to get involved in.
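The cost described here comes from filtering after download. A hypothetical fix would apply the same regex while walking the sitemap, before any HTML is fetched — the names below are illustrative, not fundus API:

```python
import re
from typing import Iterable, Iterator

# Same skip rule as the classifier above.
SKIP_PATTERN = re.compile(r"tag|sitemap")


def filter_article_urls(urls: Iterable[str]) -> Iterator[str]:
    """Yield only URLs that do not match the skip pattern.

    Applied to sitemap entries before download, so skipped URLs
    cost a regex check instead of an HTTP request.
    """
    for url in urls:
        if not SKIP_PATTERN.search(url):
            yield url


candidates = [
    "https://occupydemocrats.com/tag/zoe-lofgren/",
    "https://occupydemocrats.com/2023/04/some-article/",
]
articles = list(filter_article_urls(candidates))
```

This turns each skipped URL into a cheap string check rather than a full download-then-discard cycle.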

MaxDall commented 11 months ago

Closed with #247