Closed dobbersc closed 11 months ago
Are we able to filter article hubs with #184 now?
I will investigate this, expect to hear back from me soon.
Yeah, you can do this now:
OccupyDemocrats = PublisherSpec(
    domain="https://occupydemocrats.com/",
    sitemaps=["https://occupydemocrats.com/sitemap.xml"],
    parser=OccupyDemocratsParser,
    article_classifier=lambda url, html: regex_classifier("tag|sitemap")(url),
)
Notice the missing not before the classifier; this is not handled consistently at the moment. This code skips all URLs matching the rule. Unfortunately, it seems the tag URLs come first, so they are skipped only after their HTML has been downloaded, each time the parser runs. You may decide for yourself whether this bothers you. I will leave the issue open for the moment.
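To illustrate the difference, here is a minimal sketch of filtering at the URL level, so that matching entries are dropped before any HTML is requested. The `regex_classifier` helper below is a stand-in I wrote for this example, not necessarily the library's implementation:

```python
import re
from typing import Callable

def regex_classifier(pattern: str) -> Callable[[str], bool]:
    """Stand-in for the helper used above: returns a predicate that is
    True when the URL matches the pattern (i.e. should be skipped)."""
    compiled = re.compile(pattern)
    return lambda url: bool(compiled.search(url))

skip_url = regex_classifier("tag|sitemap")

urls = [
    "https://occupydemocrats.com/tag/zoe-lofgren/",       # tag page, skip
    "https://occupydemocrats.com/2023/04/some-article/",  # real article, keep
]

# Filtering here means no HTTP request is ever made for the skipped entries,
# instead of downloading each page first and discarding it afterwards.
articles = [url for url in urls if not skip_url(url)]
```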
Skipping all these URLs really is a pain: the code runs for several minutes just skipping them. This is probably something @MaxDall has to get involved in.
Closed with #247
I've encountered a problem with the sitemap of Occupy Democrats.
They provide a sitemap, but a very large portion of the sub-sitemaps, e.g. sitemap-tax-post_tag-1227.xml, lead to non-XML article listings, e.g. https://occupydemocrats.com/tag/zoe-lofgren/.
At the bottom of the sitemap there are also sub-sitemaps with standard articles that we should scrape, e.g. https://occupydemocrats.com/sitemap-pt-post-p1-2023-04.xml. Is the "solution" again to only return articles that are fully extracted? I feel like this is a much deeper-rooted problem with our scraper, and "only return fully extracted articles" is a tiny band-aid requiring intervention from the user.
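One possible mitigation, sketched below under the assumption that the tag-taxonomy sub-sitemaps follow the sitemap-tax-post_tag-*.xml naming seen above, would be to drop them by name before fetching, so only the post sitemaps are crawled at all. The pattern is an assumption based on the two example names, not a confirmed convention:

```python
import re

# Assumed naming pattern for tag-taxonomy sub-sitemaps (hypothetical,
# inferred from the example sitemap-tax-post_tag-1227.xml above).
TAG_SITEMAP = re.compile(r"sitemap-tax-post_tag-\d+\.xml$")

sub_sitemaps = [
    "https://occupydemocrats.com/sitemap-tax-post_tag-1227.xml",
    "https://occupydemocrats.com/sitemap-pt-post-p1-2023-04.xml",
]

# Keep only sub-sitemaps that do not match the tag-taxonomy pattern;
# these are never fetched, avoiding the non-XML tag pages entirely.
post_sitemaps = [s for s in sub_sitemaps if not TAG_SITEMAP.search(s)]
```

This filters at the sitemap-index level rather than per article, which would sidestep the "only return fully extracted articles" band-aid for this publisher, though it needs a per-publisher pattern.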