commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0

NewsSiteMapParserBolt: do not detect feeds as sitemaps #35

Open sebastian-nagel opened 4 years ago

sebastian-nagel commented 4 years ago

If a news feed declares the sitemaps namespace, it is erroneously detected as a sitemap. As a result, it is processed as a sitemap (without being parsed properly) rather than as a feed. One example feed:

<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?>
<?xml-stylesheet type="text/css" media="screen" href="http://feeds.drudge.com/~d/styles/itemcontent.css"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sitemap="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:wordzilla="http://www.cadenhead.org/workbench/wordzilla/namespace" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">