commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0

Improve feed parser robustness #13

Open sebastian-nagel opened 7 years ago

sebastian-nagel commented 7 years ago

As of today, 350 feeds fail to parse, most of them because the URL does not point to an RSS or Atom feed. However, 80-100 feeds fail with trivial errors that should not break a robust feed parser and mostly do not affect the extraction of links.

This issue serves as an umbrella to track existing feed parser problems and address them step by step.

jnioche commented 7 years ago

Thanks @sebastian-nagel this is very useful.

jnioche commented 7 years ago

NPE

http://chestertontribune.com/rss.xml contains an item without a link, which causes the NPE:

<item>
<title>
http://chestertontribune.com/Sports/state_park_little_league_registr.htm
</title>
<pubDate>Tue, 19 Feb 2013 20:42:24 GMT</pubDate>
</item> 

https://antarcticsun.usap.gov/resources/xml/antsun-continent.xml is a bit different in that it uses guid instead of link. I'll modify the code to fall back to the guid when the link is absent.
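If I recall correctly, Rome exposes the RSS `guid` via `SyndEntry.getUri()`, so the fix amounts to falling back from `getLink()` to `getUri()`. A self-contained sketch of the same fallback logic, with plain JDK DOM parsing standing in for Rome (the class and helper names here are mine, not storm-crawler's):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class FeedLinkExtractor {

    /** Extracts one outlink per item, preferring <link> and falling back to <guid>. */
    public static List<String> extractLinks(String rssXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));
        List<String> links = new ArrayList<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String link = firstText(item, "link");
            if (link == null) {
                // fall back to <guid> when <link> is absent
                link = firstText(item, "guid");
            }
            if (link != null) {
                // skip items with neither element instead of throwing an NPE
                links.add(link);
            }
        }
        return links;
    }

    private static String firstText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        if (nodes.getLength() == 0) return null;
        String text = nodes.item(0).getTextContent().trim();
        return text.isEmpty() ? null : text;
    }

    public static void main(String[] args) throws Exception {
        String rss = "<rss><channel>"
                + "<item><title>no link</title>"
                + "<guid>http://example.com/a.htm</guid></item>"
                + "<item><link>http://example.com/b.htm</link></item>"
                + "<item><title>neither</title></item>"
                + "</channel></rss>";
        System.out.println(extractLinks(rss)); // [http://example.com/a.htm, http://example.com/b.htm]
    }
}
```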

Fixed in https://github.com/DigitalPebble/storm-crawler/commit/cafaf3a052fed20dcde0e7f08336dca22d2e3cf5

jnioche commented 7 years ago

Note: just upgraded Rome Tools to 1.7.0 in https://github.com/DigitalPebble/storm-crawler/commit/4832c9860fc8fc3e3a66b3f8e687bc81d60e5d8f

sebastian-nagel commented 6 years ago

Alternatively, I'm thinking about using the sitemap parser (based on crawler-commons) to parse the feeds. The important parts (URL and publication date) are also made available by the sitemap parser. I'll try to evaluate both parsers on a larger test set.
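One practical wrinkle when comparing the two parsers on the same test set: RSS `pubDate` uses RFC 822/1123 dates, while sitemap `lastmod` uses W3C datetime (ISO 8601), so an evaluation harness has to normalize both to a common representation. A minimal sketch using only `java.time` (class name is hypothetical, not from either library):

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class FeedDates {

    /**
     * Normalizes either an RFC 1123 RSS pubDate (e.g. "Tue, 19 Feb 2013 20:42:24 GMT")
     * or a W3C datetime sitemap lastmod (e.g. "2013-02-19T20:42:24+00:00") to an Instant.
     */
    public static Instant parse(String date) {
        try {
            // RSS pubDate format
            return ZonedDateTime.parse(date, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant();
        } catch (DateTimeParseException e) {
            // sitemap lastmod format
            return ZonedDateTime.parse(date, DateTimeFormatter.ISO_OFFSET_DATE_TIME).toInstant();
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("Tue, 19 Feb 2013 20:42:24 GMT"));  // 2013-02-19T20:42:24Z
        System.out.println(parse("2013-02-19T20:42:24+00:00"));      // 2013-02-19T20:42:24Z
    }
}
```

With dates normalized this way, the two parsers' outputs can be diffed as plain (URL, Instant) pairs.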