Open sebastian-nagel opened 7 years ago
Thanks @sebastian-nagel, this is very useful.
NPE
http://chestertontribune.com/rss.xml contains an item without a link, which causes the NPE:
<item>
<title>
http://chestertontribune.com/Sports/state_park_little_league_registr.htm
</title>
<pubDate>Tue, 19 Feb 2013 20:42:24 GMT</pubDate>
</item>
https://antarcticsun.usap.gov/resources/xml/antsun-continent.xml is a bit different in that it uses guid instead of link. I'll modify the code so that we take the guid in the absence of a link.
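The fallback described above can be illustrated with a minimal sketch. It is not the StormCrawler code (which is Java and uses Rome); it uses only Python's standard library, the function name `item_urls` and the sample feed are made up for illustration, and it assumes the guid actually holds a URL, which RSS does not guarantee.

```python
import xml.etree.ElementTree as ET


def item_urls(rss_xml):
    """Return one URL per <item>, preferring <link> and falling back to <guid>.

    Hypothetical helper for illustration only; assumes the guid is a URL.
    """
    root = ET.fromstring(rss_xml)
    urls = []
    for item in root.iter("item"):
        # findtext() returns None when the child element is absent,
        # so normalise to an empty string before stripping whitespace.
        link = (item.findtext("link") or "").strip()
        guid = (item.findtext("guid") or "").strip()
        url = link or guid
        if url:
            urls.append(url)
        # Items with neither link nor guid are skipped instead of raising.
    return urls


# Made-up sample feed covering the three cases discussed in this issue.
SAMPLE = """<rss version="2.0"><channel>
  <item><title>neither link nor guid</title></item>
  <item><guid>https://example.org/a</guid></item>
  <item><link>https://example.org/b</link><guid>ignored</guid></item>
</channel></rss>"""

print(item_urls(SAMPLE))
```

An item with only a title, like the one from chestertontribune.com, is skipped rather than triggering an error, while an item with only a guid still yields a URL.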
Fixed in https://github.com/DigitalPebble/storm-crawler/commit/cafaf3a052fed20dcde0e7f08336dca22d2e3cf5
Note: I just upgraded Rome-Tools to 1.7.0 in https://github.com/DigitalPebble/storm-crawler/commit/4832c9860fc8fc3e3a66b3f8e687bc81d60e5d8f
Alternatively, I'm thinking about using the sitemap parser (based on crawler-commons) to parse the feeds. It also exposes the important parts (URL and publication date). I'll try to evaluate both parsers on a larger test set.
As of today, 350 feeds fail to parse, most of them because the URL does not point to an RSS or Atom feed. However, 80-100 feeds fail with trivial errors which should not break a robust feed parser and mostly do not affect the extraction of links:
‘ or ú, etc.
This issue is used as an umbrella to track existing feed parser problems and address them step by step: