commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0

Improve feed parser robustness #13

Open sebastian-nagel opened 7 years ago

sebastian-nagel commented 7 years ago

As of today, 350 feeds fail to parse, most of them because the URL does not point to an RSS or Atom feed. However, 80-100 feeds fail with trivial errors that should not break a robust feed parser and mostly do not affect the extraction of links.

This issue serves as an umbrella to track existing feed parser problems and address them step by step.

jnioche commented 7 years ago

Thanks @sebastian-nagel this is very useful.

jnioche commented 7 years ago

NPE

http://chestertontribune.com/rss.xml contains an item without a link, which causes the NPE:

<item>
<title>
http://chestertontribune.com/Sports/state_park_little_league_registr.htm
</title>
<pubDate>Tue, 19 Feb 2013 20:42:24 GMT</pubDate>
</item> 

https://antarcticsun.usap.gov/resources/xml/antsun-continent.xml is a bit different in that it uses guid instead of link. I'll modify the code to fall back to the guid when the link is absent.
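If I recall correctly, Rome exposes the RSS `guid` via `SyndEntry.getUri()`, so the fix amounts to falling back from `getLink()` to `getUri()`. A self-contained sketch of the same fallback logic, with plain JDK DOM parsing standing in for Rome (the class and helper names here are mine, not storm-crawler's):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class FeedLinkExtractor {

    /** Extracts one outlink per item, preferring <link> and falling back to <guid>. */
    public static List<String> extractLinks(String rssXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));
        List<String> links = new ArrayList<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String link = firstText(item, "link");
            if (link == null) {
                // fall back to <guid> when <link> is absent
                link = firstText(item, "guid");
            }
            if (link != null) {
                // skip items with neither element instead of throwing an NPE
                links.add(link);
            }
        }
        return links;
    }

    private static String firstText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        if (nodes.getLength() == 0) return null;
        String text = nodes.item(0).getTextContent().trim();
        return text.isEmpty() ? null : text;
    }

    public static void main(String[] args) throws Exception {
        String rss = "<rss><channel>"
                + "<item><title>no link</title>"
                + "<guid>http://example.com/a.htm</guid></item>"
                + "<item><link>http://example.com/b.htm</link></item>"
                + "<item><title>neither</title></item>"
                + "</channel></rss>";
        System.out.println(extractLinks(rss)); // [http://example.com/a.htm, http://example.com/b.htm]
    }
}
```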

Fixed in https://github.com/DigitalPebble/storm-crawler/commit/cafaf3a052fed20dcde0e7f08336dca22d2e3cf5

jnioche commented 7 years ago

Note: just upgraded Rome Tools to 1.7.0 in https://github.com/DigitalPebble/storm-crawler/commit/4832c9860fc8fc3e3a66b3f8e687bc81d60e5d8f

sebastian-nagel commented 6 years ago

Alternatively, I'm thinking about using the sitemap parser (based on crawler-commons) to parse the feeds. The important parts (URL and publication date) are also made available by the sitemap parser. I'll try to evaluate both parsers on a larger test set.
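One practical wrinkle when comparing the two parsers on the same test set: RSS `pubDate` uses RFC 822/1123 dates, while sitemap `lastmod` uses W3C datetime (ISO 8601), so an evaluation harness has to normalize both to a common representation. A minimal sketch using only `java.time` (class name is hypothetical, not from either library):

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class FeedDates {

    /**
     * Normalizes either an RFC 1123 RSS pubDate (e.g. "Tue, 19 Feb 2013 20:42:24 GMT")
     * or a W3C datetime sitemap lastmod (e.g. "2013-02-19T20:42:24+00:00") to an Instant.
     */
    public static Instant parse(String date) {
        try {
            // RSS pubDate format
            return ZonedDateTime.parse(date, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant();
        } catch (DateTimeParseException e) {
            // sitemap lastmod format
            return ZonedDateTime.parse(date, DateTimeFormatter.ISO_OFFSET_DATE_TIME).toInstant();
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("Tue, 19 Feb 2013 20:42:24 GMT"));  // 2013-02-19T20:42:24Z
        System.out.println(parse("2013-02-19T20:42:24+00:00"));      // 2013-02-19T20:42:24Z
    }
}
```

With dates normalized this way, the two parsers' outputs can be diffed as plain (URL, Instant) pairs.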