Chaiavi opened this issue 5 years ago
Good catch: SitemapTester and also the recursive parsing method SiteMapParser.walkSiteMap(...) do not catch exceptions if some of the sub-sitemaps of a sitemap index fail for some reason. https://www.google.com/adwords/intl/ALL_ar/adwords/sitemap.xml isn't a sitemap but an HTML page, so the failure is expected.
+1 to catch exceptions and log them instead. If anybody needs a validating parser which quits on the first error, it can be easily implemented. But we might also provide an optional argument to control the behavior on errors. Any thoughts?
Just to clarify the scenario: this specific scenario attempts to parse the AdWords main sitemap, which is a legitimate sitemap index containing links to other, broken sitemaps like this one.
I am +1 on catching the internal exception and logging it
About the validating parser option: if we would like to implement it, can we use the Strict or AllowPartial flags?
The option strict has a clearly different meaning, see #267. The option allowPartial may fit but we could also define a new one with the defined semantics to keep going when parsing sitemap indexes recursively.
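A new option with the "keep going on sub-sitemap errors" semantics might look like the sketch below. This is only an illustration, not the existing crawler-commons API: the flag name `continueOnSubSitemapError` and both classes are invented here, and the parse step is simulated.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical config flag (name invented here) controlling behavior when one
// sub-sitemap of a sitemap index fails to parse.
class RecursiveWalkConfig {
    // false = validating behavior: stop at the first broken sub-sitemap
    // true  = crawler behavior: skip it and continue with the remaining ones
    final boolean continueOnSubSitemapError;

    RecursiveWalkConfig(boolean continueOnSubSitemapError) {
        this.continueOnSubSitemapError = continueOnSubSitemapError;
    }
}

class RecursiveWalk {
    // Simulated walk over the entries of a sitemap index; entries starting
    // with "<html" stand in for HTML pages served in place of sitemaps.
    static List<String> walk(List<String> entries, RecursiveWalkConfig cfg) {
        List<String> parsed = new ArrayList<>();
        for (String entry : entries) {
            if (entry.startsWith("<html")) {            // not a sitemap
                if (!cfg.continueOnSubSitemapError) {
                    throw new IllegalArgumentException("not a sitemap: " + entry);
                }
                continue;                                // log-and-skip behavior
            }
            parsed.add(entry);
        }
        return parsed;
    }
}
```

With the flag set, the AdWords scenario above would yield the good sub-sitemaps and silently skip the HTML page; without it, the walk fails fast as a validating parser would.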
From an implementation standpoint, should it be the responsibility of the user of SiteMapParser to catch and handle exceptions? (i.e. the SiteMapTester class logs the exception and continues instead of terminating)
@evanhalley: Yes, in principle I agree that users should handle the exceptions.
However, in the current SiteMapParser implementation all kinds of exceptions are caught, eventually logged and then ignored. The parser is tuned to keep going and not to fail on the first invalid XML tag or entity, malformed URL, lastmod date, priority, etc. That makes sense because a web crawler wants as many URLs as possible.
In order to keep going, the parseSiteMap method cannot throw the exception, because it would lose its internal state that way. But maybe it's a good idea to allow users to register an exception handler? And we could provide one (default, same behavior as now) to ignore format errors if it's possible to continue, and one to report or fail on any error (to implement a validating parser). Any thoughts?
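One way to sketch such a pluggable handler (all names below are hypothetical, not the existing crawler-commons API): the parser funnels every recoverable error through a registered callback, and the two shipped implementations either log-and-continue or fail fast.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical callback invoked by the parser for each recoverable error.
interface ParseErrorHandler {
    void onError(String url, Exception e);
}

// Default behavior: record/log the error and keep parsing (as today).
class LoggingErrorHandler implements ParseErrorHandler {
    final List<String> logged = new ArrayList<>();
    public void onError(String url, Exception e) {
        logged.add(url + ": " + e.getMessage());
    }
}

// Validating behavior: fail fast on the first error.
class StrictErrorHandler implements ParseErrorHandler {
    public void onError(String url, Exception e) {
        throw new RuntimeException("invalid sitemap " + url, e);
    }
}

class HandlerDemo {
    // Simulated parse of sitemap-index entries; "bad" entries trigger the
    // handler instead of aborting the loop, so internal state is preserved.
    static int parseIndex(List<String> urls, ParseErrorHandler handler) {
        int parsed = 0;
        for (String url : urls) {
            if (url.contains("bad")) {
                handler.onError(url, new IllegalArgumentException("not XML"));
            } else {
                parsed++;
            }
        }
        return parsed;
    }
}
```

The key design point is that the parser itself never throws: deciding whether an error is fatal is delegated to the handler, so a validating parser is just the strict handler plugged into the same loop.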
> the SiteMapTester class logs the exception and continues
Yes, and strictly speaking the SiteMapTester is a user of the API (with #197 it would be moved to the tools module). It'd need to catch the exception because it calls the parser in a loop for each (sub)sitemap of a sitemap index. The method SiteMapParser.walkSiteMap(...) is part of the API.
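As a user of the API, SiteMapTester can get the log-and-continue behavior today by wrapping each recursive parse call in its own try/catch. A minimal sketch (the parse step is simulated here, not the real SiteMapParser call, and the class name is invented):

```java
import java.util.ArrayList;
import java.util.List;

class TesterLoop {
    // Caller-side handling: one try/catch per (sub)sitemap, so a single bad
    // entry cannot abort the whole walk over the sitemap index.
    static List<String> walkIndex(List<String> subSitemaps) {
        List<String> ok = new ArrayList<>();
        for (String url : subSitemaps) {
            try {
                ok.add(parseOne(url));  // would be the SiteMapParser call
            } catch (Exception e) {
                // log and continue with the next sub-sitemap
                System.err.println("skipping " + url + ": " + e.getMessage());
            }
        }
        return ok;
    }

    // Stand-in for the real parse; URLs ending in ".html" simulate HTML
    // pages served in place of sitemaps, as in the AdWords case above.
    static String parseOne(String url) {
        if (url.endsWith(".html")) {
            throw new IllegalArgumentException("HTML page, not a sitemap");
        }
        return url;
    }
}
```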
> In order to keep going the parseSiteMap method cannot throw the exception, losing its internal state that way. But maybe it's a good idea to allow users to register an exception handler? And we could provide one (default, same behavior as now) to ignore format errors if it's possible to continue, and one to report or fail on any error (to implement a validating parser). Any thoughts?
I like the idea. In crawler4j we are doing something similar (although we do not use the concept of an exception handler): we just provide some (protected) hook-up methods like onUnhandledException(...) and onContentFetchException(...) to enable the user to do something with the exceptions thrown.

So I am +1 here :)
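The hook-method style from crawler4j can be sketched like this. The classes below only illustrate the pattern (they are not crawler4j or crawler-commons code); `onUnhandledException` mirrors the hook name mentioned above, but its signature here is invented:

```java
// Template-method style: subclasses override a protected hook to decide what
// happens on an error, instead of registering a separate handler object.
class BaseWalker {
    // Hook with a safe default; a subclass can log, count, or rethrow.
    protected void onUnhandledException(String url, Exception e) {
        // default: ignore and continue
    }

    public int walk(java.util.List<String> urls) {
        int parsed = 0;
        for (String url : urls) {
            try {
                if (url.endsWith(".html")) {   // simulated parse failure
                    throw new IllegalArgumentException("not a sitemap");
                }
                parsed++;
            } catch (Exception e) {
                onUnhandledException(url, e);
            }
        }
        return parsed;
    }
}

// Example subclass: counts failures instead of ignoring them.
class CountingWalker extends BaseWalker {
    int failures = 0;
    @Override protected void onUnhandledException(String url, Exception e) {
        failures++;
    }
}
```

Compared to a registered handler object, hooks require subclassing the parser, but they keep the error-handling contract visible in the class itself; either shape would support both the lenient default and a validating variant.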
Ran the "Main" of SiteMapTester with the argument https://www.google.com/sitemap.xml. Our recursive sitemap tester fails with the following exception: