crawler-commons

A set of reusable Java components that implement functionality common to any web crawler
Apache License 2.0

Implement behavior on parsing of a SitemapIndex with a bad sitemap link in it #259

Open Chaiavi opened 5 years ago

Chaiavi commented 5 years ago

Ran the "Main" of SiteMapTester with the argument of: https://www.google.com/sitemap.xml

Our recursive sitemap tester fails with the following exception:

[main] INFO crawlercommons.sitemaps.SiteMapTester - Parsing https://www.google.com/adwords/intl/ALL_ar/adwords/sitemap.xml 
Exception in thread "main" crawlercommons.sitemaps.UnknownFormatException: Failed to detect MediaType of sitemap 'https://www.google.com/adwords/intl/ALL_ar/adwords/sitemap.xml'
    at crawlercommons.sitemaps.SiteMapParser.parseSiteMap(SiteMapParser.java:269)
    at crawlercommons.sitemaps.SiteMapTester.parse(SiteMapTester.java:77)
    at crawlercommons.sitemaps.SiteMapTester.parse(SiteMapTester.java:85)
    at crawlercommons.sitemaps.SiteMapTester.parse(SiteMapTester.java:85)
    at crawlercommons.sitemaps.SiteMapTester.main(SiteMapTester.java:53)
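
For reference, the failing path boils down to roughly the following (a minimal sketch, assuming parseSiteMap(URL) fetches and parses the document itself):

```java
import java.net.URL;

import crawlercommons.sitemaps.AbstractSiteMap;
import crawlercommons.sitemaps.SiteMapIndex;
import crawlercommons.sitemaps.SiteMapParser;

public class SiteMapWalkRepro {
    public static void main(String[] args) throws Exception {
        SiteMapParser parser = new SiteMapParser();
        // Parsing the index itself succeeds ...
        AbstractSiteMap root = parser.parseSiteMap(new URL("https://www.google.com/sitemap.xml"));
        if (root instanceof SiteMapIndex) {
            for (AbstractSiteMap sub : ((SiteMapIndex) root).getSitemaps()) {
                // ... but the first linked "sitemap" that is really an HTML page
                // throws UnknownFormatException and aborts the whole walk.
                parser.parseSiteMap(sub.getUrl());
            }
        }
    }
}
```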
sebastian-nagel commented 5 years ago

Good catch: SiteMapTester and also the recursive parsing method SiteMapParser.walkSiteMap(...) do not catch exceptions when some of the sub-sitemaps of a sitemap index fail for some reason. https://www.google.com/adwords/intl/ALL_ar/adwords/sitemap.xml isn't a sitemap but an HTML page, so the failure is expected.

+1 to catch exceptions and log them instead. If anybody needs a validating parser that quits on the first error, that can easily be implemented. But we might also provide an optional argument to control the behavior on errors. Any thoughts?
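
Roughly what I have in mind, sketched outside of SiteMapParser (a minimal sketch; the strictOnError flag is invented here, nothing like it exists yet):

```java
import java.io.IOException;
import java.net.URL;
import java.util.function.Consumer;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import crawlercommons.sitemaps.AbstractSiteMap;
import crawlercommons.sitemaps.SiteMap;
import crawlercommons.sitemaps.SiteMapIndex;
import crawlercommons.sitemaps.SiteMapParser;
import crawlercommons.sitemaps.SiteMapURL;
import crawlercommons.sitemaps.UnknownFormatException;

public class TolerantSiteMapWalker {
    private static final Logger LOG = LoggerFactory.getLogger(TolerantSiteMapWalker.class);

    private final SiteMapParser parser = new SiteMapParser();
    private final boolean strictOnError; // invented option: fail vs. log-and-continue

    public TolerantSiteMapWalker(boolean strictOnError) {
        this.strictOnError = strictOnError;
    }

    public void walk(URL url, Consumer<SiteMapURL> action) throws UnknownFormatException, IOException {
        AbstractSiteMap sm = parser.parseSiteMap(url);
        if (sm instanceof SiteMapIndex) {
            for (AbstractSiteMap sub : ((SiteMapIndex) sm).getSitemaps()) {
                try {
                    walk(sub.getUrl(), action); // recurse into sub-sitemaps
                } catch (UnknownFormatException | IOException e) {
                    if (strictOnError) {
                        throw e; // validating mode: quit on the first bad sub-sitemap
                    }
                    LOG.warn("Skipping invalid sub-sitemap {}: {}", sub.getUrl(), e.toString());
                }
            }
        } else {
            ((SiteMap) sm).getSiteMapUrls().forEach(action);
        }
    }
}
```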

Chaiavi commented 5 years ago

Just to clarify the scenario:

This specific scenario attempts to parse the AdWords main sitemap, which is a legitimate sitemap index but contains links to bad sitemaps like the one above.

I am +1 on catching the internal exception and logging it

Regarding the validating parser option: if we want to implement it, can we reuse the strict or allowPartial flags?

sebastian-nagel commented 5 years ago

The option strict has a clearly different meaning, see #267. The option allowPartial may fit, but we could also define a new option with well-defined semantics for continuing when parsing sitemap indexes recursively.
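
For the record, this is how I understand the two existing flags (hedged, going by the current constructors; neither covers recursion over an index):

```java
import crawlercommons.sitemaps.SiteMapParser;

public class ParserFlags {
    public static void main(String[] args) {
        // strict = false: don't reject URLs that lie outside the sitemap's own URL prefix
        SiteMapParser lenient = new SiteMapParser(false);
        // allowPartial = true: return the URLs parsed so far from a single truncated
        // or malformed document instead of failing it entirely
        SiteMapParser partial = new SiteMapParser(true, true);
        // Neither flag says anything about errors in *other* sub-sitemaps of an index.
    }
}
```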

evanhalley commented 2 years ago

From an implementation standpoint, should it be the responsibility of the user of SiteMapParser to catch and handle exceptions? (i.e., the SiteMapTester class logs the exception and continues instead of terminating)

sebastian-nagel commented 2 years ago

@evanhalley: Yes, in principle I agree that users should handle the exceptions.

However, in the current SiteMapParser implementation all kinds of exceptions are caught, possibly logged, and then ignored. The parser is tuned to keep going and not to fail at the first invalid XML tag or entity, malformed URL, lastmod date, priority, etc. That makes sense because a web crawler wants as many URLs as possible.

In order to keep going, the parseSiteMap method cannot throw the exception, because it would lose its internal state that way. But maybe it's a good idea to allow users to register an exception handler? We could provide one default handler (same behavior as now: ignore format errors if it's possible to continue) and one that reports or fails on any error (to implement a validating parser). Any thoughts?
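
A rough sketch of what registering such a handler could look like (all names invented, nothing here exists in the API yet):

```java
import java.net.URL;

import crawlercommons.sitemaps.UnknownFormatException;

/** Hypothetical callback invoked by the parser instead of throwing directly. */
public interface SiteMapErrorHandler {
    /** Called on each recoverable format error; implementations may rethrow. */
    void handleError(URL sitemapUrl, UnknownFormatException e) throws UnknownFormatException;
}

// Two handlers we could ship (again, invented names):
//   lenient (default, current behavior):  (url, e) -> LOG.warn("Ignoring {}: {}", url, e)
//   validating:                           (url, e) -> { throw e; }
```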

the SiteMapTester class logs the exception and continues

Yes, and strictly speaking the SiteMapTester is a user of the API (with #197 it would be moved to the tools module). It needs to catch the exception because it calls the parser in a loop for each (sub-)sitemap of a sitemap index. The method SiteMapParser.walkSiteMap(...) is part of the API.

rzo1 commented 2 years ago

In order to keep going, the parseSiteMap method cannot throw the exception, because it would lose its internal state that way. But maybe it's a good idea to allow users to register an exception handler? We could provide one default handler (same behavior as now: ignore format errors if it's possible to continue) and one that reports or fails on any error (to implement a validating parser). Any thoughts?

I like the idea. In crawler4j, we do something similar (although we don't use the concept of an exception handler): we provide some (protected) hook methods like onUnhandledException or onContentFetchException(...) to enable the user to do something with the exceptions thrown.
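
The pattern looks roughly like this (a generic sketch of the idea, not crawler4j's actual classes):

```java
import java.net.URL;

import crawlercommons.sitemaps.UnknownFormatException;

/** Generic sketch of the protected hook-method pattern. */
public abstract class AbstractSiteMapProcessor {

    /** Subclasses override this to log, collect, or rethrow; the default keeps going. */
    protected void onUnparsableSiteMap(URL url, UnknownFormatException e) {
        // default: ignore the broken sub-sitemap and continue with the next one
    }

    protected final void process(URL url) {
        try {
            parse(url); // delegate to the actual parsing logic
        } catch (UnknownFormatException e) {
            onUnparsableSiteMap(url, e); // hook instead of a hard failure
        }
    }

    protected abstract void parse(URL url) throws UnknownFormatException;
}
```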

So I am +1 here :)