niels closed this issue 8 years ago
Since the sitemaps are not under the root path, I also tried adding the following, but setting lenient="true"
did not fix the issue:
<sitemapResolverFactory
    class="com.norconex.collector.http.sitemap.impl.StandardSitemapResolverFactory"
    ignore="false"
    lenient="true"
/>
I have not had a chance to give it a try yet, but in the meantime, have you tried changing the log level to DEBUG for the relevant classes in the classes/log4j.properties
file? It may give you more insight.
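For instance, using log4j's standard per-logger syntax, raising verbosity for the sitemap-related classes could look like the following (the package name is inferred from the class names discussed in this thread, not taken from the stock configuration):

```properties
# Enable DEBUG output only for the sitemap resolver classes
log4j.logger.com.norconex.collector.http.sitemap=DEBUG
```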
The debug messages were what pointed me towards the path issue. Sorry for not including them in my original report! A sample message would be:
DEBUG [StandardSitemapResolver] Sitemap URL invalid for location directory. URL:http://www.mascus.com/adarchive/agriculture/m/75 Location directory: http://www.mascus.com/googleXML/sitemaps/com
Setting the lenient
option seems to have no effect in both v2.3.0 and the latest snapshot. For your reference, here is the complete config:
<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="sitemap-test">
  <crawlers>
    <crawler id="sitemap-test-crawler">
      <sitemapResolverFactory ignore="false" lenient="true" class="com.norconex.collector.http.sitemap.impl.StandardSitemapResolverFactory" />
      <startURLs>
        <sitemap>http://www.mascus.com/sitemap_index_com.xml</sitemap>
      </startURLs>
    </crawler>
  </crawlers>
</httpcollector>
Note that setting ignore="true"
does have the intended effect (i.e., it stops the sitemap from being pulled in at all), so I think I added the configuration in the right spot and that it is syntactically correct.
I had a quick glance at StandardSitemapResolverFactory.java
but couldn't find any glaring issue (such as spelling differences). Perhaps it's user error after all?
You did everything fine. The lenient flag is not carried through for some reason. I will investigate and provide a fix.
I made a new snapshot release with a fix. The "lenient" flag is now honored. 939,747 start URLs were identified, but not all would be processed since some were rejected.
There were many rejections due to robots.txt rules. You can ignore robots rules if you want:
<robotsTxt ignore="true" />
You probably also want to set the "ignore" flag to true on the sitemapResolverFactory
tag; otherwise it will try to locate sitemaps at the usual locations and you may end up processing some of them more than once. Sitemap handling has evolved over time and that ignore flag is no longer the most intuitive: it basically means "do not try to guess where sitemaps could be located" (there is no need, since you supply the sitemap yourself as a start URL).
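Put together, a sketch of the suggested crawler configuration (reusing the ids and URLs from the config posted above) could look like this:

```xml
<crawler id="sitemap-test-crawler">
  <!-- ignore="true": do not guess sitemap locations;
       the sitemap is supplied explicitly as a start URL below -->
  <sitemapResolverFactory ignore="true" lenient="true"
      class="com.norconex.collector.http.sitemap.impl.StandardSitemapResolverFactory" />
  <!-- Skip robots.txt rules entirely -->
  <robotsTxt ignore="true" />
  <startURLs>
    <sitemap>http://www.mascus.com/sitemap_index_com.xml</sitemap>
  </startURLs>
</crawler>
```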
With that many URLs for a single crawl instance, note that in my tests it got a bit slower right after the sitemaps were processed. Give it some time and it will eventually resume and process each URL.
Let me know how that goes.
Pascal, thank you very much for the excellent resolution time!
I have tested the latest snapshot and can confirm that the issue has been resolved.
Re the number of pages, we actually filter these down significantly on our end using referenceFilters so the crawl size won't be a problem. We just needed to get the discovery phase fixed :)
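For readers following along, a referenceFilters sketch of the kind being described might look like the following (the filter class is the stock regex filter shipped with Collector Core; the pattern itself is purely illustrative, not the actual filter used):

```xml
<referenceFilters>
  <!-- Keep only matching URLs; everything else is rejected (illustrative regex) -->
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
      onMatch="include">http://www\.mascus\.com/agriculture/.*</filter>
</referenceFilters>
```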
Re the sitemap ignore flag, in our actual deployment we use the sitemap auto-discovery feature (instead of specifying the sitemap as the start URL) to handle potential sitemap moves automatically, so setting the ignore flag would be counter-productive in our particular use-case.
Re the robots rejections, I think there might be a bug (or lack of feature) that leads to robots rules meant for specific "User-agent"s to be applied globally (even if the collector's user agent doesn't match). I will investigate further and open a new issue if necessary.
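For reference, the user agent that robots.txt "User-agent" groups should be matched against is the one configured on the HTTP client. A sketch, assuming the stock httpClientFactory's userAgent element (the value itself is a made-up example):

```xml
<crawler id="sitemap-test-crawler">
  <httpClientFactory>
    <!-- The agent string robots.txt rules should be matched against -->
    <userAgent>my-crawler/1.0 (+http://example.com/bot)</userAgent>
  </httpClientFactory>
</crawler>
```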
Thanks again!
Great! Thanks for providing feedback.
We are evaluating Norconex HTTP Collector as a replacement for a custom-built web crawler. One of the domains that we would want to crawl is mascus.com, which provides a few dozen sitemaps, all referenced from http://www.mascus.com/sitemap_index_com.xml.
When starting the crawl at that sitemap index, Norconex correctly resolves all the linked sitemaps but then ends up saying "0 start URLs identified" and finishes the crawl. Does it choke on some peculiarity of those specific sitemaps? I have run tests against a few other sites where the <loc>s from their sitemaps were ingested correctly. I have used both the 2.3.0 release as well as the latest snapshot for testing. To (hopefully!) exclude mis-configuration as a potential cause, I have used the simplest configuration I could come up with:
This is the result:
For easier reading, I have removed the remaining "Resolving sitemap", "Resolved" log entries. You can find the entire log at https://gist.github.com/niels/f733203c3d43d0d3cea9.