Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

No harvest, despite a fertile seed #193

Closed niels closed 8 years ago

niels commented 8 years ago

We are evaluating Norconex HTTP Collector as a replacement for a custom-built web crawler. One of the domains that we would want to crawl is mascus.com who provide a few dozen sitemaps all referenced from http://www.mascus.com/sitemap_index_com.xml.

When starting the crawl at that sitemap index, Norconex correctly resolves all the linked sitemaps but then ends up saying "0 start URLs identified" and finishing the crawl. Does it choke on some peculiarity of those specific sitemaps? I have run tests against a few other sites where the <loc>s from their sitemaps were ingested correctly.

I have used both the 2.3.0 release and the latest snapshot for testing. To (hopefully!) exclude misconfiguration as a potential cause, I have used the simplest configuration I could come up with:

<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="sitemap-test">
  <crawlers>
    <crawler id="sitemap-test-crawler">
      <startURLs>
        <sitemap>http://www.mascus.com/sitemap_index_com.xml</sitemap>
      </startURLs>
    </crawler>
  </crawlers>
</httpcollector>

This is the result:

>> ./collector-http.sh -a start -c sitemap-test.xml 
INFO  [AbstractCollectorConfig] Configuration loaded: id=sitemap-test; logsDir=./logs; progressDir=./progress
INFO  [JobSuite] JEF work directory is: ./progress
INFO  [JobSuite] JEF log manager is : FileLogManager
INFO  [JobSuite] JEF job status store is : FileJobStatusStore
INFO  [AbstractCollector] Suite of 1 crawler jobs created.
INFO  [JobSuite] Initialization...
INFO  [JobSuite] No previous execution detected.
INFO  [JobSuite] Starting execution.
INFO  [AbstractCollector] Version: Norconex HTTP Collector 2.4.0-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Collector Core 1.3.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Importer 2.4.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex JEF 4.0.7 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Committer Core 2.0.3 (Norconex Inc.)
INFO  [JobSuite] Running sitemap-test-crawler: BEGIN (Fri Dec 04 11:23:54 CET 2015)
INFO  [MapDBCrawlDataStore] Initializing reference store ./work/crawlstore/mapdb/sitemap-test-crawler/
INFO  [MapDBCrawlDataStore] ./work/crawlstore/mapdb/sitemap-test-crawler/: Done initializing databases.
INFO  [HttpCrawler] sitemap-test-crawler: RobotsTxt support: true
INFO  [HttpCrawler] sitemap-test-crawler: RobotsMeta support: true
INFO  [HttpCrawler] sitemap-test-crawler: Sitemap support: true
INFO  [HttpCrawler] sitemap-test-crawler: Canonical links support: true
INFO  [HttpCrawler] sitemap-test-crawler: User-Agent: <None specified>
INFO  [SitemapStore] sitemap-test-crawler: Initializing sitemap store...
INFO  [SitemapStore] sitemap-test-crawler: Done initializing sitemap store.
INFO  [StandardSitemapResolver] Resolving sitemap: http://www.mascus.com/sitemap_index_com.xml
INFO  [StandardSitemapResolver] Resolving sitemap: http://www.mascus.com/googleXML/sitemaps/com/com_adArchive.xml
INFO  [StandardSitemapResolver]          Resolved: http://www.mascus.com/googleXML/sitemaps/com/com_adArchive.xml
INFO  [StandardSitemapResolver] Resolving sitemap: http://www.mascus.com/googleXML/sitemaps/com/com_adIndex.xml
INFO  [StandardSitemapResolver]          Resolved: http://www.mascus.com/googleXML/sitemaps/com/com_adIndex.xml
INFO  [StandardSitemapResolver] Resolving sitemap: http://www.mascus.com/googleXML/sitemaps/com/com_agriculture_allbrands_browse.xml
INFO  [StandardSitemapResolver]          Resolved: http://www.mascus.com/googleXML/sitemaps/com/com_agriculture_allbrands_browse.xml
INFO  [StandardSitemapResolver] Resolving sitemap: http://www.mascus.com/googleXML/sitemaps/com/com_agriculture_brand_browse.xml
INFO  [StandardSitemapResolver]          Resolved: http://www.mascus.com/googleXML/sitemaps/com/com_agriculture_brand_browse.xml
INFO  [StandardSitemapResolver] Resolving sitemap: http://www.mascus.com/googleXML/sitemaps/com/com_agriculture_brand_country.xml
INFO  [StandardSitemapResolver]          Resolved: http://www.mascus.com/googleXML/sitemaps/com/com_agriculture_brand_country.xml

[… SNIP …]

INFO  [StandardSitemapResolver]          Resolved: http://www.mascus.com/sitemap_index_com.xml
INFO  [HttpCrawler] 0 start URLs identified.
INFO  [CrawlerEventManager]           CRAWLER_STARTED
INFO  [AbstractCrawler] sitemap-test-crawler: Crawling references...
INFO  [AbstractCrawler] sitemap-test-crawler: Re-processing orphan references (if any)...
INFO  [AbstractCrawler] sitemap-test-crawler: Reprocessed 0 orphan references...
INFO  [AbstractCrawler] sitemap-test-crawler: 0 reference(s) processed.
INFO  [CrawlerEventManager]          CRAWLER_FINISHED
INFO  [AbstractCrawler] sitemap-test-crawler: Crawler completed.
INFO  [AbstractCrawler] sitemap-test-crawler: Crawler executed in 1 minute 18 seconds.
INFO  [MapDBCrawlDataStore] Closing reference store: ./work/crawlstore/mapdb/sitemap-test-crawler/
INFO  [JobSuite] Running sitemap-test-crawler: END (Fri Dec 04 11:23:54 CET 2015)

For easier reading, I have removed the remaining "Resolving sitemap", "Resolved" log entries. You can find the entire log at https://gist.github.com/niels/f733203c3d43d0d3cea9.

niels commented 8 years ago

Since the sitemaps are not under the root path, I also tried adding the following, but setting lenient="true" did not fix the issue.

    <sitemapResolverFactory
      class="com.norconex.collector.http.sitemap.impl.StandardSitemapResolverFactory"
      ignore="false"
      lenient="true"
    />
essiembre commented 8 years ago

I have not had a chance to give it a try yet, but in the meantime, have you tried changing the log level to DEBUG for the relevant loggers in the classes/log4j.properties file? It may give you more insight.

niels commented 8 years ago

The debug messages were what pointed me towards the path issue. Sorry for not including them in my original report! A sample message would be:

DEBUG [StandardSitemapResolver] Sitemap URL invalid for location directory. URL:http://www.mascus.com/adarchive/agriculture/m/75 Location directory: http://www.mascus.com/googleXML/sitemaps/com

Setting the lenient option seems to do nothing both in v2.3.0 as well as the latest snapshot. For your reference, here is the complete config:

<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="sitemap-test">
  <crawlers>
    <crawler id="sitemap-test-crawler">
      <sitemapResolverFactory ignore="false" lenient="true" class="com.norconex.collector.http.sitemap.impl.StandardSitemapResolverFactory" />

      <startURLs>
        <sitemap>http://www.mascus.com/sitemap_index_com.xml</sitemap>
      </startURLs>
    </crawler>
  </crawlers>
</httpcollector>

Note that setting ignore="true" does have the intended effect (i.e. it stops the sitemap from being pulled in at all), so I think I added the configuration in the right spot and that it is syntactically correct.

I had a quick glance at StandardSitemapResolverFactory.java but couldn't find any glaring issue (such as spelling differences). Perhaps it's a user error after all?

essiembre commented 8 years ago

You did everything fine. The lenient flag is not carried through for some reason. I will investigate and provide a fix.

essiembre commented 8 years ago

I made a new snapshot release with a fix. The "lenient" flag is now honored. 939,747 start URLs were identified, but not all would be processed since some were rejected.

There were many rejections due to robots.txt rules. You can ignore robot rules if you want:

<robotsTxt ignore="true" />

You probably also want to set the "ignore" flag to true on the sitemapResolverFactory tag; otherwise the crawler will try to locate sitemaps at the usual locations and you may end up processing some of them more than once. Sitemap handling has evolved over time and that ignore flag is no longer the most intuitive: it simply means the crawler will not try to guess where sitemaps could be located (there is no need, since you supply one yourself as a start URL).
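Putting the two suggestions together, a config for this scenario might look like the following sketch (the class name is the one used earlier in this thread; adjust to taste):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="sitemap-test">
  <crawlers>
    <crawler id="sitemap-test-crawler">
      <!-- ignore="true": skip sitemap auto-discovery since the sitemap
           is supplied explicitly as a start URL below.
           lenient="true": accept URLs outside the sitemap's directory. -->
      <sitemapResolverFactory ignore="true" lenient="true"
        class="com.norconex.collector.http.sitemap.impl.StandardSitemapResolverFactory" />

      <!-- Disregard robots.txt rejections (use with care). -->
      <robotsTxt ignore="true" />

      <startURLs>
        <sitemap>http://www.mascus.com/sitemap_index_com.xml</sitemap>
      </startURLs>
    </crawler>
  </crawlers>
</httpcollector>
```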

With that many URLs for a single crawl instance, note that in my tests it got a bit slower right after the sitemaps were processed. Give it some time after that and it will eventually resume and process each URL.

Let me know how that goes.

niels commented 8 years ago

Pascal, thank you very much for the excellent resolution time!

I have tested the latest snapshot and can confirm that the issue has been resolved.

Re the number of pages, we actually filter these down significantly on our end using referenceFilters so the crawl size won't be a problem. We just needed to get the discovery phase fixed :)
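For anyone else reading along, our reference filtering looks roughly like this (a sketch; RegexReferenceFilter comes from Norconex Collector Core, and the pattern here is purely illustrative, not our actual rules):

```xml
<referenceFilters>
  <!-- Keep only agriculture listing pages; all other URLs are rejected. -->
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
          onMatch="include">http://www\.mascus\.com/agriculture/.*</filter>
</referenceFilters>
```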

Re the sitemap ignore flag, in our actual deployment we use the sitemap auto-discovery feature (instead of specifying the sitemap as the start URL) to handle potential sitemap moves automatically, so setting the ignore flag would be counter-productive in our particular use-case.

Re the robots rejections, I think there might be a bug (or missing feature) that causes robots rules meant for a specific "User-agent" to be applied globally, even when the collector's user agent doesn't match. I will investigate further and open a new issue if necessary.
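One thing I noticed while testing: the startup log above reports User-Agent: <None specified>, so user-agent-specific robots.txt sections have nothing to match against. Setting one explicitly may be worth trying first; if I read the 2.x config docs correctly, it is a crawler-level element (the agent string below is just an example):

```xml
<crawler id="sitemap-test-crawler">
  <!-- Identify the crawler so robots.txt "User-agent" sections can match it. -->
  <userAgent>example-bot/1.0 (+http://example.com/bot-info)</userAgent>
  <!-- remaining crawler settings unchanged -->
</crawler>
```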

Thanks again!

essiembre commented 8 years ago

Great! Thanks for providing feedback.