Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

REJECTED_TOO_DEEP when maxDepth is set to a high enough value #268

Closed. olgapshen closed this issue 8 years ago

olgapshen commented 8 years ago

Hi, I have the following info in my log:

INFO [CrawlerEventManager] REJECTED_TOO_DEEP: https://en.wikipedia.org/wiki/Spanish_conquest_of_Yucat%C3%A1n

But I set: <maxDepth>5</maxDepth>

The depth shown in the info line just before that one is two...

What is wrong?

essiembre commented 8 years ago

Maybe there is an error in your config. Feel free to attach it.

Otherwise, it is rejected because the number of pages one needs to go through to reach it (from your start URL) is too high. Can you confirm that from your start URL it takes less than 5 link "clicks" to get to the URL you mention?
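
To illustrate how the depth is counted (a minimal sketch reusing the <maxDepth> value from your config; the chain of pages is hypothetical, not taken from your crawl):

<!-- Depth is the number of link "clicks" from the start URL,
     not the number of path segments in the rejected URL.
       depth 0: the start URL itself
       depth 1: a page linked from the start URL
       depth 2: a page linked from a depth 1 page
       ...
     With the setting below, a page first reached only after more
     than 5 clicks is rejected with REJECTED_TOO_DEEP, even if its
     URL path looks shallow. -->
<maxDepth>5</maxDepth>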

olgapshen commented 8 years ago

OK, that's it! It is the chain length. I thought it was the number of path segments in the URL. Thank you. Anyway, I will attach the config.

<?xml version="1.0" encoding="UTF-8"?>
<!--
   Copyright 2010-2015 Norconex Inc.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<!-- This configuration shows the minimum required and basic recommendations
     to run a crawler.
     -->
<httpcollector id="Minimum Config HTTP Collector">

  <!-- Decide where to store generated files. -->
  <progressDir>./tigrsoft/progress</progressDir>
  <logsDir>./tigrsoft/logs</logsDir>

  <crawlers>
    <crawler id="Norconex Minimum Test Page">

      <!-- Requires at least one start URL (or urlsFile).
           Optionally limit crawling to same protocol/domain/port as
           start URLs. -->
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>https://en.wikipedia.org/wiki</url>
      </startURLs>

      <!-- === Recommendations: ============================================ -->

      <!-- Specify a crawler default directory where to generate files. -->
      <workDir>./tigrsoft</workDir>

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>5</maxDepth>

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <!-- Before 2.3.0: -->
      <sitemap ignore="true" />
      <!-- Since 2.3.0: -->
      <sitemapResolverFactory ignore="true" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="5000" />

      <documentFilters>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
                onMatch="include"
                caseSensitive="false">tiger</filter>
      </documentFilters>

      <!-- Decide what to do with your files by specifying a Committer. -->
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./tigrsoft/crawledFiles</directory>
      </committer>

    </crawler>
  </crawlers>

</httpcollector>

essiembre commented 8 years ago

Thanks for confirming. I can tell you nothing seems wrong with your config.