Closed: olgapshen closed this issue 8 years ago
Maybe there is an error in your config. Feel free to attach it.
Otherwise, it is rejected because the number of pages one needs to go through to reach it (from your start URL) is too high. Can you confirm that from your start URL it takes less than 5 link "clicks" to get to the URL you mention?
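To illustrate the "link clicks" idea, here is a minimal sketch in Python, not the crawler's actual code. It assumes the start URL has depth 0, each followed link adds 1, and a URL is rejected once its depth exceeds the maximum. The function name `crawl_depths` and the example.org URLs are made up for the illustration.

```python
from collections import deque


def crawl_depths(links, start_url, max_depth):
    """Breadth-first walk over a link graph.

    links maps a URL to the URLs it links to (hypothetical data).
    Returns (depths, rejected): depths holds the click count from
    start_url for every accepted URL, rejected holds URLs that were
    first reached beyond max_depth.
    """
    depths = {start_url: 0}  # assumption: the start URL counts as depth 0
    rejected = set()
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        for target in links.get(url, []):
            if target in depths or target in rejected:
                continue  # already seen via a path that is no longer
            depth = depths[url] + 1
            if depth > max_depth:
                rejected.add(target)  # comparable to REJECTED_TOO_DEEP
                continue
            depths[target] = depth
            queue.append(target)
    return depths, rejected


# Every URL here has a single path segment, yet each is one click
# further from the start URL than the previous one.
links = {
    "https://example.org/": ["https://example.org/a"],
    "https://example.org/a": ["https://example.org/b"],
    "https://example.org/b": ["https://example.org/c"],
}
depths, rejected = crawl_depths(links, "https://example.org/", max_depth=2)
```

In this sketch `https://example.org/c` ends up in `rejected` even though its URL is as short as the others, because depth is counted along the chain of links rather than from the shape of the URL.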
OK! That's it! )) It is the chain length ) I thought it was the count of nodes in the URL. Thank you ) Anyway, I will attach the config.
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright 2010-2015 Norconex Inc.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!-- This configuration shows the minimum required and basic recommendations
to run a crawler.
-->
<httpcollector id="Minimum Config HTTP Collector">
  <!-- Decide where to store generated files. -->
  <progressDir>./tigrsoft/progress</progressDir>
  <logsDir>./tigrsoft/logs</logsDir>
  <crawlers>
    <crawler id="Norconex Minimum Test Page">
      <!-- Requires at least one start URL (or urlsFile).
           Optionally limit crawling to same protocol/domain/port as
           start URLs. -->
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>https://en.wikipedia.org/wiki</url>
      </startURLs>
      <!-- === Recommendations: ============================================ -->
      <!-- Specify a crawler default directory where to generate files. -->
      <workDir>./tigrsoft</workDir>
      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>5</maxDepth>
      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <!-- Before 2.3.0: -->
      <sitemap ignore="true" />
      <!-- Since 2.3.0: -->
      <sitemapResolverFactory ignore="true" />
      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="5000" />
      <documentFilters>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
                onMatch="include"
                caseSensitive="false">tiger</filter>
      </documentFilters>
      <!-- Decide what to do with your files by specifying a Committer. -->
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./tigrsoft/crawledFiles</directory>
      </committer>
    </crawler>
  </crawlers>
</httpcollector>
Thanks for confirming. I can tell you nothing seems wrong with your config.
Hi, I have the following info in my log: INFO [CrawlerEventManager] REJECTED_TOO_DEEP: https://en.wikipedia.org/wiki/Spanish_conquest_of_Yucat%C3%A1n
But I set maxDepth to 5.
The depth in the info line before it is two...
What is wrong?