Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Minimal Example does not work... #252

Closed: liar666 closed this issue 8 years ago

liar666 commented 8 years ago

I've:

The output was:

[non-job]: 2016-06-07 16:47:20 INFO - Starting execution.
[non-job]: 2016-06-07 16:47:20 INFO - Version: Norconex HTTP Collector 2.5.0 (Norconex Inc.)
[non-job]: 2016-06-07 16:47:20 INFO - Version: Norconex Collector Core 1.5.0 (Norconex Inc.)
[non-job]: 2016-06-07 16:47:20 INFO - Version: Norconex Importer 2.5.2 (Norconex Inc.)
[non-job]: 2016-06-07 16:47:20 INFO - Version: Norconex JEF 4.0.7 (Norconex Inc.)
[non-job]: 2016-06-07 16:47:20 INFO - Version: Norconex Committer Core 2.0.3 (Norconex Inc.)
Norconex Minimum Test Page: 2016-06-07 16:47:20 INFO - Running Norconex Minimum Test Page: BEGIN (Tue Jun 07 16:47:20 CEST 2016)
Norconex Minimum Test Page: 2016-06-07 16:47:20 INFO - Norconex Minimum Test Page: RobotsTxt support: true
Norconex Minimum Test Page: 2016-06-07 16:47:20 INFO - Norconex Minimum Test Page: RobotsMeta support: true
Norconex Minimum Test Page: 2016-06-07 16:47:20 INFO - Norconex Minimum Test Page: Sitemap support: false
Norconex Minimum Test Page: 2016-06-07 16:47:20 INFO - Norconex Minimum Test Page: Canonical links support: true
Norconex Minimum Test Page: 2016-06-07 16:47:20 INFO - Norconex Minimum Test Page: User-Agent: <None specified>
Norconex Minimum Test Page: 2016-06-07 16:47:21 INFO - Norconex Minimum Test Page: Initializing sitemap store...
Norconex Minimum Test Page: 2016-06-07 16:47:21 INFO - Norconex Minimum Test Page: Done initializing sitemap store.
Norconex Minimum Test Page: 2016-06-07 16:47:22 INFO - 1 start URLs identified.
Norconex Minimum Test Page: 2016-06-07 16:47:22 INFO -           CRAWLER_STARTED
Norconex Minimum Test Page: 2016-06-07 16:47:22 INFO - Norconex Minimum Test Page: Crawling references...
Norconex Minimum Test Page: 2016-06-07 16:47:22 INFO -       REJECTED_REDIRECTED: http://www.norconex.com/product/collector-http-test/minimum.php
Norconex Minimum Test Page: 2016-06-07 16:47:22 INFO -           REJECTED_FILTER: https://www.norconex.com/product/collector-http-test/minimum.php
Norconex Minimum Test Page: 2016-06-07 16:47:22 INFO - Norconex Minimum Test Page: Re-processing orphan references (if any)...
Norconex Minimum Test Page: 2016-06-07 16:47:22 INFO - Norconex Minimum Test Page: Reprocessed 0 orphan references...
Norconex Minimum Test Page: 2016-06-07 16:47:22 INFO - Norconex Minimum Test Page: Crawler finishing: committing documents.
Norconex Minimum Test Page: 2016-06-07 16:47:22 INFO - Norconex Minimum Test Page: 1 reference(s) processed.
Norconex Minimum Test Page: 2016-06-07 16:47:22 INFO -          CRAWLER_FINISHED
Norconex Minimum Test Page: 2016-06-07 16:47:22 INFO - Norconex Minimum Test Page: Crawler completed.
Norconex Minimum Test Page: 2016-06-07 16:47:22 INFO - Norconex Minimum Test Page: Crawler executed in 2 seconds.
Norconex Minimum Test Page: 2016-06-07 16:47:22 INFO - Running Norconex Minimum Test Page: END (Tue Jun 07 16:47:20 CEST 2016)
liar666 commented 8 years ago

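OK, found the problem: the start URL uses http://, which redirects to https:// when accessed, and the https:// URL is then rejected by the reference filter (see the REJECTED_REDIRECTED and REJECTED_FILTER lines in the log above).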

Modifying the example with the following lines did the trick for me:

https://www.norconex.com/product/collector-http-test/minimum.php
  <referenceFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
      https?://www\.norconex\.com/.*
    </filter>
  </referenceFilters>
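
To illustrate why the `https?` change matters, here is a minimal sketch in Python (not Norconex code). It assumes the sample's original include pattern only allowed `http://` (the original filter is not shown in this thread, so `ORIGINAL_PATTERN` is a guess), and it assumes the filter has to match the whole reference. The helper `is_included` is hypothetical and exists only for this illustration.

```python
import re

# Assumption: the shipped sample filter only allowed the http:// scheme.
ORIGINAL_PATTERN = r"http://www\.norconex\.com/.*"
# Pattern from the fix above: the "s" is optional, so both schemes pass.
FIXED_PATTERN = r"https?://www\.norconex\.com/.*"

start_url = "http://www.norconex.com/product/collector-http-test/minimum.php"
redirect_target = "https://www.norconex.com/product/collector-http-test/minimum.php"


def is_included(pattern: str, url: str) -> bool:
    """Return True if the whole URL matches the include pattern."""
    return re.fullmatch(pattern, url) is not None


for name, pattern in (("original", ORIGINAL_PATTERN), ("fixed", FIXED_PATTERN)):
    # Print whether the start URL and the redirect target each pass the filter.
    print(name, is_included(pattern, start_url), is_included(pattern, redirect_target))
```

Under these assumptions, the http-only pattern accepts the start URL but not the https:// redirect target, which would explain the REJECTED_FILTER line, while the `https?` pattern accepts both.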
essiembre commented 8 years ago

I have updated the sample configuration files to now point to https instead of http (for the next release).

I have already updated the online copies to reflect this:

Thanks for reporting this.