Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Canonical link handling when stayOnDomain=true #527

Closed: github-il closed this issue 4 years ago

github-il commented 6 years ago

Hi, what is the expected behavior when you encounter a canonical link in a document which points to another domain, and you have stayOnDomain set to true?

I'm seeing that the canonical link is followed; however, this seems counter-intuitive to me.
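(For reference, the crawl log below reports "Canonical links support: true", so canonical resolution is clearly enabled. If I only wanted to suppress that behavior, I assume I could turn it off in the crawler config with something like the sketch below, though that would just sidestep the stayOnDomain question rather than answer it.)

<crawler id="Norconex Minimum Test Page">
  ...
  <!-- Assumed 2.x crawler option: skip rel="canonical" resolution entirely -->
  <ignoreCanonicalLinks>true</ignoreCanonicalLinks>
  ...
</crawler>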

I set up an Ubuntu VM, installed Apache and the http-collector (2.8.1) software, and added a canonical link into the default index.html page as shown below:

diff index.html index.html-orig
11d10
<     <link rel="canonical" href="https://www.apache.org/" />

I then modified the minimum-config.xml to point to the local Apache server and set the depth to 1. The example config has stayOnDomain set to true.
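For clarity, the relevant part of my modified config looks roughly like this (a sketch only, following the element and attribute names used in the 2.8.1 minimum example; the localhost URL is my test server):

<crawler id="Norconex Minimum Test Page">
  <!-- Stay on the start URL's domain -->
  <startURLs stayOnDomain="true">
    <url>http://localhost/</url>
  </startURLs>
  <!-- Only follow links one level deep from the start URL -->
  <maxDepth>1</maxDepth>
  ...
</crawler>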

However, when I run the crawl I see that the canonical link is followed:

Oct 03, 2018 4:58:29 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

INFO  [AbstractCollectorConfig] Configuration loaded: id=Minimum Config HTTP Collector; logsDir=./examples-output/minimum/logs; progressDir=./examples-output/minimum/progress
INFO  [JobSuite] JEF work directory is: ./examples-output/minimum/progress
INFO  [JobSuite] JEF log manager is : FileLogManager
INFO  [JobSuite] JEF job status store is : FileJobStatusStore
INFO  [AbstractCollector] Suite of 1 crawler jobs created.
INFO  [JobSuite] Initialization...
INFO  [JobSuite] Previous execution detected.
INFO  [JobSuite] Backing up previous execution status and log files.
INFO  [JobSuite] Starting execution.
INFO  [AbstractCollector] Version: Norconex HTTP Collector 2.8.1 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Collector Core 1.9.1 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Importer 2.9.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex JEF 4.1.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Committer Core 2.1.2 (Norconex Inc.)
INFO  [JobSuite] Running Norconex Minimum Test Page: BEGIN (Wed Oct 03 16:58:30 UTC 2018)
INFO  [HttpCrawler] Norconex Minimum Test Page: RobotsTxt support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: RobotsMeta support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: Sitemap support: false
INFO  [HttpCrawler] Norconex Minimum Test Page: Canonical links support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: User-Agent: <None specified>
INFO  [SitemapStore] Norconex Minimum Test Page: Initializing sitemap store...
INFO  [SitemapStore] Norconex Minimum Test Page: Done initializing sitemap store.
INFO  [HttpCrawler] 1 start URLs identified.
INFO  [CrawlerEventManager]           CRAWLER_STARTED
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawling references...
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: http://localhost
INFO  [CrawlerEventManager]     REJECTED_NONCANONICAL: http://localhost
INFO  [CrawlerEventManager] DOCUMENT_COMMITTED_REMOVE: http://localhost
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: https://www.apache.org/
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: https://www.apache.org/
INFO  [CrawlerEventManager]            URLS_EXTRACTED: https://www.apache.org/
INFO  [CrawlerEventManager]         DOCUMENT_IMPORTED: https://www.apache.org/
INFO  [CrawlerEventManager]    DOCUMENT_COMMITTED_ADD: https://www.apache.org/
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: https://www.apache.org/foundation/policies/privacy.html
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: https://www.apache.org/foundation/policies/privacy.html
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: https://www.apache.org/foundation/getinvolved.html

Is this a bug, or expected behavior?

Thanks.

essiembre commented 6 years ago

No, it is not the expected behavior. I noticed this line in your logs though:

INFO [JobSuite] Previous execution detected.

If you previously ran it with a different domain configured, it is possible those references were cached and are being re-verified. You can change the default "orphan strategy" if that is the case. A safer approach, whenever you make a significant change to the config (or when unsure), is to clean the work directory (especially the crawl store) and try again. Let me know if that fixes it. Otherwise, can you please share your config?
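For example, assuming the standard <orphansStrategy> crawler option, you could tell the crawler to ignore references left over from a previous run instead of reprocessing them (a sketch; PROCESS is the default):

<crawler id="Norconex Minimum Test Page">
  ...
  <!-- Do not re-verify references cached from a previous run
       (PROCESS is the default; DELETE is also available). -->
  <orphansStrategy>IGNORE</orphansStrategy>
  ...
</crawler>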

github-il commented 6 years ago

Pascal,

Thanks for your update.

I removed the work directory and re-ran, and I still see the same behavior.

I have attached the config file (canonical.xml), the Apache index.html page I am using, and the log file (debug.log) in the following zip file: files.zip

Thanks

essiembre commented 5 years ago

I had time to reproduce and provide a fix. You can find it in the latest snapshot. You may have to start fresh (as it will try to reprocess orphans by default).

Please confirm.