Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or a filesystem and storing it in various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Tries to follow links with "tel:" schema #212

Closed · niels closed this issue 8 years ago

niels commented 8 years ago

Given

A page linking to a tel: URI:

<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Norconex test</title>
  </head>

  <body>
    <a href="tel:123">Phone Number</a>
  </body>
</html>

And the following config:

<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="test-collector">
  <crawlers>
    <crawler id="test-crawler">
      <startURLs>
        <url>https://herimedia.com/norconex-test/phone.html</url>
      </startURLs>
    </crawler>
  </crawlers>
</httpcollector>

Expected

The collector should not follow this link – or any other link whose scheme it cannot actually process.

Actual

The collector tries to follow the tel: link.

INFO  [AbstractCollectorConfig] Configuration loaded: id=test-collector; logsDir=./logs; progressDir=./progress
INFO  [JobSuite] JEF work directory is: ./progress
INFO  [JobSuite] JEF log manager is : FileLogManager
INFO  [JobSuite] JEF job status store is : FileJobStatusStore
INFO  [AbstractCollector] Suite of 1 crawler jobs created.
INFO  [JobSuite] Initialization...
INFO  [JobSuite] No previous execution detected.
INFO  [JobSuite] Starting execution.
INFO  [AbstractCollector] Version: Norconex HTTP Collector 2.4.0-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Collector Core 1.4.0-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Importer 2.5.0-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex JEF 4.0.7 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Committer Core 2.0.3 (Norconex Inc.)
INFO  [JobSuite] Running test-crawler: BEGIN (Fri Jan 08 16:21:17 CET 2016)
INFO  [MapDBCrawlDataStore] Initializing reference store ./work/crawlstore/mapdb/test-crawler/
INFO  [MapDBCrawlDataStore] ./work/crawlstore/mapdb/test-crawler/: Done initializing databases.
INFO  [HttpCrawler] test-crawler: RobotsTxt support: true
INFO  [HttpCrawler] test-crawler: RobotsMeta support: true
INFO  [HttpCrawler] test-crawler: Sitemap support: true
INFO  [HttpCrawler] test-crawler: Canonical links support: true
INFO  [HttpCrawler] test-crawler: User-Agent: <None specified>
INFO  [SitemapStore] test-crawler: Initializing sitemap store...
INFO  [SitemapStore] test-crawler: Done initializing sitemap store.
INFO  [HttpCrawler] 1 start URLs identified.
INFO  [CrawlerEventManager]           CRAWLER_STARTED
INFO  [AbstractCrawler] test-crawler: Crawling references...
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: https://herimedia.com/norconex-test/phone.html
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: https://herimedia.com/norconex-test/phone.html
INFO  [CrawlerEventManager]            URLS_EXTRACTED: https://herimedia.com/norconex-test/phone.html
INFO  [CrawlerEventManager]         DOCUMENT_IMPORTED: https://herimedia.com/norconex-test/phone.html
INFO  [CrawlerEventManager]    DOCUMENT_COMMITTED_ADD: https://herimedia.com/norconex-test/phone.html
INFO  [CrawlerEventManager]         REJECTED_NOTFOUND: https://herimedia.com/norconex-test/tel:123
INFO  [AbstractCrawler] test-crawler: Re-processing orphan references (if any)...
INFO  [AbstractCrawler] test-crawler: Reprocessed 0 orphan references...
INFO  [AbstractCrawler] test-crawler: 2 reference(s) processed.
INFO  [CrawlerEventManager]          CRAWLER_FINISHED
INFO  [AbstractCrawler] test-crawler: Crawler completed.
INFO  [AbstractCrawler] test-crawler: Crawler executed in 6 seconds.
INFO  [MapDBCrawlDataStore] Closing reference store: ./work/crawlstore/mapdb/test-crawler/
INFO  [JobSuite] Running test-crawler: END (Fri Jan 08 16:21:17 CET 2016)

Note the REJECTED_NOTFOUND: https://herimedia.com/norconex-test/tel:123 message.

essiembre commented 8 years ago

By default, GenericLinkExtractor now only handles these URL schemes: http, https, and ftp. This can be overridden like this:

<extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor">
  <schemes>http, https, somefunnyone</schemes>
</extractor>
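
For context, here is how that extractor declaration would fit into the test config from the original report. This is only a sketch based on my understanding of the 2.x configuration layout: the <linkExtractors> placement inside <crawler> is an assumption and not quoted from this thread, so verify it against the documentation for your version.

<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="test-collector">
  <crawlers>
    <crawler id="test-crawler">
      <startURLs>
        <url>https://herimedia.com/norconex-test/phone.html</url>
      </startURLs>
      <!-- Assumed placement: link extractors are declared per crawler. -->
      <linkExtractors>
        <extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor">
          <!-- Only links using these schemes are extracted and queued for crawling. -->
          <schemes>http, https</schemes>
        </extractor>
      </linkExtractors>
    </crawler>
  </crawlers>
</httpcollector>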

I did not upgrade the TikaLinkExtractor with the same ability, since people may want to use the Tika implementation for what it is meant to do out of the box (which appears to be extracting URIs for all schemes).
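
If you stay on TikaLinkExtractor (or on an older snapshot without the new scheme restriction) and still want to keep tel: and similar references out of the queue, a reference filter should work as a stopgap. A rough sketch, assuming the Collector Core RegexReferenceFilter and the per-crawler <referenceFilters> element are available in your version (double-check both against your release):

<crawler id="test-crawler">
  <startURLs>
    <url>https://herimedia.com/norconex-test/phone.html</url>
  </startURLs>
  <!-- Keep only http(s) references; anything else (tel:, mailto:, ...) is rejected before fetching. -->
  <referenceFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
            onMatch="include">https?://.*</filter>
  </referenceFilters>
</crawler>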

This has been added to the latest snapshot.

Because this new logic extracts fewer links (which is what we want), I hope it won't cause regressions for anyone.

niels commented 8 years ago

Confirmed working. Thank you!