Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Mishandling of UTF-8 in redirect targets #199

Closed niels closed 8 years ago

niels commented 8 years ago

Given a redirect from http://www.mascus.com/agriculture/used-other-tractor-accessories/%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0-%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0/5pen7jcp.html to http://www.mascus.com/agriculture/used-other-tractor-accessories/other-гідравліка-спецтехніка/5pen7jcp.html, the collector somehow chokes on the Cyrillic characters in the (new) target URL:

Redirect:

$ curl -I "http://www.mascus.com/agriculture/used-other-tractor-accessories/%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0-%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0/5pen7jcp.html"
HTTP/1.1 301 Moved Permanently
Cache-Control: no-cache, no-store
Pragma: no-cache
Content-Length: 0
Content-Type: text/html
Expires: -1
Location: http://www.mascus.com/agriculture/used-other-tractor-accessories/other-гідравліка-спецтехніка/5pen7jcp.html
Server: Microsoft-IIS/7.5
X-AspNet-Version: 4.0.30319
Set-Cookie: MascusSettings=sesid={B4FD7CCB-F4BA-4BDE-AE26-CA7E99821A29}&u_pa_country=US&s_language=EN&s_currency=USD&s_system=imperial&s_power=hp&s_distance=mil&s_weight=lbs&s_width=feet; path=/; HttpOnly
X-Powered-By: ASP.NET
P3P: CP="NOI DSP LAW NID"
Date: Wed, 09 Dec 2015 10:52:59 GMT

Test-Case Config

<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="test-collector">
  <crawlers>
    <crawler id="test-crawler">
      <sitemapResolverFactory class="com.norconex.collector.http.sitemap.impl.StandardSitemapResolverFactory" ignore="true" />

      <startURLs>
        <url>http://www.mascus.com/agriculture/used-other-tractor-accessories/%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0-%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0/5pen7jcp.html</url>
      </startURLs>
    </crawler>
  </crawlers>
</httpcollector>

Result

$ ./collector-http.sh -a start -c test.xml
DEBUG [CollectorConfigLoader] Loading configuration file: test.xml
DEBUG [CrawlerConfigLoader] Crawler configuration loaded: test-crawler
INFO  [AbstractCollectorConfig] Configuration loaded: id=test-collector; logsDir=./logs; progressDir=./progress
INFO  [JobSuite] JEF work directory is: ./progress
INFO  [JobSuite] JEF log manager is : FileLogManager
INFO  [JobSuite] JEF job status store is : FileJobStatusStore
INFO  [AbstractCollector] Suite of 1 crawler jobs created.
INFO  [JobSuite] Initialization...
INFO  [JobSuite] No previous execution detected.
DEBUG [FileLogManager] Log directory: /home/niels/Projects/UMF-crawler/vendor/test/./logs
INFO  [JobSuite] Starting execution.
INFO  [AbstractCollector] Version: Norconex HTTP Collector 2.4.0-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Collector Core 1.3.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Importer 2.4.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex JEF 4.0.7 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Committer Core 2.0.3 (Norconex Inc.)
INFO  [JobSuite] Running test-crawler: BEGIN (Wed Dec 09 11:53:00 CET 2015)
INFO  [MapDBCrawlDataStore] Initializing reference store ./work/crawlstore/mapdb/test-crawler/
DEBUG [MapDBCrawlDataStore] ./work/crawlstore/mapdb/test-crawler/: New databases created.
INFO  [MapDBCrawlDataStore] ./work/crawlstore/mapdb/test-crawler/: Done initializing databases.
INFO  [HttpCrawler] test-crawler: RobotsTxt support: true
INFO  [HttpCrawler] test-crawler: RobotsMeta support: true
INFO  [HttpCrawler] test-crawler: Sitemap support: false
INFO  [HttpCrawler] test-crawler: Canonical links support: true
INFO  [HttpCrawler] test-crawler: User-Agent: <None specified>
DEBUG [HttpCrawler] It is recommended you identify yourself to web sites by specifying a user agent (https://en.wikipedia.org/wiki/User_agent)
INFO  [SitemapStore] test-crawler: Initializing sitemap store...
DEBUG [SitemapStore] test-crawler: Sitemap store created.
INFO  [SitemapStore] test-crawler: Done initializing sitemap store.
DEBUG [StandardRobotsTxtProvider$RobotData] Add filter from robots.txt: Robots.txt (Disallow:/*addtofav$)
DEBUG [StandardRobotsTxtProvider$RobotData] Add filter from robots.txt: Robots.txt (Disallow:/errorlog/)
DEBUG [StandardRobotsTxtProvider$RobotData] Add filter from robots.txt: Robots.txt (Disallow:/Mail/)
DEBUG [StandardRobotsTxtProvider$RobotData] Add filter from robots.txt: Robots.txt (Disallow:/WebResource.axd)
DEBUG [StandardRobotsTxtProvider$RobotData] Add filter from robots.txt: Robots.txt (Disallow:/+/categorypath)
DEBUG [StandardRobotsTxtProvider$RobotData] Add filter from robots.txt: Robots.txt (Disallow:/%23all)
DEBUG [StandardRobotsTxtProvider$RobotData] Add filter from robots.txt: Robots.txt (Disallow:/*.pdf)
DEBUG [StandardRobotsTxtProvider$RobotData] Add filter from robots.txt: Robots.txt (Disallow:/53597944)
DEBUG [StandardRobotsTxtProvider$RobotData] Add filter from robots.txt: Robots.txt (Disallow:/20866434)
DEBUG [StandardRobotsTxtProvider$RobotData] Add filter from robots.txt: Robots.txt (Disallow:/calculator)
DEBUG [StandardRobotsTxtProvider$RobotData] Add filter from robots.txt: Robots.txt (Disallow:*ctl=)
DEBUG [StandardRobotsTxtProvider$RobotData] Add filter from robots.txt: Robots.txt (Disallow:*?prodid=)
DEBUG [StandardRobotsTxtProvider$RobotData] Add filter from robots.txt: Robots.txt (Disallow:*ctgn=)
DEBUG [StandardRobotsTxtProvider$RobotData] Add filter from robots.txt: Robots.txt (Disallow:/*mascusproduct)
DEBUG [StandardRobotsTxtProvider$RobotData] Add filter from robots.txt: Robots.txt (Disallow:/productCard3print.aspx)
DEBUG [StandardRobotsTxtProvider$RobotData] Add filter from robots.txt: Robots.txt (Disallow:/tellafriend.aspx)
DEBUG [StandardRobotsTxtProvider] Fetched and parsed robots.txt: http://www.mascus.com/robots.txt
DEBUG [ReferenceFiltersStageUtil] ACCEPTED document reference (robots.txt). Reference=http://www.mascus.com/agriculture/used-other-tractor-accessories/%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0-%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0/5pen7jcp.html Filter=Robots.txt (Disallow:/*addtofav$)
DEBUG [ReferenceFiltersStageUtil] ACCEPTED document reference (robots.txt). Reference=http://www.mascus.com/agriculture/used-other-tractor-accessories/%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0-%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0/5pen7jcp.html Filter=Robots.txt (Disallow:/errorlog/)
DEBUG [ReferenceFiltersStageUtil] ACCEPTED document reference (robots.txt). Reference=http://www.mascus.com/agriculture/used-other-tractor-accessories/%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0-%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0/5pen7jcp.html Filter=Robots.txt (Disallow:/Mail/)
DEBUG [ReferenceFiltersStageUtil] ACCEPTED document reference (robots.txt). Reference=http://www.mascus.com/agriculture/used-other-tractor-accessories/%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0-%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0/5pen7jcp.html Filter=Robots.txt (Disallow:/WebResource.axd)
DEBUG [ReferenceFiltersStageUtil] ACCEPTED document reference (robots.txt). Reference=http://www.mascus.com/agriculture/used-other-tractor-accessories/%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0-%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0/5pen7jcp.html Filter=Robots.txt (Disallow:/+/categorypath)
DEBUG [ReferenceFiltersStageUtil] ACCEPTED document reference (robots.txt). Reference=http://www.mascus.com/agriculture/used-other-tractor-accessories/%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0-%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0/5pen7jcp.html Filter=Robots.txt (Disallow:/%23all)
DEBUG [ReferenceFiltersStageUtil] ACCEPTED document reference (robots.txt). Reference=http://www.mascus.com/agriculture/used-other-tractor-accessories/%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0-%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0/5pen7jcp.html Filter=Robots.txt (Disallow:/*.pdf)
DEBUG [ReferenceFiltersStageUtil] ACCEPTED document reference (robots.txt). Reference=http://www.mascus.com/agriculture/used-other-tractor-accessories/%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0-%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0/5pen7jcp.html Filter=Robots.txt (Disallow:/53597944)
DEBUG [ReferenceFiltersStageUtil] ACCEPTED document reference (robots.txt). Reference=http://www.mascus.com/agriculture/used-other-tractor-accessories/%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0-%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0/5pen7jcp.html Filter=Robots.txt (Disallow:/20866434)
DEBUG [ReferenceFiltersStageUtil] ACCEPTED document reference (robots.txt). Reference=http://www.mascus.com/agriculture/used-other-tractor-accessories/%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0-%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0/5pen7jcp.html Filter=Robots.txt (Disallow:/calculator)
DEBUG [ReferenceFiltersStageUtil] ACCEPTED document reference (robots.txt). Reference=http://www.mascus.com/agriculture/used-other-tractor-accessories/%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0-%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0/5pen7jcp.html Filter=Robots.txt (Disallow:*ctl=)
DEBUG [ReferenceFiltersStageUtil] ACCEPTED document reference (robots.txt). Reference=http://www.mascus.com/agriculture/used-other-tractor-accessories/%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0-%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0/5pen7jcp.html Filter=Robots.txt (Disallow:*?prodid=)
DEBUG [ReferenceFiltersStageUtil] ACCEPTED document reference (robots.txt). Reference=http://www.mascus.com/agriculture/used-other-tractor-accessories/%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0-%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0/5pen7jcp.html Filter=Robots.txt (Disallow:*ctgn=)
DEBUG [ReferenceFiltersStageUtil] ACCEPTED document reference (robots.txt). Reference=http://www.mascus.com/agriculture/used-other-tractor-accessories/%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0-%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0/5pen7jcp.html Filter=Robots.txt (Disallow:/*mascusproduct)
DEBUG [ReferenceFiltersStageUtil] ACCEPTED document reference (robots.txt). Reference=http://www.mascus.com/agriculture/used-other-tractor-accessories/%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0-%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0/5pen7jcp.html Filter=Robots.txt (Disallow:/productCard3print.aspx)
DEBUG [ReferenceFiltersStageUtil] ACCEPTED document reference (robots.txt). Reference=http://www.mascus.com/agriculture/used-other-tractor-accessories/%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0-%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0/5pen7jcp.html Filter=Robots.txt (Disallow:/tellafriend.aspx)
DEBUG [QueueReferenceStage] Queued for processing: http://www.mascus.com/agriculture/used-other-tractor-accessories/%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0-%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0/5pen7jcp.html
INFO  [HttpCrawler] 1 start URLs identified.
INFO  [CrawlerEventManager]           CRAWLER_STARTED (Subject: com.norconex.collector.http.crawler.HttpCrawler@a851e2d)
INFO  [AbstractCrawler] test-crawler: Crawling references...
DEBUG [AbstractCrawler] test-crawler: Crawler thread #1 started.
DEBUG [AbstractCrawler] test-crawler: Crawler thread #2 started.
DEBUG [AbstractCrawler] test-crawler: Processing reference: http://www.mascus.com/agriculture/used-other-tractor-accessories/%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0-%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0/5pen7jcp.html
DEBUG [GenericDocumentFetcher] Fetching document: http://www.mascus.com/agriculture/used-other-tractor-accessories/%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0-%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0/5pen7jcp.html
DEBUG [GenericDocumentFetcher] Encoded URI: http://www.mascus.com/agriculture/used-other-tractor-accessories/%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0-%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0/5pen7jcp.html
DEBUG [HttpCrawlerRedirectStrategy] URL redirect: http://www.mascus.com/agriculture/used-other-tractor-accessories/%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0-%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0/5pen7jcp.html -> http://www.mascus.com/agriculture/used-other-tractor-accessories/other-гÑдÑавлÑка-ÑпеÑÑеÑнÑка/5pen7jcp.html
DEBUG [GenericDocumentFetcher] Unsupported HTTP Response: HTTP/1.1 301 Moved Permanently
INFO  [CrawlerEventManager]       REJECTED_REDIRECTED: http://www.mascus.com/agriculture/used-other-tractor-accessories/%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0-%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0/5pen7jcp.html (Subject: HttpFetchResponse [crawlState=REDIRECT, statusCode=301, reasonPhrase=Moved Permanently (http://www.mascus.com/agriculture/used-other-tractor-accessories/other-гÑдÑавлÑка-ÑпеÑÑеÑнÑка/5pen7jcp.html)])
WARN  [StandardRobotsTxtProvider] Not able to obtain robots.txt at: http://www.mascus.comнÑка/5pen7jcp.html/robots.txt
java.lang.IllegalArgumentException: Illegal character in authority at index 7: http://www.mascus.comнÑка/5pen7jcp.html/robots.txt
        at java.net.URI.create(URI.java:859)
        at org.apache.http.client.methods.HttpGet.<init>(HttpGet.java:69)
        at com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider.getRobotsTxt(StandardRobotsTxtProvider.java:82)
        at com.norconex.collector.http.pipeline.queue.HttpQueuePipeline.getRobotsTxt(HttpQueuePipeline.java:98)
        at com.norconex.collector.http.pipeline.queue.HttpQueuePipeline.access$500(HttpQueuePipeline.java:39)
        at com.norconex.collector.http.pipeline.queue.HttpQueuePipeline$RobotsTxtFiltersStage.executeStage(HttpQueuePipeline.java:82)
        at com.norconex.collector.http.pipeline.queue.AbstractQueueStage.execute(AbstractQueueStage.java:31)
        at com.norconex.collector.http.pipeline.queue.AbstractQueueStage.execute(AbstractQueueStage.java:24)
        at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
        at com.norconex.collector.http.pipeline.importer.DocumentFetcherStage.queueRedirectURL(DocumentFetcherStage.java:124)
        at com.norconex.collector.http.pipeline.importer.DocumentFetcherStage.executeStage(DocumentFetcherStage.java:59)
        at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
        at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
        at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
        at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:297)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:487)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:377)
        at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:723)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.URISyntaxException: Illegal character in authority at index 7: http://www.mascus.comнÑка/5pen7jcp.html/robots.txt
        at java.net.URI$Parser.fail(URI.java:2829)
        at java.net.URI$Parser.parseAuthority(URI.java:3167)
        at java.net.URI$Parser.parseHierarchical(URI.java:3078)
        at java.net.URI$Parser.parse(URI.java:3034)
        at java.net.URI.<init>(URI.java:595)
        at java.net.URI.create(URI.java:857)
        ... 20 more
DEBUG [QueueReferenceStage] Queued for processing: http://www.mascus.com/agriculture/used-other-tractor-accessories/other-%C3%90%C2%B3%C3%91%C2%96%C3%90%C2%B4%C3%91%C2%80%C3%90%C2%B0%C3%90%C2%B2%C3%90%C2%BB%C3%91%C2%96%C3%90%C2%BA%C3%90%C2%B0-%C3%91%C2%81%C3%90%C2%BF%C3%90%C2%B5%C3%91%C2%86%C3%91%C2%82%C3%90%C2%B5%C3%91%C2%85%C3%90%C2%BD%C3%91%C2%96%C3%90%C2%BA%C3%90%C2%B0/5pen7jcp.html
DEBUG [Pipeline] Pipeline execution stopped at stage: com.norconex.collector.http.pipeline.importer.DocumentFetcherStage@4f63bb7b
DEBUG [AbstractCrawler] test-crawler: Processing reference: http://www.mascus.com/agriculture/used-other-tractor-accessories/other-%C3%90%C2%B3%C3%91%C2%96%C3%90%C2%B4%C3%91%C2%80%C3%90%C2%B0%C3%90%C2%B2%C3%90%C2%BB%C3%91%C2%96%C3%90%C2%BA%C3%90%C2%B0-%C3%91%C2%81%C3%90%C2%BF%C3%90%C2%B5%C3%91%C2%86%C3%91%C2%82%C3%90%C2%B5%C3%91%C2%85%C3%90%C2%BD%C3%91%C2%96%C3%90%C2%BA%C3%90%C2%B0/5pen7jcp.html
DEBUG [AbstractDelay] Thread pool-1-thread-2 sleeping for 2.79111768E12 seconds.
DEBUG [FileJobStatusStore] Status serialization directory: /home/niels/Projects/UMF-crawler/vendor/test/./progress
DEBUG [FileJobStatusStore] Created status file: /home/niels/Projects/UMF-crawler/vendor/test/./progress/latest/status/test-crawler__test-crawler.job
DEBUG [FileJobStatusStore] Writing status file: /home/niels/Projects/UMF-crawler/vendor/test/./progress/latest/status/test-crawler__test-crawler.job
DEBUG [FileJobStatusStore] Writing status file: /home/niels/Projects/UMF-crawler/vendor/test/./progress/latest/status/test-crawler__test-crawler.job
DEBUG [AbstractCrawler] test-crawler: 00:00:00.329 to process: http://www.mascus.com/agriculture/used-other-tractor-accessories/%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0-%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0/5pen7jcp.html
DEBUG [GenericDocumentFetcher] Fetching document: http://www.mascus.com/agriculture/used-other-tractor-accessories/other-%C3%90%C2%B3%C3%91%C2%96%C3%90%C2%B4%C3%91%C2%80%C3%90%C2%B0%C3%90%C2%B2%C3%90%C2%BB%C3%91%C2%96%C3%90%C2%BA%C3%90%C2%B0-%C3%91%C2%81%C3%90%C2%BF%C3%90%C2%B5%C3%91%C2%86%C3%91%C2%82%C3%90%C2%B5%C3%91%C2%85%C3%90%C2%BD%C3%91%C2%96%C3%90%C2%BA%C3%90%C2%B0/5pen7jcp.html
DEBUG [GenericDocumentFetcher] Encoded URI: http://www.mascus.com/agriculture/used-other-tractor-accessories/other-%C3%90%C2%B3%C3%91%C2%96%C3%90%C2%B4%C3%91%C2%80%C3%90%C2%B0%C3%90%C2%B2%C3%90%C2%BB%C3%91%C2%96%C3%90%C2%BA%C3%90%C2%B0-%C3%91%C2%81%C3%90%C2%BF%C3%90%C2%B5%C3%91%C2%86%C3%91%C2%82%C3%90%C2%B5%C3%91%C2%85%C3%90%C2%BD%C3%91%C2%96%C3%90%C2%BA%C3%90%C2%B0/5pen7jcp.html
DEBUG [GenericDocumentFetcher] Unsupported HTTP Response: HTTP/1.1 400 Bad Request
INFO  [CrawlerEventManager]       REJECTED_BAD_STATUS: http://www.mascus.com/agriculture/used-other-tractor-accessories/other-%C3%90%C2%B3%C3%91%C2%96%C3%90%C2%B4%C3%91%C2%80%C3%90%C2%B0%C3%90%C2%B2%C3%90%C2%BB%C3%91%C2%96%C3%90%C2%BA%C3%90%C2%B0-%C3%91%C2%81%C3%90%C2%BF%C3%90%C2%B5%C3%91%C2%86%C3%91%C2%82%C3%90%C2%B5%C3%91%C2%85%C3%90%C2%BD%C3%91%C2%96%C3%90%C2%BA%C3%90%C2%B0/5pen7jcp.html (Subject: HttpFetchResponse [crawlState=BAD_STATUS, statusCode=400, reasonPhrase=Bad Request])
DEBUG [Pipeline] Pipeline execution stopped at stage: com.norconex.collector.http.pipeline.importer.DocumentFetcherStage@289278d5
DEBUG [FileJobStatusStore] Writing status file: /home/niels/Projects/UMF-crawler/vendor/test/./progress/latest/status/test-crawler__test-crawler.job
DEBUG [FileJobStatusStore] Writing status file: /home/niels/Projects/UMF-crawler/vendor/test/./progress/latest/status/test-crawler__test-crawler.job
DEBUG [AbstractCrawler] test-crawler: 00:00:02.956 to process: http://www.mascus.com/agriculture/used-other-tractor-accessories/other-%C3%90%C2%B3%C3%91%C2%96%C3%90%C2%B4%C3%91%C2%80%C3%90%C2%B0%C3%90%C2%B2%C3%90%C2%BB%C3%91%C2%96%C3%90%C2%BA%C3%90%C2%B0-%C3%91%C2%81%C3%90%C2%BF%C3%90%C2%B5%C3%91%C2%86%C3%91%C2%82%C3%90%C2%B5%C3%91%C2%85%C3%90%C2%BD%C3%91%C2%96%C3%90%C2%BA%C3%90%C2%B0/5pen7jcp.html
INFO  [AbstractCrawler] test-crawler: Re-processing orphan references (if any)...
DEBUG [AbstractCrawler] test-crawler: Crawler thread #1 started.
DEBUG [AbstractCrawler] test-crawler: Crawler thread #2 started.
INFO  [AbstractCrawler] test-crawler: Reprocessed 0 orphan references...
INFO  [AbstractCrawler] test-crawler: 2 reference(s) processed.
DEBUG [AbstractCrawler] test-crawler: Removing empty directories
INFO  [CrawlerEventManager]          CRAWLER_FINISHED (Subject: com.norconex.collector.http.crawler.HttpCrawler@a851e2d)
INFO  [AbstractCrawler] test-crawler: Crawler completed.
INFO  [AbstractCrawler] test-crawler: Crawler executed in 5 seconds.
INFO  [MapDBCrawlDataStore] Closing reference store: ./work/crawlstore/mapdb/test-crawler/
DEBUG [FileJobStatusStore] Writing status file: /home/niels/Projects/UMF-crawler/vendor/test/./progress/latest/status/test-crawler__test-crawler.job
INFO  [JobSuite] Running test-crawler: END (Wed Dec 09 11:53:00 CET 2015)

Note that the crawler detects the redirect as http://www.mascus.com/agriculture/used-other-tractor-accessories/other-гÑдÑавлÑка-ÑпеÑÑеÑнÑка/5pen7jcp.html when it should be http://www.mascus.com/agriculture/used-other-tractor-accessories/other-гідравліка-спецтехніка/5pen7jcp.html and then further tries to access the robots.txt at http://www.mascus.comнÑка/5pen7jcp.html/robots.txt which is an invalid hostname resulting in an exception.

essiembre commented 8 years ago

That's the most concise use case I have seen so far. :-)

It seems to appear when fetching robots.txt only. Will investigate.

essiembre commented 8 years ago

After some research, I found the problem is with the server not properly encoding redirect URLs. The best explanation summary I found is here: http://stackoverflow.com/a/7654605/3974380

RFC 2616 specifies that the Location header should contain a URI as defined by RFC 1630, which requires a URI be 7-bit clean ASCII with any special characters URL encoded.

In other words, the server is delivering the URI incorrectly and should be escaping it.

After analyzing the "Location:" header in the HTTP response, I can confirm the redirect URL is not encoded properly. You should contact the site owner about this.

I am not sure how a workaround could be implemented other than forcing the HTTP "Location" header to be read using a specific charset, or trying to auto-detect it. That could be a risky proposition given that most sites probably respect the standard. In this specific case, I can read the URL properly if I force it to use ISO-8859-1 (UTF-8 does not work).
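To illustrate the charset mismatch described above, here is a minimal Java sketch. It rests on an assumption that is not confirmed in this thread: that the HTTP client turns each raw header byte into one character (which is what ISO-8859-1 decoding does) while the server actually sent UTF-8 bytes. The class and method names are made up for the illustration. Under that assumption, the first Cyrillic letter of the slug, г (UTF-8 bytes D0 B3), ends up double-encoded as %C3%90%C2%B3, which matches the queued URL in the log above, and converting the misread string back to ISO-8859-1 bytes recovers the original.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Norconex or Apache HttpClient.
final class LocationMojibakeDemo {

    // Assumption: the client maps every raw header byte to one char (ISO-8859-1 style).
    static String readAsLatin1(byte[] rawHeaderBytes) {
        return new String(rawHeaderBytes, StandardCharsets.ISO_8859_1);
    }

    // Undo the misreading: recover the raw bytes, then decode them as UTF-8.
    static String repair(String misread) {
        byte[] rawBytes = misread.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    // Percent-encode every UTF-8 byte of the given string (enough for this demo).
    static String percentEncodeUtf8(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%%%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "\u0433"; // CYRILLIC SMALL LETTER GHE
        byte[] onTheWire = original.getBytes(StandardCharsets.UTF_8); // D0 B3
        String misread = readAsLatin1(onTheWire);

        System.out.println(percentEncodeUtf8(misread));         // prints %C3%90%C2%B3
        System.out.println(percentEncodeUtf8(repair(misread))); // prints %D0%B3
    }
}
```

If the assumption holds, it would also explain why forcing ISO-8859-1 helps: that charset maps characters back to bytes one-to-one, so the original bytes survive the round trip.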

niels commented 8 years ago

Thanks for the investigation; very interesting.

I agree that non-ASCII characters shouldn't be present in HTTP headers (unless they are properly encoded / escaped). In practice, however, the standard unfortunately seems to be violated quite frequently, as is so often the case on the web. The site referenced here is obviously a major case in point, but this Google search suggests that globally this isn't as rare a problem as one might hope. I also unearthed many bug reports for both server- and client-side software components that lamented seeming mishandling of non-ASCII redirects, further confirming that the problem is encountered fairly frequently.

For compatibility reasons, browsers seem to be more relaxed than the RFCs would demand. At least Firefox and Chrome seem to follow the redirect "correctly" (meaning: as the site author intended). E.g. if I go to http://www.mascus.com/agriculture/used-other-tractor-accessories/other/5pen7jcp.html in either browser, I get redirected to http://www.mascus.com/agriculture/used-other-tractor-accessories/other-гідравліка-спецтехніка/5pen7jcp.html even though the Location header is not properly encoded.

For Firefox, https://bugzilla.mozilla.org/show_bug.cgi?id=1142083 details the fix while https://bugzilla.mozilla.org/show_bug.cgi?id=439616 gives some more information about the use-case.

Generally speaking, I would prefer for a crawler to behave as similarly to real-world browsers as possible. This is because site authors generally target the latter and not the former. If I can access a site with my web browser, I would expect the crawler to be able to access that same page (and parse it in the same manner).

At the same time, however, development resources here are of course much more limited than for the major browsers. Thus we cannot come up with an implementation that will work as "expected" in all cases. From a philosophical standpoint as well, I would normally be opposed to programming special / edge cases into general-purpose software such as this crawler.

Nevertheless, choking on – what appears to be – a somewhat common encoding of redirects seems to be a not insignificant flaw. Thus, I would like to propose the following implementation which I think strikes a good balance between compatibility and complexity:

  1. Does the header contain only ASCII characters (i.e. only code points <= 127)?
    i. Yes: Done.
    ii. No: Does the response specify an encoding in the Content-Type header?
      a. Yes: Treat the header from (1) as being encoded in that charset.
      b. No: Treat the header from (1) as being encoded in the platform default charset (or hardcode either ISO-8859-1, ISO-8859-15, or UTF-8 as the default).

An interesting alternative to (1.ii.b) would be to fall back to a per-crawler default (if configured) instead. This is a feature that you have suggested in https://github.com/Norconex/collector-http/issues/194#issuecomment-162599045 and which I would find very useful.

This logic could be applied to all HTTP headers, not just Location.
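The decision procedure proposed above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: it presumes access to the raw header bytes (which an HTTP client library may not expose), the class and method names are hypothetical, and it is not how the crawler implements anything. The `fallback` parameter stands in for either the platform default or a per-crawler default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the proposed header-decoding logic.
final class HeaderCharsetSketch {

    // Matches the charset parameter of a Content-Type value, quoted or not.
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]+)", Pattern.CASE_INSENSITIVE);

    // Step 1: true if every byte is in the 7-bit ASCII range (0..127).
    static boolean isAscii(byte[] raw) {
        for (byte b : raw) {
            if (b < 0) { // bytes >= 0x80 are negative in Java
                return false;
            }
        }
        return true;
    }

    // Step 1.ii: charset declared in Content-Type, or null if absent or unknown.
    static Charset charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) { // illegal or unsupported charset name
            return null;
        }
    }

    // Full procedure: ASCII as-is, else the declared charset, else the fallback.
    static String decodeHeader(byte[] raw, String contentType, Charset fallback) {
        if (isAscii(raw)) {
            return new String(raw, StandardCharsets.US_ASCII);
        }
        Charset declared = charsetFromContentType(contentType);
        return new String(raw, declared != null ? declared : fallback);
    }
}
```

For the response shown at the top of this issue, Content-Type is "text/html" with no charset parameter, so this sketch would take the fallback branch.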

While the logic sounds simple, I can't estimate the implementation effort as I am not yet sufficiently familiar with the codebase. Please feel free to close as WONTFIX if it would be a major hassle.

essiembre commented 8 years ago

Thanks for your research and suggestions! I agree that standards are often not respected. What is important is that we cover the standards first, but let's not limit ourselves to that; let's also try to support what's in the real world. What you are proposing makes a lot of sense and I now plan to implement that (or something very similar).

essiembre commented 8 years ago

I have added a new configuration option in the latest snapshot. There is now a new redirectURLProvider tag which allows custom implementations. The default implementation is GenericRedirectURLProvider and applies the logic you proposed, slightly modified. Please try the following, which should solve your case:

<redirectURLProvider 
      class="com.norconex.collector.http.redirect.impl.GenericRedirectURLProvider"
      fallbackCharset="ISO-8859-1" />

niels commented 8 years ago

This is perfect! The latest snapshot follows all redirects "correctly" when an appropriate fallbackCharset has been set.

Thanks a lot for your diligence on this.