getting unknown encoding error on some pages

stamanuel commented 7 years ago

When parsing various URLs, I came across this link: http://www.sanssouci-wien.com/

One of its pages seems to throw the following error:

buffer.js:497
          throw new TypeError('Unknown encoding: ' + encoding);
          ^

TypeError: Unknown encoding: none
    at Buffer.slowToString (buffer.js:497:17)
    at Buffer.toString (buffer.js:510:27)
    at Crawler._getDecodedBody (/node_modules/js-crawler/crawler.js:267:24)
    at /node_modules/js-crawler/crawler.js:221:37
    at Request._callback (/node_modules/js-crawler/crawler.js:183:7)
    at Request.self.callback (/node_modules/js-crawler/node_modules/request/request.js:368:22)
    at emitTwo (events.js:106:13)
    at Request.emit (events.js:191:7)
    at Request.<anonymous> (/node_modules/js-crawler/node_modules/request/request.js:1219:14)
    at emitOne (events.js:101:20)
    at Request.emit (events.js:188:7)
    at IncomingMessage.<anonymous> (/node_modules/js-crawler/node_modules/request/request.js:1167:12)
    at emitNone (events.js:91:20)
    at IncomingMessage.emit (events.js:185:7)
    at endReadableNT (_stream_readable.js:974:12)
    at _combinedTickCallback (internal/process/next_tick.js:74:11)
    at process._tickCallback (internal/process/next_tick.js:98:9)

I couldn't figure out yet which page exactly is to blame. Here is my config for the crawler:

new jsCrawler().configure({
  depth: 3,
  maxRequestsPerSecond: 10,
  maxConcurrentRequests: 5,
  // only crawl urls that contain the start url (scheme and 'www.' stripped)
  shouldCrawl: function (url) {
    let simplifiedUrl = starturl.substring(starturl.indexOf('//') + 2).replace('www.', '');
    return url.includes(simplifiedUrl);
  }
})

amoilanen commented 7 years ago

Looks like there is a broken link, http://www.sanssouci-wien.com/sitex/index.php/page.100, for which the server returns an invalid value in the Content-Encoding header:

Content-Encoding: none

However, the crawler should still handle such situations and should not fail. When the specified encoding is invalid, we will try to read the page content using the default encoding, 'utf-8'.
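For illustration, the fallback described above could look roughly like the sketch below. This is not the actual js-crawler code: `decodeBody` and its parameters are hypothetical names, and the only assumption is that the response body is available as a Node.js `Buffer`. It relies on `Buffer.isEncoding` to check whether Node can decode the declared encoding before calling `toString`.

```javascript
// Hypothetical helper, not part of js-crawler: decode a response body,
// falling back to utf-8 when the declared encoding is not one Node.js knows.
const DEFAULT_ENCODING = 'utf8';

function decodeBody(bodyBuffer, declaredEncoding) {
  // Buffer.isEncoding returns false for unknown names such as 'none'
  // and for non-string values such as undefined.
  const encoding = Buffer.isEncoding(declaredEncoding)
    ? declaredEncoding
    : DEFAULT_ENCODING;
  return bodyBuffer.toString(encoding);
}

// Usage: a bogus encoding like 'none' no longer throws
const body = Buffer.from('<html>Grüße</html>', 'utf8');
console.log(decodeBody(body, 'none'));
```

Checking the encoding up front avoids the `TypeError: Unknown encoding` from the stack trace without needing a try/catch around `toString`.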

I will add a unit test and fix this issue. Thank you for reporting it.

amoilanen commented 7 years ago

Fixed the issue. The fix is published to npm in version 0.3.15.

amoilanen / js-crawler

getting unknown encoding error on some pages #33