amoilanen / js-crawler

Web crawler for Node.JS
MIT License

getting unknown encoding error on some pages #33

Closed stamanuel closed 7 years ago

stamanuel commented 7 years ago

When crawling various URLs, I came across this site: http://www.sanssouci-wien.com/

One of its pages seems to throw the following error:

buffer.js:497
          throw new TypeError('Unknown encoding: ' + encoding);
          ^

TypeError: Unknown encoding: none
    at Buffer.slowToString (buffer.js:497:17)
    at Buffer.toString (buffer.js:510:27)
    at Crawler._getDecodedBody (/node_modules/js-crawler/crawler.js:267:24)
    at /node_modules/js-crawler/crawler.js:221:37
    at Request._callback (/node_modules/js-crawler/crawler.js:183:7)
    at Request.self.callback (/node_modules/js-crawler/node_modules/request/request.js:368:22)
    at emitTwo (events.js:106:13)
    at Request.emit (events.js:191:7)
    at Request.<anonymous> (/node_modules/js-crawler/node_modules/request/request.js:1219:14)
    at emitOne (events.js:101:20)
    at Request.emit (events.js:188:7)
    at IncomingMessage.<anonymous> (/node_modules/js-crawler/node_modules/request/request.js:1167:12)
    at emitNone (events.js:91:20)
    at IncomingMessage.emit (events.js:185:7)
    at endReadableNT (_stream_readable.js:974:12)
    at _combinedTickCallback (internal/process/next_tick.js:74:11)
    at process._tickCallback (internal/process/next_tick.js:98:9)

I haven't figured out yet which page exactly is to blame. Here is my crawler configuration:

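new jsCrawler().configure({
  depth: 3,
  maxRequestsPerSecond: 10,
  maxConcurrentRequests: 5,
  // Only follow URLs that contain the start URL's host part.
  // `starturl` is defined elsewhere (not shown in this snippet).
  shouldCrawl: function (url) {
    let simplifiedUrl = starturl.substring(starturl.indexOf('//') + 2).replace('www.', '');
    return url.includes(simplifiedUrl);
  }
})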

amoilanen commented 7 years ago

It looks like there is a broken link, http://www.sanssouci-wien.com/sitex/index.php/page.100, for which the server returns an invalid value in the Content-Encoding header:

Content-Encoding: none
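
The failure can be reproduced in isolation with a minimal sketch. It assumes, as the stack trace suggests (Crawler._getDecodedBody calling Buffer.toString), that the header value ends up being passed to Buffer.toString as the character encoding. The variable names here are illustrative and are not taken from crawler.js:

```javascript
// Illustrative reproduction, not code from js-crawler.
const body = Buffer.from('<html></html>', 'utf8');
const headerValue = 'none'; // value of the Content-Encoding header sent by the server

let caught = null;
try {
  // 'none' is not a character encoding Node knows, so toString throws.
  body.toString(headerValue);
} catch (err) {
  caught = err;
}

console.log(caught instanceof TypeError); // the error is a TypeError
console.log(caught.message);              // message mentions the unknown encoding
```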

However, the crawler should handle such situations without failing. When the specified encoding is not recognized, we will try to read the page content using the default encoding, 'utf-8'.
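
One way such a fallback could look is sketched below. This is not the actual change made in js-crawler; `decodeBody` and its parameters are hypothetical names, and the only library call relied on is Node's `Buffer.isEncoding`, which reports whether a string names an encoding that `Buffer` can decode:

```javascript
// Hypothetical helper illustrating the fallback idea; not the js-crawler fix itself.
function decodeBody(body, declaredEncoding) {
  const defaultEncoding = 'utf8';
  // Use the declared encoding only if Node recognizes it; otherwise fall back.
  const encoding = Buffer.isEncoding(declaredEncoding)
    ? declaredEncoding
    : defaultEncoding;
  return body.toString(encoding);
}

// Usage: an unrecognized value such as 'none' no longer throws.
const page = Buffer.from('<p>Grüß Gott</p>', 'utf8');
console.log(decodeBody(page, 'none'));    // decoded as UTF-8
console.log(decodeBody(page, undefined)); // missing header, also decoded as UTF-8
```

Checking the encoding up front avoids relying on exceptions for control flow; wrapping the `toString` call in `try`/`catch` and retrying with the default encoding would be an equally reasonable alternative.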

I will add a unit test and fix this issue. Thank you for reporting it.

amoilanen commented 7 years ago

Fixed the issue; the fix is published to npm in version 0.3.15.