Looks like there is a broken link, http://www.sanssouci-wien.com/sitex/index.php/page.100, for which the server returns an invalid value in the Content-Encoding header:
Content-Encoding: none
However, the crawler should still handle such situations and not fail. When the specified encoding is invalid, we will try to read the page content using the default encoding: 'utf-8'.
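For context, here is a minimal sketch of that fallback in Node.js; the function name, argument shape, and header handling are illustrative assumptions, not the actual js-crawler source:

```js
// Sketch of the fallback described above (assumed, not the real implementation).
// If the encoding advertised by the server is not one Node's Buffer can decode,
// fall back to 'utf-8' instead of failing.
function decodeBody(bodyBuffer, declaredEncoding) {
  const defaultEncoding = 'utf-8';
  const encoding = Buffer.isEncoding(declaredEncoding) ? declaredEncoding : defaultEncoding;
  return bodyBuffer.toString(encoding);
}

// The server above sends "Content-Encoding: none", which Buffer does not
// recognize, so the body would be decoded as utf-8:
// decodeBody(responseBodyBuffer, 'none');
```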
I will add a unit test and fix this issue. Thank you for reporting it.
Fixed the issue and published it to npm as version 0.3.15.
When parsing various URLs, I came across this link: http://www.sanssouci-wien.com/
On one of its pages, the crawler seems to throw the following error:
I couldn't figure out yet exactly which page is to blame. Here is my crawler configuration:
```js
new jsCrawler().configure({
  depth: 3,
  maxRequestsPerSecond: 10,
  maxConcurrentRequests: 5,
  shouldCrawl: function (url) {
    let simplifiedUrl = starturl.substring(starturl.indexOf('//') + 2).replace('www.', '');
    return url.includes(simplifiedUrl);
  }
})
```