amoilanen / js-crawler

Web crawler for Node.JS
MIT License
253 stars 55 forks

bug empty response #31

Closed bymaximus closed 7 years ago

bymaximus commented 8 years ago

```
  return response.headers && response.headers['content-type']
                 ^

TypeError: Cannot read property 'headers' of undefined
    at Crawler._isTextContent (/root/test/node_modules/js-crawler/crawler.js:257:18)
    at /root/test/node_modules/js-crawler/crawler.js:220:30
    at Request._callback (/root/test/node_modules/js-crawler/crawler.js:183:7)
    at self.callback (/root/test/node_modules/request/request.js:186:22)
    at emitOne (events.js:77:13)
    at Request.emit (events.js:169:7)
    at Request.init (/root/test/node_modules/request/request.js:274:17)
    at new Request (/root/test/node_modules/request/request.js:128:8)
    at Crawler.request (/root/test/node_modules/request/index.js:54:10)
    at /root/test/node_modules/js-crawler/crawler.js:181:10
```
amoilanen commented 8 years ago

Will look at this case and try to reproduce.

stamanuel commented 7 years ago

I am also having this issue; it happens every time I run this code:

```javascript
let crawler = new Crawler().configure({ maxRequestsPerSecond: 10, maxConcurrentRequests: 5 });

crawler.crawl({
    url: 'http://downunder.at/',
    success: function (page) {
        console.log(page.url);
    },
    failure: function (page) {
        console.log(page.status);
    }
});
```

I get this error:

```
http://downunder.at/
http://downunder.at/start/beer-drinks/
http://downunder.at/partykeller-snakepit-vienna/
http://downunder.at/events/
http://downunder.at/partylocations-for-your-party/
http://downunder.at/unser-gastgarten/
http://downunder.at/next-football-event/
http://downunder.at/wine-lounge/
http://downunder.at/newsite/events/
/node_modules/js-crawler/crawler.js:257
  return response.headers && response.headers['content-type']
                 ^

TypeError: Cannot read property 'headers' of undefined
    at Crawler._isTextContent (/node_modules/js-crawler/crawler.js:257:18)
    at /node_modules/js-crawler/crawler.js:220:30
    at Request._callback (/node_modules/js-crawler/crawler.js:183:7)
    at self.callback (/node_modules/js-crawler/node_modules/request/request.js:368:22)
    at emitOne (events.js:96:13)
    at Request.emit (events.js:188:7)
    at Request.init (/node_modules/js-crawler/node_modules/request/request.js:640:17)
    at new Request (/node_modules/js-crawler/node_modules/request/request.js:272:8)
    at Crawler.request (/node_modules/js-crawler/node_modules/request/index.js:56:10)
    at /node_modules/js-crawler/crawler.js:181:10
```

amoilanen commented 7 years ago

Hi Manuel,

Thank you for additional details, I managed to reproduce the issue locally and will fix it shortly.

amoilanen commented 7 years ago

There are two bugs. I have a fix for both and will publish it to NPM once the corresponding unit tests have been added:

  1. When the response is null because an error occurred, the crawler still tries to access the response headers to determine the content encoding.
  2. The crawler tries to crawl 'mailto:' links, which is why it fails for http://downunder.at/.
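The fix for the first bug amounts to guarding against a missing response before touching its headers. This is only a minimal sketch of that guard, not the library's actual code; the function name `isTextContent` mirrors the `_isTextContent` frame in the stack trace above, and the `text/html` check is an assumption:

```javascript
// Hypothetical sketch of the guard for bug (1): when the request fails,
// `response` is undefined, so check it before reading headers.
function isTextContent(response) {
  return Boolean(response
    && response.headers
    && response.headers['content-type']
    && response.headers['content-type'].indexOf('text/html') === 0);
}

console.log(isTextContent(undefined));                                    // false
console.log(isTextContent({ headers: { 'content-type': 'text/html' } })); // true
```

With the guard in place, a failed request yields `false` instead of the `TypeError` reported above.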

So the following happens in the case of 'http://downunder.at/':

The crawler finds a mailto link, tries to crawl it, and gets an error response ('unsupported protocol'). But instead of reporting this error, it tries to read the content encoding of the null response. This is why such a cryptic error appears in the console.
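The second fix is to skip non-HTTP links before requesting them at all. A minimal sketch of such a filter, under the assumption that only `http://` and `https://` URLs should be followed (the function name `shouldCrawl` is hypothetical):

```javascript
// Hypothetical sketch for bug (2): only follow http(s) links, so that
// 'mailto:' (and other unsupported protocols) are never requested.
function shouldCrawl(url) {
  return /^https?:\/\//i.test(url);
}

console.log(shouldCrawl('http://downunder.at/'));     // true
console.log(shouldCrawl('mailto:info@example.com'));  // false
```

Filtering these links up front avoids the 'unsupported protocol' error entirely, rather than having to handle it after the request fails.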

Thank you for your patience and for reporting these issues. I will close this issue once the fixes have been published to NPM.

amoilanen commented 7 years ago

Published version 0.3.14 with the fixes included. Please re-open the issue if a similar problem is still reproducible with the latest version in some other scenario.