ContentMine / quickscrape

A scraping command line tool for the modern web
MIT License

No handler for status code != 200 #62

Open lanzer opened 9 years ago

lanzer commented 9 years ago

When a status code other than "200 OK" is received, the process halts. This can be caused by a "404 Not Found" or by a server-side problem such as exceeded bandwidth or a permission error. It's a problem for me because I am working with a large list of URLs, some of which are potentially outdated.

I noticed that the basic renderer does not handle any status code other than 200 (there is a headless renderer, but it isn't called even with the -h parameter):

basic.js (14)

  request(conf, function (error, response, body) {
    if (!error && response.statusCode == 200) {
      renderer.emit('renderer.urlRendered', url, body);
    } else if (error) {
      this.emit('error', error);
    } 
  });

Also scraper.js does not have a listener for abnormal status:

scraper.js (252)

  renderer.on('renderer.urlRendered', function(theUrl, html) {

I've added a few lines to make things work for me:

basic.js (14)

  request(conf, function (error, response, body) {
    if (!error && response.statusCode == 200) {
      renderer.emit('renderer.urlRendered', url, body);
    } else if (error) {
      this.emit('error', error);
    } else if (response.statusCode != 200) {
      renderer.emit('renderer.status', response.statusMessage);
    }
  });

scraper.js (252)

  renderer.on('renderer.status', function(message) {
    scraper.emit('urlRendered',message);
    scraper.ticker.tick();
  });
  renderer.on('renderer.urlRendered', function(theUrl, html) {

Quickscrape does not treat the result as an error and reports "0/0 elements captured (0 capture failed)", when it should report "0/2 elements" or whatever number is configured in the JSON. I haven't looked into how reporting is handled.
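
For readers who want to try the idea in isolation, the two changes above can be sketched together as a small standalone script. This is only an illustration under stated assumptions: handleResponse, the processed counter (standing in for scraper.ticker.tick()) and the example.org URLs are hypothetical and not part of quickscrape or thresher. Unlike the patch, the sketch also passes the URL and numeric status code along with the status message.

```javascript
const EventEmitter = require('events');

// Hypothetical helper: decides which event to emit for one finished request.
// `renderer` is any EventEmitter; the event names follow the patch above.
function handleResponse(renderer, url, error, response, body) {
  if (error) {
    // Network-level failure (DNS, timeout, refused connection, ...).
    renderer.emit('error', error);
  } else if (response.statusCode === 200) {
    renderer.emit('renderer.urlRendered', url, body);
  } else {
    // Any other status: report it so the caller can move on to the next URL.
    renderer.emit('renderer.status', url, response.statusCode, response.statusMessage);
  }
}

// Scraper side: every outcome is recorded and counted, so a batch of URLs
// keeps going instead of stalling on the first non-200 response.
const renderer = new EventEmitter();
const seen = [];
let processed = 0; // stands in for scraper.ticker.tick()

renderer.on('renderer.urlRendered', (url, body) => {
  seen.push(['rendered', url, body]);
  processed += 1;
});
renderer.on('renderer.status', (url, code, message) => {
  seen.push(['status', url, code, message]);
  processed += 1;
});
renderer.on('error', (err) => {
  seen.push(['error', err.message]);
  processed += 1;
});

// Simulated responses; no network access is needed.
handleResponse(renderer, 'http://example.org/ok', null,
  { statusCode: 200, statusMessage: 'OK' }, '<html></html>');
handleResponse(renderer, 'http://example.org/missing', null,
  { statusCode: 404, statusMessage: 'Not Found' }, '');
handleResponse(renderer, 'http://example.org/down',
  new Error('ECONNREFUSED'), undefined, undefined);
```

Keeping the decision in one function that takes the error and response as arguments makes it possible to exercise all three branches without making real HTTP requests.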

In the meantime, I noticed something in thresher.js:

thresher.js (75)

    if (keyscaptured = 0) {

That is an assignment, so the condition is always false; it should probably be a comparison operator (== or ===).
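
A minimal sketch of why this matters, using a hypothetical keysCaptured variable rather than the actual thresher.js code: the assignment evaluates to 0, which is falsy, so the branch never runs and the counter is overwritten as a side effect.

```javascript
// Hypothetical counter standing in for the variable in thresher.js.
let keysCaptured = 3;
let branchTaken = false;

// Assignment: the expression evaluates to 0 (falsy), so the branch never
// runs, and keysCaptured is overwritten as a side effect.
if (keysCaptured = 0) {
  branchTaken = true;
}
const afterAssignment = keysCaptured;

// Comparison: the check that was presumably intended.
keysCaptured = 0;
const isZero = keysCaptured === 0;
```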

Hope this helps!

blahah commented 9 years ago

lots of good stuff in here! thanks

lanzer commented 9 years ago

The fix was actually for thresher and not quickscrape. I pushed the changes and they seem to have been merged into my last pull request for another bug. I'm a total noob, so I might have gotten the procedure wrong. Please let me know if I need to make any changes on my end.

blahah commented 9 years ago

Thanks for this @lanzer and sorry for the slow reply - I've been away at various events. I will be incorporating these fixes in new releases in the next few days.

tarrow commented 8 years ago

I'm going to take over looking at this in the next few days; I also wrote a patch to fix this because I didn't realise there had been one in the pipeline for a while.