ContentMine / quickscrape

A scraping command line tool for the modern web
MIT License

Cannot run Quickscrape in headless mode #63

Open lanzer opened 8 years ago

lanzer commented 8 years ago

Running quickscrape with the -h or --headless option does not launch casper.

I noticed that the headless parameter is not being passed when adding the scraper through scraperbox:

quickscrape.js (211)

    scrapers.addScraper(program.scraper);

Also, scraperbox needs to pass the parameter on to the scraper, which is already set up to accept it:

scraperbox.js (48)

    ScraperBox.prototype.addScraper = function(def) {
      if (typeof(def) == 'string') {
        def = JSON.parse(fs.readFileSync(def, 'utf8'));
      }
      var scraper = new Scraper(def);
      if (scraper.valid) {
        this.scrapers.push(scraper);
        return true;
      } else {
        return false;
      }
    }

So I've made the following adjustments:

quickscrape.js (211)

    scrapers.addScraper(program.scraper, program.headless);

scraperbox.js (48)

    ScraperBox.prototype.addScraper = function(def, headless) {
      if (typeof(def) == 'string') {
        def = JSON.parse(fs.readFileSync(def, 'utf8'));
      }
      var scraper = new Scraper(def, headless);
      if (scraper.valid) {
        this.scrapers.push(scraper);
        return true;
      } else {
        return false;
      }
    }

I think the scraper checking routine also needs to have the parameter added:

quickscrape.js (139)

    var scraper = new Scraper(JSON.parse(definition), program.headless);
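
To illustrate the idea behind these changes in isolation, here is a minimal sketch of threading the headless flag from the caller through the box into each scraper. The `Scraper` and `ScraperBox` below are simplified stand-ins written for this example, not the real quickscrape classes, and the validity check on `def.url` is only a placeholder assumption.

```javascript
// Hypothetical stand-in for quickscrape's Scraper; the real class does much more.
function Scraper(def, headless) {
  this.def = def;
  // Coerce to a strict boolean so an omitted flag becomes false.
  this.headless = Boolean(headless);
  // Placeholder validity rule, assumed for this sketch only.
  this.valid = Boolean(def && def.url);
}

// Hypothetical stand-in for ScraperBox.
function ScraperBox() {
  this.scrapers = [];
}

// Forward the headless flag instead of dropping it on the way to Scraper.
ScraperBox.prototype.addScraper = function (def, headless) {
  var scraper = new Scraper(def, headless);
  if (!scraper.valid) {
    return false;
  }
  this.scrapers.push(scraper);
  return true;
};

var box = new ScraperBox();
var addedHeadless = box.addScraper({ url: '.*' }, true); // flag passed through
var addedDefault = box.addScraper({ url: '.*' });        // flag omitted
var addedInvalid = box.addScraper({}, true);             // rejected by the placeholder check
```

With the flag forwarded this way, a scraper added without it falls back to non-headless behaviour, so existing callers that pass only the definition keep working.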

Now I can see casperjs running. I also noticed that a 404-type response status causes quickscrape to halt. I will need to look into that.

blahah commented 8 years ago

Thanks for the report. Pull requests are very welcome with the changes you have made :)

lanzer commented 8 years ago

Done. Should I do the same for the "response status" bug also?

blahah commented 8 years ago

Thanks very much. If you are willing, yes, a PR fixing any bug is welcome :)