ContentMine / quickscrape

A scraping command line tool for the modern web
MIT License

Page not rendering #11

Closed: blahah closed this issue 10 years ago

blahah commented 10 years ago

See this tweet.

When I run quickscrape on this Hindawi URL with the generic_open.json journal scraper, it stores null as the rendered page.

blahah commented 10 years ago

To test whether PhantomJS was failing, I wrote a simple test. This script should save a screenshot of the page as rendered by PhantomJS:

// Load the article page and save a screenshot of what PhantomJS renders.
var page = require('webpage').create();
page.open('http://www.hindawi.com/journals/jgr/2012/808729/', function() {
  page.render('hindawi.png');
  phantom.exit();
});

Then run it with PhantomJS:

phantomjs hindawi_screenshot.js

And we get a fully rendered page. It's not PhantomJS.

[image: hindawi]
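
A screenshot only shows that the page paints. Since the symptom here is null being stored as the rendered page, a complementary check is whether the load status and the page HTML look sane. This is a minimal sketch, not what quickscrape itself does: `checkRender` is a hypothetical helper name, while `page.open`, `page.content` and `phantom.exit` are standard PhantomJS calls. The PhantomJS part is guarded so the helper can also be exercised on its own.

```javascript
// Hypothetical helper: decide whether a rendered page looks usable.
// `status` is the string PhantomJS passes to the page.open callback
// ('success' or 'fail'); `html` is the value of page.content.
function checkRender(status, html) {
  if (status !== 'success') {
    return { ok: false, reason: 'load status was ' + status };
  }
  if (typeof html !== 'string' || html.replace(/\s+/g, '') === '') {
    return { ok: false, reason: 'rendered HTML is null or empty' };
  }
  return { ok: true, reason: 'rendered ' + html.length + ' characters' };
}

// Only runs inside PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  page.open('http://www.hindawi.com/journals/jgr/2012/808729/', function (status) {
    var result = checkRender(status, page.content);
    console.log((result.ok ? 'OK: ' : 'FAILED: ') + result.reason);
    // Non-zero exit code signals a failed or empty render.
    phantom.exit(result.ok ? 0 : 1);
  });
}
```

Saved under a hypothetical name such as check_render.js, it would be run the same way as the screenshot script (phantomjs check_render.js).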

blahah commented 10 years ago

Next, testing CasperJS:

var casper = require('casper').create({
  verbose: false,
  logLevel: "debug"
});
casper.start('http://www.hindawi.com/journals/jgr/2012/808729/', function() {
  this.capture('hindawi_casper.png');
});
casper.run();

casperjs casper_hindawi-screen.js --verbose
[info] [phantom] Starting...
[info] [phantom] Running suite: 2 steps
[debug] [phantom] opening url: http://www.hindawi.com/journals/jgr/2012/808729/, HTTP GET
[debug] [phantom] Navigation requested: url=http://www.hindawi.com/journals/jgr/2012/808729/, type=Other, willNavigate=true, isMainFrame=true
[debug] [phantom] url changed to "http://www.hindawi.com/journals/jgr/2012/808729/"
[debug] [phantom] Navigation requested: url=http://images.hindawi.com/logo/hindawi.svg, type=Other, willNavigate=true, isMainFrame=false
[debug] [phantom] Successfully injected Casper client-side utilities
[info] [phantom] Step anonymous 2/2 http://www.hindawi.com/journals/jgr/2012/808729/ (HTTP 200)
[debug] [phantom] Capturing page to /Users/rds45/code/quickscrape/hindawi_casper.png
[info] [phantom] Capture saved to /Users/rds45/code/quickscrape/hindawi_casper.png
[info] [phantom] Step anonymous 2/2: done in 12702ms.
[info] [phantom] Done 2 steps in 12702ms

And again, perfectly rendered. Not CasperJS.

[image: hindawi_casper]
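
For comparing runs, the CasperJS log above can be summarised mechanically. This is a rough sketch that assumes the `[level] [space] message` line layout seen in that output; `parseCasperLog` and `summarise` are hypothetical helper names, not part of CasperJS.

```javascript
// Parse CasperJS log lines of the form "[level] [space] message".
// Lines that do not match that layout are skipped.
function parseCasperLog(text) {
  var entries = [];
  var lines = text.split('\n');
  for (var i = 0; i < lines.length; i++) {
    var m = /^\[(\w+)\] \[(\w+)\] (.*)$/.exec(lines[i].trim());
    if (m) {
      entries.push({ level: m[1], space: m[2], message: m[3] });
    }
  }
  return entries;
}

// Pull out the HTTP status and total run time, if the log reports them.
function summarise(entries) {
  var summary = { httpStatus: null, totalMs: null };
  entries.forEach(function (e) {
    var status = /\(HTTP (\d{3})\)$/.exec(e.message);
    if (status) { summary.httpStatus = Number(status[1]); }
    var done = /^Done \d+ steps in (\d+)ms$/.exec(e.message);
    if (done) { summary.totalMs = Number(done[1]); }
  });
  return summary;
}

// Two lines copied from the log output in this comment.
var sample = [
  '[info] [phantom] Step anonymous 2/2 http://www.hindawi.com/journals/jgr/2012/808729/ (HTTP 200)',
  '[info] [phantom] Done 2 steps in 12702ms'
].join('\n');

console.log(summarise(parseCasperLog(sample)));
```

For the run shown here, the summary reports HTTP 200 and a total of 12702 ms, which matches the conclusion that CasperJS loaded and rendered the page.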