ContentMine / quickscrape

A scraping command line tool for the modern web
MIT License

Error Scraping BioMedCentral #10

Closed seesmith closed 10 years ago

seesmith commented 10 years ago

quickscrape --url http://www.biomedcentral.com/1471-2148/14/128/abstract --scraper journal-scrapers/peerj.json --output ./dinoout
info: all dependencies installed :)
info: quickscrape launched with...
info: - URL: http://www.biomedcentral.com/1471-2148/14/128/abstract
info: - Scraper: journal-scrapers/peerj.json
info: - Rate limit: 3 per minute
info: - Log level: info
info: urls to scrape: 1
info: processing URL: http://www.biomedcentral.com/1471-2148/14/128/abstract
data: fulltext_pdf: http://www.biomedcentral.com/content/pdf/1471-2148-14-128.pdf
data: title: New clade of enigmatic early archosaurs yields insights into early pseudosuchian phylogeny and the biogeography of the archosaur radiation
data: author: Richard J Butler
data: author: Corwin Sullivan
data: author: Martín D Ezcurra
data: author: Jun Liu
data: author: Agustina Lecuona
data: author: Roland B Sookias
data: date: 2014-06-10
data: doi: 10.1186/1471-2148-14-128
data: volume: 14
data: issue: 1
data: firstpage: 128
data: description: The origin and early radiation of archosaurs and closely related taxa (Archosauriformes) during the Triassic was a critical event in the evolutionary history of tetrapods. This radiation led to the dinosaur-dominated ecosystems of the Jurassic and Cretaceous, and the high present-day archosaur diversity that includes around 10,000 bird and crocodylian species. The timing and dynamics of this evolutionary radiation are currently obscured by the poorly constrained phylogenetic positions of several key early archosauriform taxa, including several species from the Middle Triassic of Argentina (Gracilisuchus stipanicicorum) and China (Turfanosuchus dabanensis, Yonghesuchus sangbiensis). These species act as unstable ‘wildcards’ in morphological phylogenetic analyses, reducing phylogenetic resolution.
info: waiting for 1 downloads to complete in background

/usr/local/lib/node_modules/quickscrape/node_modules/jsdom/lib/jsdom/browser/utils.js:9
raise.call(this, "error", "NOT IMPLEMENTED" + (nameForErrorMessage ? ":
^
TypeError: Cannot call method 'call' of undefined
    at new (/usr/local/lib/node_modules/quickscrape/node_modules/jsdom/lib/jsdom/browser/utils.js:9:13)
    at Object.j as log
    at ra (file://connect.facebook.net/en_GB/all.js#xfbml=1:78:2933)
    at file://connect.facebook.net/en_GB/all.js#xfbml=1:78:3645
    at file://connect.facebook.net/en_GB/all.js#xfbml=1:66:908
    at Array.forEach (native)
    at w (file://connect.facebook.net/en_GB/all.js#xfbml=1:28:757)
    at Object.g.fire (file://connect.facebook.net/en_GB/all.js#xfbml=1:66:868)
    at s (file://connect.facebook.net/en_GB/all.js#xfbml=1:124:1269)
    at file://connect.facebook.net/en_GB/all.js#xfbml=1:124:1580

seesmith commented 10 years ago

Same error with the generic_open scraper.

rossmounce commented 10 years ago

That paper is extremely new ("Published: 10 June 2014"). Perhaps that has something to do with the problem?

seesmith commented 10 years ago

Shall I try an older one from the same journal?

seesmith commented 10 years ago

Same error with this one, published in 2007: http://www.biomedcentral.com/1471-2148/7/67

blahah commented 10 years ago

Most of the scraping is working - you should get the results.json file and the fulltext HTML, but it's crashing while the PDF is downloading. It's a problem with the BMC Facebook plugin, but I haven't got it figured out yet. I expect it will break on most BMC papers regardless of age until I can debug it.

blahah commented 10 years ago

OK, found the problem. When loading the rendered page in order to run the XPath extractors, jsdom tries to download and run the remote JavaScript linked from the page, including the Facebook plugin. That plugin uses some parts of the DOM that aren't implemented in jsdom. The solution is to tell jsdom never to download or run external scripts (they have already been evaluated by the headless browser, so we don't need them re-run). I'll push a fix shortly.
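
For anyone curious what "tell jsdom never to download and run external scripts" could look like, here is a minimal sketch. It is not the actual quickscrape change. It assumes the legacy `jsdom.env(config)` API, whose `features` option accepts `FetchExternalResources` and `ProcessExternalResources`; the helper name `buildJsdomConfig` and the sample HTML are made up for illustration.

```javascript
// Hypothetical helper (not taken from quickscrape): builds a config object
// for the legacy jsdom.env(config) API with external resources switched off.
function buildJsdomConfig(html, done) {
  return {
    html: html,
    features: {
      // Assumption: legacy jsdom feature flags. Setting both to false asks
      // jsdom not to fetch linked resources and not to execute scripts.
      FetchExternalResources: false,
      ProcessExternalResources: false
    },
    done: done
  };
}

// Made-up sample of a page that links a remote script, as BMC pages do.
const renderedHtml =
  '<html><head>' +
  '<script src="http://connect.facebook.net/en_GB/all.js"></script>' +
  '</head><body><h1>Example title</h1></body></html>';

const config = buildJsdomConfig(renderedHtml, function (errors, window) {
  // Extraction would run here, against a DOM with no remote scripts run.
  console.log(window.document.querySelector('h1').textContent);
});

console.log(config.features);
```

With a legacy jsdom version installed, the object would be passed along as `require('jsdom').env(config)`. Newer jsdom releases replaced this API and, as far as I know, do not run scripts or load subresources unless asked to, so this sketch applies only to old versions.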

blahah commented 10 years ago

This is fixed in version 0.1.7, which can now be downloaded from npm.

To upgrade your install just run:

sudo npm install --global quickscrape