ContentMine / quickscrape

A scraping command line tool for the modern web
MIT License

Can't scrape PLOS ONE paper #47

Closed: skasberger closed this 9 years ago

skasberger commented 9 years ago

I tried to scrape a list of URLs from urls.txt and got a TypeError:

workshop@crunchbang:~/workshop$ quickscrape --urllist test/urls.txt --scraperdir test/
info: quickscrape launched with...
info: - URLs from file: undefined
info: - Scraperdir: test/
info: - Rate limit: 3 per minute
info: - Log level: info
info: urls to scrape: 4
info: processing URL: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.0030039

TypeError: Cannot read property 'actions' of null
    at /home/workshop/.nvm/v0.10.24/lib/node_modules/quickscrape/node_modules/thresher/lib/thresher.js:100:16
    at Request._callback (/home/workshop/.nvm/v0.10.24/lib/node_modules/quickscrape/node_modules/thresher/lib/url.js:60:5)
    at Request.self.callback (/home/workshop/.nvm/v0.10.24/lib/node_modules/quickscrape/node_modules/thresher/node_modules/request/request.js:360:22)
    at Request.EventEmitter.emit (events.js:98:17)
    at Request.<anonymous> (/home/workshop/.nvm/v0.10.24/lib/node_modules/quickscrape/node_modules/thresher/node_modules/request/request.js:1202:14)
    at Request.EventEmitter.emit (events.js:117:20)
    at IncomingMessage.<anonymous> (/home/workshop/.nvm/v0.10.24/lib/node_modules/quickscrape/node_modules/thresher/node_modules/request/request.js:1150:12)
    at IncomingMessage.EventEmitter.emit (events.js:117:20)
    at _stream_readable.js:920:16
    at process._tickCallback (node.js:415:13)

urls.txt:

http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.0030039
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003731
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000339
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002791

skasberger commented 9 years ago

The mistake was on my side: I chose the wrong scraperdir.
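
One plausible reading of the trace, given the resolution above, is that no scraper definition in the chosen directory matched the URL, so the lookup returned null and the code then read `.actions` from it. The sketch below illustrates that failure mode and a guard that would turn it into a readable error. It is not thresher's or quickscrape's actual code: `definitions`, `findDefinition`, `actionsFor`, and the shape of a definition are all hypothetical names chosen for illustration.

```javascript
// Hypothetical sketch: none of these names come from thresher or quickscrape.
// Each definition pairs a URL pattern with the actions to run on a match.
const definitions = [
  { url: /^https?:\/\/journals\.plos\.org\//, actions: ['title', 'authors'] },
];

// Return the first definition whose pattern matches the URL, or null if none does.
function findDefinition(url, defs) {
  return defs.find((d) => d.url.test(url)) || null;
}

// Check for a missing definition before touching `.actions`, so a wrong
// scraper directory produces a clear message instead of a TypeError.
function actionsFor(url, defs) {
  const def = findDefinition(url, defs);
  if (def === null) {
    throw new Error(
      'No scraper definition matches ' + url + '; check the scraper directory'
    );
  }
  return def.actions;
}
```

Calling `actionsFor` with an empty definition list (as if the scraper directory held no matching scraper) throws the descriptive error rather than failing on a null property access.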