ContentMine / quickscrape

A scraping command line tool for the modern web
MIT License

Error: malformed URL: -r #43

Closed by rossmounce 9 years ago

rossmounce commented 9 years ago

Very reproducible bug:

$ quickscrape -url http://ijs.sgmjournals.org/content/64/Pt_5/1802.full --loglevel verbose --scraper jscrapers/scrapers/ijsem.json  --output ./new --outformat bibjson
info: quickscrape launched with...
info: - URL: -r
info: - Scraper: jscrapers/scrapers/ijsem.json
info: - Rate limit: 3 per minute
info: - Log level: verbose
info: urls to scrape: 1
info: processing URL: -r

/home/jing/.nvm/v0.10.38/lib/node_modules/quickscrape/node_modules/thresher/lib/url.js:29
    throw e;
          ^
Error: malformed URL: -r; protocol missing (must include http(s):// or ftp(s)://), domain missing
    at Object.url.checkUrl (/home/jing/.nvm/v0.10.38/lib/node_modules/quickscrape/node_modules/thresher/lib/url.js:28:13)
    at Thresher.scrape (/home/jing/.nvm/v0.10.38/lib/node_modules/quickscrape/node_modules/thresher/lib/thresher.js:54:7)
    at processUrl (/home/jing/.nvm/v0.10.38/lib/node_modules/quickscrape/bin/quickscrape.js:183:7)
    at null._onTimeout (/home/jing/.nvm/v0.10.38/lib/node_modules/quickscrape/bin/quickscrape.js:206:5)
    at Timer.listOnTimeout [as ontimeout] (timers.js:112:15)

rossmounce commented 9 years ago

Hmmm... could be that I'm getting paywalled / blocked (redirected?) rather than an error with quickscrape per se?

blahah commented 9 years ago

no, it's that you missed out a - in --url in the command (you typed -url), so quickscrape interpreted it as -u rl
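
The log above reports the URL as -r rather than rl, which would be consistent with the option parser splitting a single-dash cluster into one short flag per letter, so that -u takes the next token (-r) as its value. A minimal sketch of that behaviour follows. It is an illustration only, not quickscrape's actual parsing code: the cluster-expansion rule is an assumption, and expandShortFlags and urlValue are hypothetical helper names.

```javascript
// Illustrative sketch only: NOT quickscrape's actual parser.
// Assumption: a single-dash cluster such as "-url" is expanded
// into one short flag per letter ("-u", "-r", "-l").
function expandShortFlags(argv) {
  const out = [];
  for (const arg of argv) {
    if (arg.length > 2 && arg[0] === "-" && arg[1] !== "-") {
      // Single dash followed by several letters: split into short flags.
      for (const ch of arg.slice(1)) out.push("-" + ch);
    } else {
      // Long options ("--url"), lone short flags and plain values pass through.
      out.push(arg);
    }
  }
  return out;
}

// Hypothetical helper: return the token that follows the first -u or --url.
function urlValue(argv) {
  const args = expandShortFlags(argv);
  const i = args.findIndex((a) => a === "-u" || a === "--url");
  return i === -1 ? undefined : args[i + 1];
}

console.log(urlValue(["-url", "http://example.org/a"]));  // -r
console.log(urlValue(["--url", "http://example.org/a"])); // http://example.org/a
```

Under that assumption, the same command with --url instead of -url hands the real address to the scraper.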

rossmounce commented 9 years ago

oooops! good catch. silly me

blahah commented 9 years ago

:)

rossmounce commented 9 years ago

Still can't work out why that URL failed in my bash while loop, though. I'll post that as a separate issue.