ContentMine / quickscrape

A scraping command line tool for the modern web
MIT License

scraper directory uses the user-input output directory #95

Closed: rossmounce closed this issue 8 years ago

rossmounce commented 8 years ago

I installed quickscrape as per the readme instructions and cloned the example journal scrapers repo.

I tried the first peerj-384 example in the readme, but it didn't work.

The problem is that it appears to look for the scraper file inside the specified output folder! E.g. instead of looking in: journal-scrapers/scrapers/peerj.json

it looks for the scraper file in: peerj-384/journal-scrapers/scrapers/peerj.json

A quick workaround is to specify the output folder as .
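
One plausible explanation, sketched below, is that the relative --scraper path gets resolved only after the working directory has moved into the output folder. This is an assumption about the cause, not a reading of the quickscrape source. The function names scraperPathBuggy and scraperPathFixed are hypothetical, and path.posix is used so the result is the same on any platform.

```javascript
const path = require('path');

// Assumed buggy order: the working directory has already changed to the
// output folder, so the relative scraper path is resolved against it.
function scraperPathBuggy(launchDir, outputDir, scraperArg) {
  const cwdAfterChdir = path.posix.resolve(launchDir, outputDir);
  return path.posix.resolve(cwdAfterChdir, scraperArg);
}

// Assumed fix: resolve the scraper path against the directory the command
// was launched from, before any change of working directory.
function scraperPathFixed(launchDir, scraperArg) {
  return path.posix.resolve(launchDir, scraperArg);
}

const launchDir = '/home/ross/Downloads/pica';
const scraperArg = 'journal-scrapers/scrapers/peerj.json';

console.log(scraperPathBuggy(launchDir, 'peerj-384', scraperArg));
// /home/ross/Downloads/pica/peerj-384/journal-scrapers/scrapers/peerj.json

console.log(scraperPathBuggy(launchDir, '.', scraperArg));
// /home/ross/Downloads/pica/journal-scrapers/scrapers/peerj.json

console.log(scraperPathFixed(launchDir, scraperArg));
// /home/ross/Downloads/pica/journal-scrapers/scrapers/peerj.json
```

If this is what happens, it would explain why an output folder of . works: changing into . leaves the working directory where it was, so both orders give the same path. Under the same assumption, an absolute --scraper path should also sidestep the problem, since resolving an absolute path ignores the working directory.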

$ quickscrape -V
0.4.7
$ node -v
v0.10.48
$ npm -v
2.15.1
$ quickscrape \
>   --url https://peerj.com/articles/384 \
>   --scraper journal-scrapers/scrapers/peerj.json \
>   --output peerj-384
info: quickscrape 0.4.7 launched with...
info: - URL: https://peerj.com/articles/384
info: - Scraper: /home/ross/Downloads/pica/peerj-384/journal-scrapers/scrapers/peerj.json
info: - Rate limit: 3 per minute
info: - Log level: info

fs.js:439
  return binding.open(pathModule._makeLong(path), stringToFlags(flags), mode);
                 ^
Error: ENOENT, no such file or directory '/home/ross/Downloads/pica/peerj-384/journal-scrapers/scrapers/peerj.json'
    at Object.fs.openSync (fs.js:439:18)
    at Object.fs.readFileSync (fs.js:290:15)
    at Object.<anonymous> (/home/ross/.nvm/v0.10.48/lib/node_modules/quickscrape/bin/quickscrape.js:138:23)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Function.Module.runMain (module.js:497:10)
    at startup (node.js:119:16)
    at node.js:945:3
ross@ross-x3:~/Downloads/pica$ quickscrape   --url https://peerj.com/articles/384   --scraper journal-scrapers/scrapers/peerj.json   --output .
info: quickscrape 0.4.7 launched with...
info: - URL: https://peerj.com/articles/384
info: - Scraper: /home/ross/Downloads/pica/journal-scrapers/scrapers/peerj.json
info: - Rate limit: 3 per minute
info: - Log level: info
info: urls to scrape: 1
info: processing URL: https://peerj.com/articles/384
info: [scraper]. URL rendered. https://peerj.com/articles/384.
info: [scraper]. download started. fulltext.xml.
info: [scraper]. download started. fulltext.xml.
info: [scraper]. download started. fulltext.html.
info: [scraper]. download started. fulltext.pdf.
info: [scraper]. download started. fig-1-full.png.
info: URL processed: captured 28/34 elements (6 captures failed)
info: all tasks completed

$ tree https_peerj.com_articles_384/
https_peerj.com_articles_384/
├── fig-1-full.png
├── fulltext.html
├── fulltext.pdf
├── fulltext.xml
└── results.json

0 directories, 5 files

rossmounce commented 8 years ago

Looks like this issue has been reported before, but not fixed yet in the main code: https://github.com/ContentMine/quickscrape/issues/56

tarrow commented 8 years ago

Yep; these are quite old bugs, but since fewer people seem to have been interested in quickscrape and work has been going on updating thresher, the code to fix this hasn't made it to the master branch yet. Have a look at tarrow/master, where lots of these fixes live.

rossmounce commented 8 years ago

ah cool. I shall have a look at tarrow/master then, thx :)

rossmounce commented 8 years ago

@tarrow erm... I just realised I don't know how to install this kind of code from source. tarrow/master is at https://github.com/tarrow/quickscrape, right?

npm install --global quickscrape

won't install your quickscrape will it?

How do I install your updated quickscrape?

tarrow commented 8 years ago

npm install --global tarrow/quickscrape should do it :)

rossmounce commented 8 years ago

cheers. I know literally nothing about npm :sob: