ContentMine / quickscrape

A scraping command line tool for the modern web
MIT License

Fails when using relative paths - depends on platform? #72

Open robintw opened 8 years ago

robintw commented 8 years ago

I have a slightly strange problem with quickscrape.

I want to run something like this:

    quickscrape --urllist test_dois.txt --scraper ../journal-scrapers/scrapers/plos.json --output plos-test2

That is, I want to use relative paths for the URL list and the scraper file.

When running this on OS X it works fine, but when running on my Linux server I get an error saying that it can't find the urllist file.

Simplifying this a bit and looking just at the urllist file, if I run

    ./quickscrape.js --urllist test_dois.txt --scraper /mnt/cm-volume/content-mine/journal-scrapers/scrapers/plos.json --output plos-test2

I get:

info: quickscrape 0.4.7 launched with...
info: - URLs from file: undefined
info: - Scraper: /mnt/cm-volume/content-mine/journal-scrapers/scrapers/plos.json
info: - Rate limit: 3 per minute
info: - Log level: info

fs.js:427
  return binding.open(pathModule._makeLong(path), stringToFlags(flags), mode);
                 ^
Error: ENOENT, no such file or directory 'test_dois.txt'
    at Object.fs.openSync (fs.js:427:18)
    at Object.fs.readFileSync (fs.js:284:15)
    at loadUrls (/mnt/cm-volume/content-mine/quickscrape/bin/quickscrape.js:154:17)
    at Object.<anonymous> (/mnt/cm-volume/content-mine/quickscrape/bin/quickscrape.js:164:41)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Function.Module.runMain (module.js:497:10)
    at startup (node.js:119:16)

I have absolutely no idea why this is behaving differently on Linux to OS X.

Interesting, I seem to be able to fix this error by moving the process.chdir call further down the file - so that it is called only after the URL list has been loaded (see the diff at https://github.com/ContentMine/quickscrape/compare/master...robintw:relative-paths). This seems to work on both Linux and OS X, and I'm happy to submit this as a PR if that would be useful.
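
An alternative to reordering the calls would be to turn every user-supplied path into an absolute path before the working directory changes. The sketch below is not taken from quickscrape; `resolveCliPaths` is a hypothetical helper, the temporary directories and the example URL are placeholders, and it assumes the `process.chdir` call switches into a different directory (such as the output directory), which is not confirmed here. It only illustrates the general idea using Node's standard `path.resolve`.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper (not part of quickscrape): resolve each supplied
// path against the directory the command was launched from, so that a
// later process.chdir() cannot change what the path refers to.
function resolveCliPaths(paths, launchDir) {
  const resolved = {};
  for (const [name, value] of Object.entries(paths)) {
    resolved[name] =
      value === undefined ? undefined : path.resolve(launchDir, value);
  }
  return resolved;
}

// Demo setup: two placeholder directories standing in for the launch
// directory and the directory the program later changes into.
const launchDir = fs.mkdtempSync(path.join(os.tmpdir(), "launch-"));
const otherDir = fs.mkdtempSync(path.join(os.tmpdir(), "output-"));
fs.writeFileSync(
  path.join(launchDir, "test_dois.txt"),
  "https://example.org/article-1\n"
);

const originalCwd = process.cwd();
process.chdir(launchDir);

// Resolve while the working directory is still the launch directory.
const args = resolveCliPaths({ urllist: "test_dois.txt" }, process.cwd());

// Change directory afterwards; the absolute path still points at the file.
process.chdir(otherDir);
const urls = fs
  .readFileSync(args.urllist, "utf8")
  .split("\n")
  .filter(Boolean);

process.chdir(originalCwd);
console.log(args.urllist, urls);
```

With this approach the order of the chdir call and the file reads no longer matters, at the cost of having to remember to resolve every path option (URL list, scraper file, and so on) up front.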

I must say, I'm a bit confused by all of this though - and wondering whether I am being really stupid!