ContentMine / journal-scrapers

Journal scraper definitions for the ContentMine framework

Adding first scrapers for Acta Cryst. E #26

Closed sauliusg closed 9 years ago

sauliusg commented 9 years ago

Adding the first version of the IUCr Acta Cryst. E journal scrapers. There are four different scrapers for four styles of IUCr URLs, since each URL delivers slightly different content.

A test file with 5 URLs of each kind is provided.
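As a rough illustration of why one scraper per URL style is needed, the sketch below maps a URL to a style name by regular expression. Only two of the four styles appear in this thread (the dx.doi.org form and the scripts.iucr.org/cgi-bin/paper form), so the patterns, the style names and the `styleFor` helper are all assumptions made for illustration and are not taken from the scraper definitions in the pull request.

```javascript
// Hypothetical URL-style table: names and patterns are illustrative assumptions.
// Only the two URL forms quoted in this thread are covered.
const urlStyles = [
  { name: 'doi', pattern: /^https?:\/\/dx\.doi\.org\/10\.1107\// },
  { name: 'cgi-paper', pattern: /^https?:\/\/scripts\.iucr\.org\/cgi-bin\/paper\?/ },
];

// Return the name of the first style whose pattern matches, or null if none does.
function styleFor(url) {
  const hit = urlStyles.find((style) => style.pattern.test(url));
  return hit ? hit.name : null;
}

console.log(styleFor('http://dx.doi.org/10.1107/S1600536812006691'));
console.log(styleFor('http://scripts.iucr.org/cgi-bin/paper?S160053681200801X'));
```

A URL that matches none of the patterns yields `null`, which is the case a further scraper definition would have to cover.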

The 'quickscrape' command works as expected when run with each URL individually:

saulius@koala journal-scrapers/ > quickscrape --url 'http://scripts.iucr.org/cgi-bin/paper?S160053681200801X' --scraperdir scrapers

but fails when a list of URLs is passed as an argument:

saulius@koala journal-scrapers/ > quickscrape --urllist test/acta-e_test_urls.txt --scraperdir scrapers
info: quickscrape launched with...
info: - URLs from file: undefined
info: - Scraperdir: scrapers
info: - Rate limit: 3 per minute
info: - Log level: info
info: urls to scrape: 21
info: processing URL: http://dx.doi.org/10.1107/S1600536812006691
info: [scraper]. URL rendered. http://dx.doi.org/10.1107/S1600536812006691.
info: [scraper]. download started. fulltext.html.
info: [scraper]. download started. fulltext.pdf.
info: waiting 20 seconds before next scrape

/usr/lib/node_modules/quickscrape/lib/eventparse.js:63
    msg = msg.concat([var1])
              ^
TypeError: Cannot call method 'concat' of undefined
    at Object.module.exports.compose (/usr/lib/node_modules/quickscrape/lib/eventparse.js:63:15)
    at null.<anonymous> (/usr/lib/node_modules/quickscrape/bin/quickscrape.js:153:18)
    at EventEmitter.emit (/usr/lib/node_modules/quickscrape/node_modules/thresher/node_modules/eventemitter2/lib/eventemitter2.js:339:22)
    at null.<anonymous> (/usr/lib/node_modules/quickscrape/node_modules/thresher/lib/thresher.js:69:14)
    at EventEmitter.emit (/usr/lib/node_modules/quickscrape/node_modules/thresher/node_modules/eventemitter2/lib/eventemitter2.js:339:22)
    at null.cb (/usr/lib/node_modules/quickscrape/node_modules/thresher/lib/scraper.js:309:15)
    at Ticker.tick (/usr/lib/node_modules/quickscrape/node_modules/thresher/lib/ticker.js:32:10)
    at null.<anonymous> (/usr/lib/node_modules/quickscrape/node_modules/thresher/lib/scraper.js:335:20)
    at EventEmitter.emit (/usr/lib/node_modules/quickscrape/node_modules/thresher/node_modules/eventemitter2/lib/eventemitter2.js:339:22)
    at WriteStream.<anonymous> (/usr/lib/node_modules/quickscrape/node_modules/thresher/lib/download.js:70:8)

The output for the first URL is created in the output/ dir, however:

saulius@koala journal-scrapers/ > tree output/
output/
└── http_dx.doi.org_10.1107_S1600536812006691
    ├── fulltext.html
    ├── fulltext.pdf
    └── results.json
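For readers unfamiliar with this kind of error: the TypeError in the trace is what Node raises when `.concat` is called on a value that is `undefined`. The sketch below reproduces that failure mode and shows one defensive fix. Only the statement `msg = msg.concat([var1])` comes from the trace; the function names `composeUnsafe` and `composeSafe` are hypothetical, and the real code in eventparse.js may look quite different.

```javascript
// Hypothetical reconstruction: only `msg = msg.concat([var1])` is taken from the trace.
function composeUnsafe(msg, var1) {
  // Throws a TypeError when msg is undefined, because undefined has no concat method.
  msg = msg.concat([var1]);
  return msg;
}

function composeSafe(msg, var1) {
  // Fall back to an empty array so that concat always has a receiver.
  msg = (msg || []).concat([var1]);
  return msg;
}

// Demonstrate the failure and the guarded variant.
let failed = false;
try {
  composeUnsafe(undefined, 'fulltext.pdf');
} catch (err) {
  failed = err instanceof TypeError;
}
console.log('unsafe variant threw TypeError:', failed);
console.log('safe variant returned:', composeSafe(undefined, 'fulltext.pdf'));
```

Newer Node versions word the error message differently, but the error type is still TypeError.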

blahah commented 9 years ago

Thanks Saulius - the multiple URLs thing is a bug in quickscrape that is high on my to-do list: https://github.com/ContentMine/quickscrape/issues/33.

sauliusg commented 9 years ago

On 2015-01-23 16:51, Richard Smith-Unna wrote:

Thanks Saulius - the multiple URLs thing is a bug in quickscrape that is high on my to-do list: ContentMine/quickscrape#33 https://github.com/ContentMine/quickscrape/issues/33.

Good, please let me know when you fix it. I'll try to get 'quickscrape' working from the repo; it is a great tool, and we would like to use it to gather more open crystallographic data. :)

Regards, Saulius

Dr. Saulius Gražulis Vilnius University Institute of Biotechnology, Graiciuno 8 LT-02241 Vilnius, Lietuva (Lithuania) fax: (+370-5)-2602116 / phone (office): (+370-5)-2602556 mobile: (+370-684)-49802, (+370-614)-36366

sauliusg commented 9 years ago

Hi, Richard,

I have pushed the updated Acta Cryst. E scrapers to GitHub -- they now support extraction of most of the metadata and download paper texts and abstracts, like the rest of the scrapers.

The new commits seem to have landed in the same pull request that is already pending, so please pull everything if you see fit.

Regards, Saulius
