quickscrape is a simple command-line tool for powerful, modern website scraping.
quickscrape is not like other scraping tools. It is designed to enable large-scale content mining. Here's what makes it different:
Websites can be rendered in a GUI-less browser (PhantomJS via CasperJS). This has some important benefits:
Scrapers are defined in separate JSON files that follow a defined structure (scraperJSON). This too has important benefits:
quickscrape is being developed to allow the community early access to the technology that will drive ContentMine, such as ScraperJSON and our Node.js scraping library thresher.
The software is under rapid development, so please be aware there may be bugs. If you find one, please report it on the issue tracker.
You'll need Node.js (node), a platform which enables standalone JavaScript apps. You'll also need the Node package manager (npm), which usually comes with Node.js. Installing Node.js via the operating system's package manager leads to issues. If Node.js is already installed and requires sudo to install node packages, it was installed the wrong way. The easiest way to do it right on Unix systems (e.g. Linux, OS X) is to use NVM, the Node version manager.
First, install NVM:
curl https://raw.githubusercontent.com/creationix/nvm/v0.24.1/install.sh | bash
or, if you don't have curl:
wget -qO- https://raw.githubusercontent.com/creationix/nvm/v0.24.1/install.sh | bash
NB: on OS X, you will need to have the developer tools installed (e.g. by installing Xcode).
Then install Node.js (the commands here use the 0.10 release line), which automatically installs npm as well, and set that version as the default:
source ~/.nvm/nvm.sh
nvm install 0.10
nvm alias default 0.10
nvm use default
Now you should have node and npm available. Check by running:
node -v
npm -v
If both of those printed version numbers, you're ready to move on to installing quickscrape.
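The same check can be scripted so that a missing tool is reported by name. This is a minimal sketch in POSIX shell; `check_tools` is a hypothetical helper name and not part of quickscrape or NVM.

```shell
# check_tools (hypothetical helper): report whether each named tool is on PATH.
# Returns 0 if all tools were found, 1 if any is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Verify the two prerequisites before installing quickscrape.
check_tools node npm || echo "install Node.js via NVM before continuing"
```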
quickscrape is very easy to install. Simply:
npm install --global quickscrape
Run quickscrape --help from the command line to get help:
Usage: quickscrape [options]

Options:

  -h, --help               output usage information
  -V, --version            output the version number
  -u, --url <url>          URL to scrape
  -r, --urllist <path>     path to file with list of URLs to scrape (one per line)
  -s, --scraper <path>     path to scraper definition (in JSON format)
  -d, --scraperdir <path>  path to directory containing scraper definitions (in JSON format)
  -o, --output <path>      where to output results (directory will be created if it doesn't exist)
  -r, --ratelimit <int>    maximum number of scrapes per minute (default 3)
  -h --headless            render all pages in a headless browser
  -l, --loglevel <level>   amount of information to log (silent, verbose, info*, data, warn, error, or debug)
  -f, --outformat <name>   JSON format to transform results into (currently only bibjson)
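The --urllist and --scraperdir options can be combined to scrape several pages in one run. The following is a sketch only: it assumes the journal-scrapers repository has already been cloned into the working directory, and the file name urls.txt and output directory peerj-batch are illustrative choices, not names that quickscrape requires. The long option names are used because the help output lists -r for both --urllist and --ratelimit.

```shell
# Build a URL list, one URL per line (append further article URLs as needed).
printf '%s\n' 'https://peerj.com/articles/384' > urls.txt

# Only attempt the scrape when quickscrape and the scraper definitions are present.
if command -v quickscrape >/dev/null 2>&1 && [ -d journal-scrapers/scrapers ]; then
  # peerj-batch is an illustrative output directory name.
  quickscrape \
    --urllist urls.txt \
    --scraperdir journal-scrapers/scrapers \
    --output peerj-batch \
    --ratelimit 3 \
    --outformat bibjson
else
  echo "quickscrape or journal-scrapers/scrapers not found; skipping scrape" >&2
fi
```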
You must provide scraper definitions in ScraperJSON format as used in the ContentMine journal-scrapers.
First, you'll want to grab some pre-cooked definitions:
git clone https://github.com/ContentMine/journal-scrapers.git
Now just run quickscrape:
quickscrape \
  --url https://peerj.com/articles/384 \
  --scraper journal-scrapers/scrapers/peerj.json \
  --output peerj-384 \
  --outformat bibjson
You'll see log messages informing you how the scraping proceeds.
Then, in the peerj-384 directory, there are several files:
$ tree peerj-384
peerj-384/
└── https_peerj.com_articles_384
    ├── bib.json
    ├── fig-1-full.png
    ├── fulltext.html
    ├── fulltext.pdf
    ├── fulltext.xml
    └── results.json
fulltext.html is the fulltext HTML (duh!)
results.json is a JSON file containing all the captured data
bib.json is a JSON file containing the results in bibJSON format
fig-1-full.png is the downloaded image from the only figure in the paper

results.json looks like this (truncated):
{
  "publisher": {
    "value": [
      "PeerJ Inc."
    ]
  },
  "journal_name": {
    "value": [
      "PeerJ"
    ]
  },
  "journal_issn": {
    "value": [
      "2167-8359"
    ]
  },
  "title": {
    "value": [
      "Mutation analysis of the SLC26A4, FOXI1 and KCNJ10 genes in individuals with congenital hearing loss"
    ]
  },
  "keywords": {
    "value": [
      "Pendred; MLPA; DFNB4; \n SLC26A4\n ; FOXI1 and KCNJ10; Genotyping; Genetics; SNHL"
    ]
  },
  "author_name": {
    "value": [
      "Lynn M. Pique",
      "Marie-Luise Brennan",
      "Colin J. Davidson",
      "Frederick Schaefer",
      "John Greinwald Jr",
      "Iris Schrijver"
    ]
  }
}
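Each captured element in results.json holds its results in a "value" array, so individual fields are easy to pull out on the command line. The sketch below assumes the jq JSON processor is installed; jq is a separate tool, not something quickscrape provides, and `qs_values` is a hypothetical helper name.

```shell
# qs_values (hypothetical helper): print every value captured for one element,
# one per line. An element that is absent from the file prints nothing.
# Usage: qs_values <path-to-results.json> <element-name>
qs_values() {
  jq -r --arg key "$2" '.[$key].value[]?' "$1"
}
```

For example, `qs_values peerj-384/https_peerj.com_articles_384/results.json author_name` would list the captured author names from the scrape shown here, one per line.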
bib.json looks like this (truncated):
{
  "title": "Mutation analysis of the SLC26A4, FOXI1 and KCNJ10 genes in individuals with congenital hearing loss",
  "link": [
    {
      "type": "fulltext_html",
      "url": "https://peerj.com/articles/384"
    },
    {
      "type": "fulltext_pdf",
      "url": "https://peerj.com/articles/384.pdf"
    },
    {
      "type": "fulltext_xml",
      "url": "/articles/384.xml"
    }
  ],
  "author": [
    {
      "name": "Lynn M. Pique",
      "institution": "Department of Pathology, Stanford University Medical Center, Stanford, CA, USA"
    },
    {
      "name": "Marie-Luise Brennan",
      "institution": "Department of Pediatrics, Stanford University Medical Center, Stanford, CA, USA"
    }
  ]
}
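Note that the fulltext_xml link in this output is root-relative (/articles/384.xml), while the other two links are absolute. If absolute URLs are needed downstream, a link can be resolved against the scraped page's URL. This is a minimal sketch in POSIX shell that handles only absolute and root-relative links; `resolve_link` is a hypothetical helper and not part of quickscrape.

```shell
# resolve_link (hypothetical helper): turn a root-relative link into an
# absolute URL using the scheme and host of the scraped page.
# Usage: resolve_link <page-url> <link>
resolve_link() {
  page_url=$1
  link=$2
  case $link in
    http://*|https://*)
      # Already absolute: pass through unchanged.
      printf '%s\n' "$link" ;;
    /*)
      scheme=${page_url%%://*}   # e.g. https
      rest=${page_url#*://}      # e.g. peerj.com/articles/384
      host=${rest%%/*}           # e.g. peerj.com
      printf '%s://%s%s\n' "$scheme" "$host" "$link" ;;
    *)
      # Other relative forms are left untouched in this sketch.
      printf '%s\n' "$link" ;;
  esac
}

resolve_link https://peerj.com/articles/384 /articles/384.xml
# prints https://peerj.com/articles/384.xml
```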
We are not yet accepting contributions. If you'd like to help, please drop me an email (richard@contentmine.org) and I'll let you know when we're ready for that.
Copyright (c) 2014 Shuttleworth Foundation. Licensed under the MIT license.