The scraperJSON standard for defining web scrapers as JSON objects.
scraperJSON is a JSON schema for defining web scrapers in a standardised way. Defining web scrapers in such a way enables mass-scale scraping and mining of similar data from many different sources, for example:
The specification is still in early drafting and is currently evolving very fast as our understanding of the potential needs of the system in real use develops.
Because of this, the standard is simply described in text here, with a reference implementation in a Node.js library, thresher, and a command-line app, quickscrape.
The schema will be formally defined once we reach a stable set of features.
The current schema is described below.
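There can be two keys in the root object:
- url - a regular expression, given as a string, specifying which URL(s) the scraper targets.
- elements - a dictionary of the elements to scrape.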
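A minimal sketch of how a tool might use the root object's url key to decide whether a scraper applies to a page. It assumes, as the example further down suggests, that the value is a regular expression searched for anywhere in the page URL. The function name scraperMatchesUrl and the sample URLs are hypothetical, and the reference implementation may handle matching differently.

```javascript
// Hypothetical helper: does a scraperJSON definition apply to a given page URL?
// Assumption: the `url` value is a regular-expression source string that is
// searched for anywhere in the URL (it is not anchored to the start or end).
function scraperMatchesUrl(definition, pageUrl) {
  if (typeof definition.url !== 'string') {
    throw new TypeError('definition.url must be a string');
  }
  return new RegExp(definition.url).test(pageUrl);
}

const definition = {
  url: 'plos.*\\.org', // same pattern as the example definition
  elements: {}
};

// Hypothetical sample URLs, used only for illustration.
console.log(scraperMatchesUrl(definition, 'http://www.plosone.org/article/example'));
console.log(scraperMatchesUrl(definition, 'http://example.com/article'));
```

Under these assumptions the first URL matches the pattern and the second does not, so a tool holding many definitions could pick the applicable ones by filtering on this check.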
Elements are defined as key-value pairs, where the key is a description of the element, and the value is a dictionary of specifiers defining the element and its processing. Allowed keys in the specifier dictionary are:
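- selector - an XPath selector targeting the element(s) to be selected. Required.
- attribute - a string specifying the attribute to extract from the selected element. Optional (omitting this key is equivalent to giving it a value of text). In addition to HTML attributes there are two special attributes allowed:
  - text - extracts any plaintext inside the selected element
  - html - extracts the inner HTML of the selected element
- download - a boolean or an Object. If present and not false (i.e. if the value is true or an Object) the element is treated as a URL to a resource and is downloaded. Optional (omitting this key is equivalent to giving it a value of false). If the value is an object, the following keys are allowed:
  - rename - a string specifying the filename to which the downloaded file will be renamed.
- regex - an Object specifying a regular expression to run on the extracted value, with the captured groups as the result. If the global flag (g) is specified, the result will be an array of arrays of captured groups. There are two keys allowed:
  - source - a string specifying the regular expression to be executed. Required.
  - flags - an array specifying the regex flags to be used (g, m, i, etc.). Optional (omitting this key will cause the regex to be executed with no flags).

Example: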
{
  "url": "plos.*\\.org",
  "elements": {
    "fulltext_pdf": {
      "selector": "//meta[@name='citation_pdf_url']",
      "attribute": "content",
      "regex": {
        "flags": ["g", "m"],
        "source": "(\\w+)"
      },
      "download": {
        "rename": "fulltext.pdf"
      }
    },
    "title": {
      "selector": "//meta[@name='citation_title']"
    }
  }
}
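The regex and download specifiers can be sketched in plain JavaScript as follows. This is an illustration only, not thresher's API: the helper names applyRegex and normaliseDownload are hypothetical, and the sketch assumes that the regex runs on the extracted string, that without the g flag the result is a flat array of captured groups, and that a failed match yields an empty array.

```javascript
// Hypothetical helper: apply a scraperJSON `regex` specifier to an extracted string.
// Assumption: without the global flag the result is an array of captured groups;
// with `g` it is an array of arrays, one inner array per match.
function applyRegex(specifier, value) {
  const flags = (specifier.flags || []).join(''); // no flags if the key is omitted
  const re = new RegExp(specifier.source, flags);
  if (re.global) {
    return Array.from(value.matchAll(re), (match) => match.slice(1));
  }
  const match = re.exec(value);
  return match ? match.slice(1) : []; // assumption: no match gives an empty array
}

// Hypothetical helper: interpret a `download` specifier.
// Omitted or false means no download; true or an Object means download,
// and an Object may carry a `rename` filename.
function normaliseDownload(download) {
  if (!download) {
    return { enabled: false, rename: null };
  }
  const rename =
    typeof download === 'object' && download.rename ? download.rename : null;
  return { enabled: true, rename };
}

const source = '(\\w+)/(\\w+)'; // illustrative pattern with two capture groups
console.log(applyRegex({ source }, 'ab/cd ef/gh'));
console.log(applyRegex({ source, flags: ['g'] }, 'ab/cd ef/gh'));
console.log(normaliseDownload({ rename: 'fulltext.pdf' }));
```

Normalising download into one shape up front means the rest of a scraper only has to check a single enabled field, whichever of the three allowed forms (omitted, boolean, Object) the definition used.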
0.0.1 - add download renaming
0.0.2 - add regex