ContentMine / scraperJSON

Creative Commons Zero v1.0 Universal

scraperJSON

The scraperJSON standard for defining web scrapers as JSON objects.

Purpose

scraperJSON is a JSON schema for defining web scrapers in a standardised way. Defining scrapers this way enables large-scale scraping and mining of similar data from many different sources.

Status

The specification is still an early draft and is changing quickly as we learn what the system needs in real use.

Because of this, the standard is simply described in text here, with a reference implementation in the Node.js library thresher and the command-line app quickscrape.

The schema will be formally defined once we reach a stable set of features.

Specification

The current schema is described below.

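There can be two keys in the root object: url, a regular expression (written as a string) matching the URLs the scraper applies to, and elements, a dictionary of the elements to scrape.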

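Elements are defined as key-value pairs, where the key is a description of the element and the value is a dictionary of specifiers defining the element and its processing. The specifier keys used in the example below are selector (an XPath expression locating the element), attribute (the attribute to read from the selected node), regex (a source pattern plus a list of flags), and download (optionally with a rename target).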

Example:

{
  "url": "plos.*\\.org",
  "elements": {
    "fulltext_pdf": {
      "selector": "//meta[@name='citation_pdf_url']",
      "attribute": "content",
      "regex": {
        "flags": ["g", "m"],
        "source": "(\\w+)"
      },
      "download": {
        "rename": "fulltext.pdf"
      }
    },
    "title": {
      "selector": "//meta[@name='citation_title']"
    }
  }
}
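
To make the example above concrete, here is a minimal Python sketch of how a consumer might sanity-check a definition, decide whether it applies to a page URL, and apply an element's regex specifier to an extracted value. It is not the thresher implementation: the function names (check_definition, applies_to, apply_regex), the assumption that every element needs a selector, and the mapping of the JavaScript-style flags onto Python's re module are all illustrative assumptions.

```python
import json
import re

# Assumed mapping of JavaScript-style regex flags to Python's re flags.
# "g" has no Python flag; it is handled by using findall instead of search.
JS_FLAGS = {"i": re.IGNORECASE, "m": re.MULTILINE}

# The example definition from this section, parsed from its JSON text.
DEFINITION = json.loads(r"""
{
  "url": "plos.*\\.org",
  "elements": {
    "fulltext_pdf": {
      "selector": "//meta[@name='citation_pdf_url']",
      "attribute": "content",
      "regex": {"flags": ["g", "m"], "source": "(\\w+)"},
      "download": {"rename": "fulltext.pdf"}
    },
    "title": {"selector": "//meta[@name='citation_title']"}
  }
}
""")


def check_definition(definition):
    """Return a list of problems in a definition (empty if none found)."""
    problems = []
    if not isinstance(definition.get("url"), str):
        problems.append("'url' must be a string")
    elements = definition.get("elements")
    if not isinstance(elements, dict):
        problems.append("'elements' must be an object")
        return problems
    for name, spec in elements.items():
        # Assumption: every element must at least carry a selector.
        if not isinstance(spec, dict) or "selector" not in spec:
            problems.append(f"element '{name}' needs a 'selector'")
    return problems


def applies_to(definition, page_url):
    """True if the definition's url pattern matches anywhere in page_url."""
    return re.search(definition["url"], page_url) is not None


def apply_regex(spec, value):
    """Apply an element's optional regex specifier to an extracted string."""
    regex = spec.get("regex")
    if regex is None:
        return [value]  # no regex: pass the value through unchanged
    flag_names = regex.get("flags", [])
    flags = 0
    for name in flag_names:
        flags |= JS_FLAGS.get(name, 0)
    pattern = re.compile(regex["source"], flags)
    if "g" in flag_names:
        return pattern.findall(value)  # global: collect every match
    match = pattern.search(value)
    if match is None:
        return []
    # Prefer the first capture group when the pattern defines one.
    return [match.group(1) if pattern.groups else match.group(0)]
```

For instance, applies_to(DEFINITION, "http://journals.plos.org/plosone/") is true while a URL on another domain is rejected, which is how a tool could choose among many scraper definitions. Returning a list from apply_regex keeps the global and non-global cases uniform for the caller.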

Changelog

0.0.1 - add download renaming
0.0.2 - add regex