ContentMine / quickscrape

A scraping command line tool for the modern web

Generalized post-processing #16

Open noamross opened 10 years ago

noamross commented 10 years ago

Some fields may require post-processing beyond regex, which would be defined in the scraper. Date fields are one example: they may come in various formats depending on the journal. One way to generalize this would be like so:

"date": {
      "selector": "//meta[@name='citation_date']",
      "attribute": "content"
      "post-processor": {
           "name:"  "dateconvert"
           "arguments:" {"format": "%B %d, %Y"}
       }
}

dateconvert.js could be placed in the working directory, the scrapers directory, or a processors directory, and would be called as dateconvert(element, arguments).
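A rough sketch of what such a module might look like (the module resolution, the call signature, and the d3-time-format dependency are all assumptions for illustration, not existing quickscrape behaviour):

// dateconvert.js -- illustrative sketch of a post-processor module.
// Assumes the application resolves the "name" field to this file and
// calls the exported function with the captured value and the
// "arguments" object from the scraper definition.
var d3 = require('d3-time-format');

module.exports = function dateconvert(element, args) {
  // parse the captured string using the strftime-style format supplied
  // in the scraper, e.g. "%B %d, %Y" for "June 30, 2014"
  var date = d3.timeParse(args.format)(element);
  // return an ISO 8601 date, or the original string if parsing fails
  return date ? d3.timeFormat('%Y-%m-%d')(date) : element;
};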

This would allow the flexibility of user-defined functions without going all Zotero.

blahah commented 10 years ago

I agree about the need for post-processing support of some kind. I'm keen to keep the format simple and to consider the impact of any design choice for ScraperJSON on future users. Excuse my thinking out loud below.

One of the strengths of ScraperJSON is that someone can create a scraper without knowing programming, using just inspector tools from a browser and a little knowledge of XPaths. Another strength is that large collections of scrapers can be defined for different sites to extract the same data. This was a design goal because we hope to enable a community of interested volunteers to help create and maintain scrapers for up to 24,000 journals (!!).

I would hope that some of our future users will not be programmers, and if that's the case it would help to keep functions and arguments out of the scraper. On the other hand, perhaps the pool of people savvy enough to write XPath expressions but not willing to learn programming is small, and the goal of excluding anything that looks like programming is not realistic. For the ContentMine ecosystem this conflict will go away in the next phase, when I create web GUI tools for automating scraper creation; that will enable non-programmer volunteers in a much more user-friendly way.

An alternative design that doesn't require programming would be to allow free-form notes on any element. A particular application could then choose to require these to be structured, e.g.

"date": {
      "selector": "//meta[@name='citation_date']",
      "attribute": "content"
      "notes": {
           "dateformat:" "%B %d, %Y"
       }
}
"date": {
      "selector": "//meta[@name='citation_author']",
      "attribute": "content"
      "notes": {
           "nameformat:" "first, last"
       }
}

These could then be processed or ignored by the scraping application as it sees fit.
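For example, an application that recognises the dateformat note could apply it after extraction and simply skip notes it doesn't understand. A minimal sketch (the applyNotes helper and its behaviour are hypothetical, not part of quickscrape):

// Sketch of how a consuming application might act on structured notes.
// Notes it doesn't recognise are ignored, so other tools can add their own.
var d3 = require('d3-time-format');

function applyNotes(value, notes) {
  if (notes && notes.dateformat) {
    // the note tells us how this journal formats its dates
    var date = d3.timeParse(notes.dateformat)(value);
    if (date) return d3.timeFormat('%Y-%m-%d')(date);
  }
  return value; // unrecognised notes (e.g. nameformat) are left to other tools
}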

Very keen to hear ideas about which is better.

noamross commented 10 years ago

This looks good. But I would think that name and date standardization (and possibly other post-processing tasks) would be sufficiently widely applicable that users would want the option of simply selecting these "plugins" for post-processing. In that case you'd want to make sure that the data in notes was standardized for plugin use.

A general note on making these accessible to non-programmers: I think YAML notation is more human-readable, but then you'd have ScraperYAML.

noamross commented 10 years ago

Another thought on this. Perhaps the "notes" field would be a place to put software-specific info, e.g.,

  "url": "\\w+\\.\\w+",
  "elements": {
  "date": {
        "selector": "//meta[@name='citation_author']",
        "attribute": "content"
        "notes": {
             "quickscrape:" {
                 "nameformat:" "%M %Y"
               }
           }
    }
   "notes:" {
        "quickscrape": {
             "ratelimit": {
                 "minute": 1
                 "day": 1000
                 "mode": stochastic_sim
                }
             }
         }
}
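If each application only reads the namespace it owns, the scraper-level notes above could, say, be turned into a delay between requests. A rough sketch under that assumption (requestDelayMs is hypothetical, not quickscrape's actual API):

// Sketch: derive a per-request delay from the scraper-level notes block,
// ignoring notes aimed at other tools. Field names follow the example above.
function requestDelayMs(scraper) {
  var rate = scraper.notes &&
             scraper.notes.quickscrape &&
             scraper.notes.quickscrape.ratelimit;
  if (!rate) return 0; // no rate-limit notes for this tool: no delay
  // honour whichever limit is stricter: per-minute or per-day
  var perMinute = rate.minute ? 60000 / rate.minute : 0;
  var perDay = rate.day ? 86400000 / rate.day : 0;
  return Math.max(perMinute, perDay);
}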