ContentMine / quickscrape

A scraping command line tool for the modern web
MIT License
Adding regex processing to scrapers #12

Closed ianthe closed 9 years ago

ianthe commented 10 years ago

If the JSON scraper contains a patternProperties entry, quickscrape will try to match the scraped value against it and save the matched substring as the result.
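To make the proposed behaviour concrete, here is a minimal sketch in Python rather than the project's own code. The element layout, the `apply_pattern` helper, and the choice to return `None` when nothing matches are all assumptions for illustration; only the `patternProperties` key name comes from this pull request.

```python
import re

# Hypothetical element definition. Apart from "patternProperties", the key
# names and values are assumptions, not the ScraperJSON specification.
element = {
    "selector": "//span[@class='doi']",
    "patternProperties": r"10\.\d{4,9}/\S+",
}


def apply_pattern(scraped_text, element):
    """Return the first substring of scraped_text matching the element's pattern.

    If the element has no pattern, the scraped text is returned unchanged.
    If the pattern does not match, None is returned (an assumed convention).
    """
    pattern = element.get("patternProperties")
    if pattern is None:
        return scraped_text
    match = re.search(pattern, scraped_text)
    return match.group(0) if match else None


result = apply_pattern("doi: 10.1371/journal.pone.0000001", element)
print(result)  # 10.1371/journal.pone.0000001
```

The sketch saves the whole match; saving only a capture group instead would be a one-line change (`match.group(1)`).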

coveralls commented 10 years ago

Coverage decreased (-1.81%) when pulling 0459592a7ce4c1c4b0956af2937697f2ca4bdb85 on ianthe:master into 94e6bbb3d7626a4d236b1a2a73ec56f734ef4df3 on ContentMine:master.

blahah commented 10 years ago

Thanks for this Ianthe. Your example makes it clear that we should have something like this, and I think having a regex component is probably a sensible route to take.

I want to examine a couple of things first.

  1. The name. Whatever the feature is called, it has to go into the developing ScraperJSON standard. patternProperties is quite long and, to my mind, isn't explicit about what it does. I would prefer either:
    • capture
    • regex
  2. In your code we extract only the first capture. Is this enough flexibility? Should we allow extracting all captures and performing some operation on them? I'm conscious of the need to balance power with simplicity.

Your thoughts on these welcome - I will sleep on them.
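For reference, the difference between the options in point 2 can be sketched as follows. This is an illustrative Python snippet, not quickscrape code; the sample text and pattern are made up.

```python
import re

# Made-up sample text and pattern, purely for illustration.
text = "Received: 2014-06-01; Accepted: 2014-06-20"
pattern = r"(\d{4})-(\d{2})-(\d{2})"

first_match = re.search(pattern, text)

# Option A: only the first capture group of the first match.
first_capture = first_match.group(1) if first_match else None

# Option B: every capture group of the first match.
all_groups = first_match.groups() if first_match else ()

# Option C: every match in the text, each as a tuple of its capture groups.
every_match = re.findall(pattern, text)

# An example "operation" over all captures: rebuild each date string.
dates = ["-".join(groups) for groups in every_match]

print(first_capture)  # 2014
print(all_groups)     # ('2014', '06', '01')
print(dates)          # ['2014-06-01', '2014-06-20']
```

Option A is the simplest to specify in a scraper definition; options B and C need some way to say what happens to the extra values, which is where the power-versus-simplicity trade-off comes in.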

petermr commented 10 years ago

On Sun, Jun 22, 2014 at 12:52 AM, Richard Smith-Unna <notifications@github.com> wrote:

> Thanks for this Ianthe. Your example makes it clear that we should have something like this, and I think having a regex component is probably a sensible route to take.

+1. I am developing something similar for extracting entity values.

> I want to examine a couple of things first.
>
>   1. The name. Whatever the feature is called, it has to go into the developing ScraperJSON standard. patternProperties is quite long and, to my mind, isn't explicit about what it does. I would prefer either:
>     • capture
>     • regex
>   2. In your code we extract only the first capture. Is this enough flexibility? Should we allow extracting all captures and performing some operation on them? I'm conscious of the need to balance power with simplicity.

I agree that in principle we need to capture everything that fits a regex. It may be valuable to report these all, or sometimes just the count. It is also conceivable that in later developments there will be more than one capture group in which case we may need named capture groups (not universal in all regex implementations).
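As a rough illustration of reporting every match, reporting just the count, and using named capture groups, here is a short Python sketch. The sample sentence, the pattern, and the group names `value` and `unit` are invented for the example. The `(?P<name>...)` syntax is Python's; other regex implementations spell named groups differently or lack them.

```python
import re

# Invented sample text; the group names "value" and "unit" are assumptions.
text = "Doses of 5 mg and 20 mg were compared with 0.5 g."
pattern = re.compile(r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>mg|g)\b")

# Report every match, with each capture labelled by its group name.
matches = [m.groupdict() for m in pattern.finditer(text)]

# Or report only how many times the pattern matched.
count = len(matches)

for entry in matches:
    print(entry["value"], entry["unit"])
print(count)  # 3
```

Named groups keep the output self-describing once a pattern has more than one capture, at the cost of tying the scraper definition to regex engines that support them.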

Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069