ContentMine / journal-scrapers

Journal scraper definitions for the ContentMine framework

Scraper design spec #3

Closed (blahah closed 10 years ago)

blahah commented 10 years ago

I think the following are crucial elements for a successful scraper system:

petermr commented 10 years ago

Thanks, Richard

Sounds great.

On Wed, Apr 30, 2014 at 11:26 AM, Richard Smith-Unna <notifications@github.com> wrote:

I think the following are crucial elements for a successful scraper system:

  • Scrapers are declaratively defined in a data format that is technology agnostic, e.g. JSON, YAML, XML

Yes. The declarative nature means it can be developed outside the deployment system, documented, searched, compared, etc.
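To make the declarative idea concrete, here is a minimal sketch in Python of what a JSON scraper definition and its loader might look like. The layout (the `url`, `elements`, `selector`, and `attribute` keys) and the `load_definition` helper are assumptions made for illustration; no format is fixed in this thread.

```python
import json

# Hypothetical scraper definition. The key names are illustrative
# assumptions, not a format agreed in this discussion.
DEFINITION = """
{
  "url": "example.org/articles/",
  "elements": {
    "title": {"selector": "//h1[@class='title']"},
    "pdf_link": {"selector": "//a[@class='pdf']", "attribute": "href"}
  }
}
"""


def load_definition(text):
    """Parse a definition and check that every element declares a selector."""
    definition = json.loads(text)
    for name, spec in definition["elements"].items():
        if "selector" not in spec:
            raise ValueError(f"element {name!r} has no selector")
    return definition


definition = load_definition(DEFINITION)
print(sorted(definition["elements"]))
```

Because the definition is plain data, it can be validated, diffed, and searched without running any scraping engine.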

  • The scraping technology supports xpath/css selectors, post-processing extractions, and combining captures in complex constructions.

Yes. What does "combining captures in complex constructions" mean?
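One possible reading of "combining captures", offered here only as an assumption and not as the original author's stated meaning, is grouping several related captures into one structured result, such as a name and an affiliation per author. The sketch below uses Python's standard-library `xml.etree.ElementTree`, which supports only a limited XPath subset and stands in for a real scraping engine; the sample page and class names are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample page; well-formed so the stdlib XML parser accepts it.
PAGE = """
<html><body>
  <h1 class="title">  A study of scrapers  </h1>
  <div class="author"><span class="name">A. Author</span><span class="aff">Univ. X</span></div>
  <div class="author"><span class="name">B. Writer</span><span class="aff">Inst. Y</span></div>
  <p class="doi">doi: 10.1234/abcd.5678</p>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# Simple capture: select one node and trim its text.
title = root.find(".//h1[@class='title']").text.strip()

# Post-processing: extract the DOI from the surrounding text with a regex.
doi_text = root.find(".//p[@class='doi']").text
doi = re.search(r"10\.\d{4,9}/\S+", doi_text).group(0)

# Combined capture: group the name and affiliation found inside each author node.
authors = [
    {
        "name": div.find("span[@class='name']").text,
        "affiliation": div.find("span[@class='aff']").text,
    }
    for div in root.findall(".//div[@class='author']")
]

print(title, doi, authors)
```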

  • Community members can define new scrapers using a browser-based GUI that exposes all the power of the underlying technology and automatically submits defined scrapers for review before inclusion in the repository.

Sounds great.

  • We have automated testing of scraper pull-requests through Travis, using a random subsample from a predefined set of test pages.

Travis?
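The "random subsample from a predefined set of test pages" could be sketched as follows. This is a minimal Python illustration under stated assumptions: the placeholder URLs and the `pick_test_pages` helper are hypothetical, and a fixed seed is used so that a failing CI run can be reproduced.

```python
import random

# Hypothetical predefined set of test pages (placeholder URLs).
TEST_PAGES = [f"https://example.org/article/{i}" for i in range(1, 21)]


def pick_test_pages(pages, k, seed=None):
    """Return k distinct pages chosen at random; a seed makes the choice repeatable."""
    rng = random.Random(seed)
    return rng.sample(pages, min(k, len(pages)))


subset = pick_test_pages(TEST_PAGES, k=5, seed=42)
print(subset)
```

Logging the seed with each CI run would let a reviewer rerun exactly the same pages when a scraper pull request fails.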

  • The scraper is capable of browsing with sessions, cookies, graphical and JS rendering, and populating dynamic content, as well as waiting for and triggering on these events. Realistically, I think running in a headless browser is the only way to do this.

Happy to agree with you. I am looking at a list of headless browsers here: https://gist.github.com/evandrix/3694955

Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dept. of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069