Scraper design spec - GitHub issues

I think the following are crucial elements for a successful scraper system:

Scrapers are declaratively defined in a data format that is technology agnostic, e.g. JSON, YAML, XML
The scraping technology supports xpath/css selectors, post-processing extractions, and combining captures in complex constructions.
Community members can define new scrapers using a browser-based GUI that exposes all the power of the underlying technology and automatically submits defined scrapers for review before inclusion in the repository.
We have automated testing of scraper pull-requests through Travis, using a random subsample from a predefined set of test pages.
The scraper is capable of browsing with sessions, cookies, graphical and JS rendering, and populating dynamic content, as well as waiting for and triggering on these events. Realistically, I think running in a headless browser is the only way to do this.
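To make the first two points concrete, here is a rough sketch of what a declaratively defined scraper could look like. The JSON field names (`elements`, `selector`, `attribute`, `regex`), the `run_scraper` function, the sample page, and the DOI in it are all made up for illustration; none of this is an agreed format. It uses only the Python standard library, so the selectors are limited to the XPath subset that `xml.etree.ElementTree` supports and the page must be well-formed XML. A real implementation would need a proper HTML parser and full XPath/CSS support.

```python
import json
import re
import xml.etree.ElementTree as ET

# Hypothetical scraper definition: plain data, no code.
# Each element has a selector, plus optional post-processing steps.
SCRAPER_JSON = r"""
{
  "name": "example-journal",
  "elements": {
    "title": {"selector": ".//h1[@class='title']"},
    "doi": {"selector": ".//meta[@name='citation_doi']", "attribute": "content"},
    "year": {"selector": ".//span[@class='date']", "regex": "(\\d{4})"}
  }
}
"""

# Made-up, well-formed sample page (the DOI is not a real one).
SAMPLE_PAGE = """<html>
<head><meta name="citation_doi" content="10.1234/example.5678"/></head>
<body>
<h1 class="title">An Example Article</h1>
<span class="date">Published 30 April 2014</span>
</body>
</html>"""


def run_scraper(definition, page_source):
    """Apply every element rule in the definition to one page."""
    root = ET.fromstring(page_source)
    results = {}
    for name, rule in definition["elements"].items():
        values = []
        for node in root.findall(rule["selector"]):
            # Capture an attribute if one is named, else the text content.
            if "attribute" in rule:
                text = node.get(rule["attribute"], "")
            else:
                text = "".join(node.itertext()).strip()
            # Optional post-processing: keep only the first regex group.
            if "regex" in rule:
                match = re.search(rule["regex"], text)
                if match is None:
                    continue
                text = match.group(1)
            values.append(text)
        results[name] = values
    return results


if __name__ == "__main__":
    definition = json.loads(SCRAPER_JSON)
    print(json.dumps(run_scraper(definition, SAMPLE_PAGE), indent=2))
```

Because the definition is just data, the same file could in principle be read by a scraper written in any language, which is the point of keeping the format technology agnostic.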

Thanks, Richard

Sounds great.

On Wed, Apr 30, 2014 at 11:26 AM, Richard Smith-Unna <notifications@github.com> wrote:

> I think the following are crucial elements for a successful scraper system:
>
> Scrapers are declaratively defined in a data format that is technology agnostic, e.g. JSON, YAML, XML

Yes. The declarative nature means it can be developed outside the deployment system, documented, searched, compared, etc.

> The scraping technology supports xpath/css selectors, post-processing extractions, and combining captures in complex constructions.

Yes. What does "combining captures in complex constructions" mean?

> Community members can define new scrapers using a browser-based GUI that exposes all the power of the underlying technology and automatically submits defined scrapers for review before inclusion in the repository.

Sounds great

> We have automated testing of scraper pull-requests through Travis, using a random subsample from a predefined set of test pages.

Travis?

> The scraper is capable of browsing with sessions, cookies, graphical and JS rendering, and populating dynamic content, as well as waiting for and triggering on these events. Realistically, I think running in a headless browser is the only way to do this.

Happy to agree with you. Looking at a list here: https://gist.github.com/evandrix/3694955

Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dept. of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

ContentMine / journal-scrapers

Scraper design spec #3