ContentMine / journal-scrapers

Journal scraper definitions for the ContentMine framework

Evaluate scraping frameworks and decide whether we need our own #2

Closed blahah closed 10 years ago

blahah commented 10 years ago

Key to the mission is building a content mining system that can be sustained by a community of volunteers. To maximise sustainability we should choose or design the scraping system so that the barrier to becoming a contributor is as low as practically possible.

Existing scrapers fall into the following categories:

  • Use a structured scraping framework but with each definition implemented in arbitrary code conforming to a basic protocol. Example: Zotero translators https://github.com/zotero/translators.
  • Use a Domain Specific Language to create definitions. Example: scrapi https://github.com/assaf/scrapi.
  • Learn the definition from user-provided examples of paired documents and extracted datasets. Example: scrapely https://github.com/scrapy/scrapely.
  • Use declarative, structured definitions in a data format like JSON. Example: libKrake https://github.com/KrakeIO/libkrake.

petermr commented 10 years ago

This is a great collection...

I haven't looked in detail but in principle am happy to build on these rather than invent something from scratch. I am a believer in DSLs. They make the same distinction we do between crawler and scraper, so we can agree on the terminology.

On Sun, Apr 27, 2014 at 6:59 PM, Richard Smith-Unna <notifications@github.com> wrote:

Key to the mission is building a content mining system that can be sustained by a community of volunteers. To maximise sustainability we should choose or design the scraping system so that the barrier to becoming a contributor is as low as practically possible.

Existing scrapers fall into the following categories:

  • Use a structured scraping framework but with each definition implemented in arbitrary code conforming to a basic protocol. Example: Zotero translators https://github.com/zotero/translators.
  • Use a Domain Specific Language to create definitions. Example: scrapi https://github.com/assaf/scrapi.
  • Learn the definition from user-provided examples of paired documents and extracted datasets. Example: scrapely https://github.com/scrapy/scrapely.
  • Use declarative, structured definitions in a data format like JSON. Example: libKrake https://github.com/KrakeIO/libkrake.

— Reply to this email directly or view it on GitHub: https://github.com/ContentMine/journal-scrapers/issues/2.

Peter Murray-Rust, Reader in Molecular Informatics, Unilever Centre, Dept. of Chemistry, University of Cambridge, CB2 1EW, UK. +44-1223-763069
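To make the fourth category above (declarative JSON definitions) concrete, here is a minimal sketch of such a definition together with a tiny engine that applies it. The field names and regex-based capture rules are illustrative assumptions only; real frameworks like libKrake use structured selectors (CSS/XPath) rather than regexes.

```python
import json
import re

# Illustrative declarative definition: field name -> capture rule.
# (Hypothetical format; real frameworks use CSS/XPath selectors.)
definition = json.loads("""
{
  "title": {"pattern": "<meta name=\\"citation_title\\" content=\\"([^\\"]+)\\""},
  "doi":   {"pattern": "<meta name=\\"citation_doi\\" content=\\"([^\\"]+)\\""}
}
""")

def scrape(html, definition):
    """Apply each rule in the definition to the page, returning captured fields."""
    result = {}
    for field, rule in definition.items():
        match = re.search(rule["pattern"], html)
        if match:
            result[field] = match.group(1)
    return result

page = '''<html><head>
<meta name="citation_title" content="An Example Paper">
<meta name="citation_doi" content="10.1234/example">
</head></html>'''

print(scrape(page, definition))
# {'title': 'An Example Paper', 'doi': '10.1234/example'}
```

Because the definition is plain data, non-programmers can write it and other tools can generate, validate, or transform it, which is the property that matters for a volunteer community.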

blahah commented 10 years ago

OK I've been testing the examples listed above and some other options. Below I've evaluated them based on features I think will be important to both technical and community success. I'll flesh out this table as I complete more tests, and please chip in with other features I should look for or options I should test.

| | Kimono Labs | Zotero translators | libKrake | Krake Chrome Extension | Scrapy | Scrapely | Portia |
| --- | --- | --- | --- | --- | --- | --- | --- |
| No contributor programming required | :white_check_mark: | :x: | :white_check_mark: | :white_check_mark: | :x: | :white_check_mark: | :white_check_mark: |
| Can postprocess captured elements (e.g. regex) | :white_check_mark: | :white_check_mark: | :white_check_mark: | :x: | :white_check_mark: | :white_check_mark: | |
| Follows a structured schema | :white_check_mark: | :x: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | |
| Allows capturing invisible elements | :x: | :white_check_mark: | :white_check_mark: | :x: | :white_check_mark: | :white_check_mark: | |
| Can define through a GUI | :white_check_mark: | :x: | :x: | :white_check_mark: | :x: | :x: | :white_check_mark: |
| Allows capturing complex elements | :x: | :white_check_mark: | :white_check_mark: | :x: | :white_check_mark: | :x: | |
| Accurate | :x: | | | | | | |
| Robust | :x: | | | | | | |
| FLOSS | :x: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
blahah commented 10 years ago

Based on the above, and pending completion of my accuracy/robustness testing, I think we can narrow our options down to the following:

Unfortunately there's no existing system that I can find that we could use without extension.

Of the three options laid out above, I think Krake as a basis for modification gets us furthest towards our goals.

blahah commented 10 years ago

This is done - we ended up needing our own. Krake was really nice and their approach inspired the direction I went in, but ultimately I felt that nothing pre-existing was powerful enough for what we wanted. I've created our own tools for this, e.g. https://github.com/ContentMine/quickscrape.

kanzure commented 8 years ago

Zotero translators can be repurposed. The way I did this for paperbot (https://github.com/kanzure/paperbot) was through a local "headless" Zotero server, which provides a simple API for executing the translators (scrapers). The advantage of this technique is that Zotero's hundreds of scrapers are constantly maintained by a great number of people.
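For anyone wanting to try this route, a request to a locally running Zotero translation server can be sketched as below. This assumes the zotero/translation-server API shape (POST /web with a text/plain URL body, default port 1969, returning JSON item metadata); the headless server paperbot used may have exposed a different interface, so treat the endpoint and port as assumptions.

```python
import json
import urllib.request

def build_translation_request(url, server="http://127.0.0.1:1969"):
    """Build a POST /web request asking the translation server to scrape `url`.

    Endpoint shape assumed from zotero/translation-server; adjust if your
    local server differs."""
    return urllib.request.Request(
        server + "/web",
        data=url.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )

def translate(url):
    """Send the request and return parsed item metadata (needs a running server)."""
    req = build_translation_request(url)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Inspect the request without needing a server up:
req = build_translation_request("https://example.org/some-paper")
print(req.get_method(), req.full_url)  # POST http://127.0.0.1:1969/web
```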

blahah commented 8 years ago

Indeed, and I considered this, but unfortunately the Zotero scrapers don't cover all the information we need. We would have to edit all the scrapers in the collection. What we've ended up with is a declarative framework that is much more powerful than the Zotero translators. Because it's declarative, we can build tools on top of it, for example tools that allow non-programmers to easily build scrapers (we have this currently in development).
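One concrete benefit of declarative definitions is that a tool can check a definition's structure without ever running it against a live site, which is what makes scraper-building tools for non-programmers feasible. A minimal validator sketch follows; the required fields ("url", "elements", "selector") are assumptions for illustration, not the actual quickscrape schema.

```python
def validate_definition(definition):
    """Return a list of problems with a declarative scraper definition.

    Expected shape (assumed for illustration, not quickscrape's real schema):
    {"url": <regex matching target sites>, "elements": {name: {"selector": ...}}}
    """
    problems = []
    if "url" not in definition:
        problems.append("missing 'url' pattern")
    elements = definition.get("elements")
    if not isinstance(elements, dict) or not elements:
        problems.append("'elements' must be a non-empty mapping")
        return problems
    for name, rule in elements.items():
        if "selector" not in rule:
            problems.append(f"element '{name}' has no selector")
    return problems

good = {"url": r"journal\.example\.org", "elements": {"title": {"selector": "//h1"}}}
bad = {"elements": {"title": {}}}
print(validate_definition(good))  # []
print(validate_definition(bad))   # ["missing 'url' pattern", "element 'title' has no selector"]
```

An arbitrary-code translator (the Zotero model) cannot be checked this way without executing it.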

Also worth noting: our tool getpapers, which uses various APIs to source papers and their associated (meta)data, already covers a lot of ground. Support for the CrossRef API is in the todo queue; with that, getpapers will be able to get most papers.
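The CrossRef REST API mentioned here is public and keyless; a minimal query against its /works route can be sketched as follows (the search string is just an example):

```python
import json
import urllib.parse
import urllib.request

def crossref_query_url(query, rows=5):
    """Build a CrossRef /works search URL for a free-text bibliographic query."""
    params = urllib.parse.urlencode({"query.bibliographic": query, "rows": rows})
    return "https://api.crossref.org/works?" + params

def search_crossref(query, rows=5):
    """Fetch matching works from CrossRef (requires network access)."""
    with urllib.request.urlopen(crossref_query_url(query, rows)) as resp:
        message = json.loads(resp.read().decode("utf-8"))["message"]
    # Each item carries a DOI, title, and full-text links that a
    # downloader like getpapers can follow.
    return [(item.get("DOI"), item.get("title")) for item in message["items"]]

print(crossref_query_url("content mining"))
# https://api.crossref.org/works?query.bibliographic=content+mining&rows=5
```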

btw paperbot is very nice!

kanzure commented 8 years ago

I think that the Zotero team could be convinced to switch to a different format if you can show that the alternative scrapers are just as comprehensive (if not more so).

Yeah, there's a large corpus of data from multiple years of active paperbot use; dunno if that would be useful to you. Mostly the available data takes the form of a "requested paper url" (some abstract URL) followed by paperbot either succeeding and returning a PDF, or failing and returning the HTML of a page it couldn't parse at the time. It might be useful as unit-test data fodder, or for future studies: perhaps interesting to see the evolution of terrifying HTML and JavaScript on publisher sites.
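A corpus like that maps naturally onto regression fixtures: pairs of (url, saved html) that a scraper is replayed against, so changes or gaps in publisher markup show up as failing cases. A minimal sketch, where the fixture layout and the `scrape` signature are hypothetical, not paperbot's actual storage format:

```python
import re

def replay(fixtures, scrape):
    """Run `scrape(html)` over each saved page; return URLs whose parse fails."""
    failures = []
    for url, html in fixtures.items():
        try:
            result = scrape(html)
        except Exception:
            result = None
        if not result or "title" not in result:
            failures.append(url)
    return failures

# Toy fixtures standing in for the saved paperbot corpus.
fixtures = {
    "https://publisher.example/ok": '<meta name="citation_title" content="A Paper">',
    "https://publisher.example/broken": "<html>login wall</html>",
}

def naive_scrape(html):
    """Hypothetical scraper: succeed only if a citation_title meta tag is present."""
    m = re.search(r'citation_title" content="([^"]+)"', html)
    return {"title": m.group(1)} if m else {}

print(replay(fixtures, naive_scrape))  # ['https://publisher.example/broken']
```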

blahah commented 8 years ago

Good to know about Zotero - once our collection is at their level, we will explore that.

Is the paperbot corpus all open access stuff?

Looking at changes in publisher websites over time could be very interesting.

kanzure commented 8 years ago

Mixed access, some open-access.