Web interface / browser plugin for interactive scraper creation

ContentMine / journal-scrapers

Journal scraper definitions for the ContentMine framework


Web interface / browser plugin for interactive scraper creation #13

Open blahah opened 10 years ago

blahah commented 10 years ago

Take ideas from

PeerLibrary

cc @mitar

mitar commented 10 years ago

So the idea is to use a process similar to the one we use for annotation: let the user highlight parts of a page, then store those highlights as Open Annotation standard targets. But instead of attaching annotations to them, we would use them to extract data.

This could then be integrated with a nice user interface, maybe reusing parts of this feedback tool, or simply Annotator.
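
To make the highlight-to-target idea concrete, here is a minimal sketch in Python. It is not PeerLibrary's or Annotator's actual code: the function name `highlight_to_target` and the dictionary field names are assumptions, loosely modeled on the text position and text quote selectors of the Open Annotation data model, and it works on plain page text rather than a live DOM.

```python
def highlight_to_target(page_text: str, start: int, end: int, context: int = 20) -> dict:
    """Describe the highlighted span page_text[start:end] as a stored target.

    The target keeps both the character offsets and the quoted text with a
    little surrounding context, so the span can be found again later.
    """
    if not 0 <= start < end <= len(page_text):
        raise ValueError("highlight is outside the page text")
    return {
        "position": {"start": start, "end": end},
        "quote": {
            "exact": page_text[start:end],
            "prefix": page_text[max(0, start - context):start],
            "suffix": page_text[end:end + context],
        },
    }


# Example: the user highlights the DOI on a (made-up) article page.
page = "Title: Scraping journals. DOI: 10.1234/abcd. Published 2014."
start = page.index("10.1234/abcd")
target = highlight_to_target(page, start, start + len("10.1234/abcd"), context=5)
print(target)
```

For scraping, the stored target would be replayed against other pages of the same journal instead of being used to anchor a comment.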

mitar commented 10 years ago

So this is also related to the question of how to define scrapers. I would advise using XPath as only one of the available options. I think also storing other information, similar to the Open Annotation standard, would be helpful:

offsets in the page
prefix/suffix + regex to match the content (instead of direct quote as used in annotations)
xpath
DOM path (my addition)
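
A scraper definition that stores several of these options side by side could look roughly like the sketch below. This is only an illustration under assumptions: `ScraperSelector` and `extract` are hypothetical names, not part of the ContentMine scraper format, and the sketch resolves only the prefix/suffix + regex and offset options against plain text. The XPath and DOM path fields are carried along but not evaluated.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScraperSelector:
    """Hypothetical record holding several redundant ways to locate one field."""
    start: Optional[int] = None      # character offsets in the page
    end: Optional[int] = None
    prefix: str = ""                 # literal text expected before the content
    suffix: str = ""                 # literal text expected after the content
    pattern: str = r".+?"            # regex the content itself must match
    xpath: Optional[str] = None      # stored, but not evaluated in this sketch
    dom_path: Optional[str] = None   # stored, but not evaluated in this sketch


def extract(page_text: str, sel: ScraperSelector) -> Optional[str]:
    """Return the matched content, or None if no stored option applies."""
    # 1. Prefer prefix + regex + suffix, which tolerates shifted offsets.
    if sel.prefix or sel.suffix:
        rx = re.compile(
            re.escape(sel.prefix) + "(" + sel.pattern + ")" + re.escape(sel.suffix)
        )
        match = rx.search(page_text)
        if match:
            return match.group(1)
    # 2. Fall back to the raw offsets if they fit inside the page.
    if sel.start is not None and sel.end is not None:
        if 0 <= sel.start < sel.end <= len(page_text):
            return page_text[sel.start:sel.end]
    return None


# One selector applied to two made-up pages with different layouts.
doi = ScraperSelector(prefix="DOI: ", suffix=". ", pattern=r"10\.\d{4,9}/\S+?")
page_a = "Title: A. DOI: 10.1234/abcd. Published 2014."
page_b = "Other layout here. DOI: 10.5555/xyz.9. End."
print(extract(page_a, doi))
print(extract(page_b, doi))
```

Keeping several options in one record means a scraper can fall back to another option when one of them stops matching, for example after a journal changes its page template.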