ContentMine / scraperJSON

The scraperJSON standard for defining web scrapers as JSON objects
Creative Commons Zero v1.0 Universal

Create example scrapers with example results #10

Open klartext opened 9 years ago

klartext commented 9 years ago

It would be nice to have real-world example JSON files together with the directory/file collection that is created by running a scraper with a given scraperJSON file.

That would be helpful for implementing a scraper that follows the scraperJSON scheme/policy.

A .zip or .tgz file of the results (or the JSON file plus results) would make sense as examples, IMHO.

petermr commented 9 years ago

Chris (copied) has written a tutorial on this, which we released last week, and it should provide what you want. Chris, can you point to this and see if it's what is wanted? thx


blahah commented 9 years ago

@klartext

blahah commented 9 years ago

Ah, I now realise that you perhaps mean that the example scrapers should come with example results - is that the case? If so, that's an excellent idea and I will make it a priority.

klartext commented 9 years ago

Yes, I meant examples that have three things:

  • json-file
  • realworld example (what paper is downloaded and how - e.g. example data that is input via file/stdin/cli/gui to select the papers that should be downloaded)
  • realworld results in the form of an example directory with downloaded content

Just from the json-files it's not clear how to interpret them. I see a lot of selector-strings like "//meta[@name='citation_publisher']", but how is that used? Is this an OPTION to select via cli/gui/... or what does it mean? So there is some ambiguity in interpreting the json files. A real-world example would help (e.g. element/selector foobar is used with cli-switch foobar ???, and search keywords are e.g. "horizontal._gene._transfer" or so, and this results in a dir with a pdf....)

petermr commented 9 years ago

@klartext

Our recent workshop is covered in:

https://github.com/ContentMine/workshop-resources/

You will find many of your questions addressed there. I suggest you work through the getpapers and scraper tutorials and let us know if what you want is not there.


klartext commented 9 years ago

@petermr OK, I read the "getpapers-tutorial.md" from the workshop-resources. It explained a lot and also answered some questions.

But where does scraperJSON fit in? That is not explained, and I could not find json-files for getpapers. Is scraperJSON just a new idea for newer/planned scrapers? Or is it used somewhere already? And if so: where, and how should it be interpreted?

blahah commented 9 years ago

@klartext yes, I think we need some overview documentation.

getpapers and quickscrape are both tools you can use to get scientific papers en masse for content mining.

getpapers allows you to search for papers on EuropePubMed, ArXiv or IEEE. You get metadata of all the hits to your query. You can optionally also try to download PDF, XML and/or supplementary data for the hits, but not all papers are downloadable this way.

quickscrape is a web scraping tool. You give it a URL and a scraper definition (in scraperJSON format) and it will scrape the URL using the scraper definition to guide it. We have a collection of scraperJSON definitions for major publishers and journals over at journal-scrapers. quickscrape is useful when you want to download things that (a) were in your getpapers results but getpapers couldn't get the PDF/XML/supp or (b) are not contained in the source databases that getpapers uses.

So, getpapers can get data, very fast, from a subset of the literature. quickscrape can get the same data; it takes more work, but you can theoretically use it on any page on the internet.
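A minimal sketch of how the two tools might be chained on the command line (the query, URLs, file paths and exact flags are illustrative assumptions, not verified commands; check each tool's --help for the options your installed version supports):

```bash
# Search Europe PubMed Central with getpapers; keep metadata and any PDFs it
# can fetch directly. (Query, directory names and flags are placeholders.)
getpapers --query 'horizontal gene transfer' --outdir hgt-papers --pdf

# For an article getpapers could not download, scrape its landing page with
# quickscrape, guided by a scraperJSON definition from journal-scrapers.
# (URL and scraper path are placeholders.)
quickscrape --url https://example.org/some-article \
            --scraper journal-scrapers/scrapers/some_journal.json \
            --output hgt-scraped
```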

petermr commented 9 years ago

@klartext as Richard says you need to read about quickscrape. The workshop tutorial is at https://github.com/ContentMine/workshop-resources/tree/master/software-tutorials/quickscrape which should give a reasonable introduction to quickscrape and the format of scrapers.


klartext commented 9 years ago

@petermr: OK, I read the quickscrape tutorial. It explains how it works. But how do I create / interpret the scraperJSON files? (Should I print that question in a loop?) The link to "create your own definitions" gives me a 404 error. The link to ctree also gives a 404 error.

blahah commented 9 years ago

@klartext how about this: https://github.com/ContentMine/ebi_workshop_20141006/tree/master/sessions/6_scrapers

It's the session on scrapers that I wrote for a workshop last year. It includes guides on creating selectors, as well as basic and advanced scraperJSON.

We need to update some of these resources as there are now more features available, but that should get you started.

petermr commented 9 years ago

@klartext, thanks for this. Your engagement helps to drive our documentation and also shows up places where we need to develop software.

"But how do I create / interpret the scraperJSON files?" Probably a trivial point: we expect people to use a text editor, probably starting with a generic scraper template/example. Not sure whether it's worth developing a specific tool.

If you aren't familiar with XPath see https://en.wikipedia.org/wiki/XPath. (We use version 1.0). There are also many online tutorials and some are interactive.
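For illustration, a generic template along those lines might look like the sketch below. It is assembled from the scraperJSON README rather than taken from any real journal scraper; the URL regex, element names and the download flag are placeholder assumptions to adapt.

```json
{
  "url": "example\\.org",
  "elements": {
    "title": {
      "selector": "//meta[@name='citation_title']",
      "attribute": "content"
    },
    "fulltext_pdf": {
      "selector": "//meta[@name='citation_pdf_url']",
      "attribute": "content",
      "download": true
    }
  }
}
```

Here the url value is a regex the target page URL is matched against, and each element pairs an XPath selector with the attribute to capture.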


klartext commented 9 years ago

@blahah thanks, that text did help a lot. I did not know the XPATH stuff in detail, so the selector syntax looked strange to me. After reading that, I saw what the definition is about. So the selector selects tags/data from html pages, and what is selected is written in XPATH syntax. Some other things could be explained in the scraperJSON docs: for example (from the example in the scraperJSON doc) "attribute": "content". I guess this means the part between the opening and closing tag, like "ThisStuff" in <sometag>ThisStuff</sometag>, would be the "content" here...
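(A note on that last point, based on the scraperJSON README rather than on anything confirmed in this thread: "attribute" names which attribute of the matched node to capture, and "text" is used for the text between opening and closing tags. So for a page containing <meta name="citation_publisher" content="Example Press">, an element like the sketch below - the element name publisher is an invented label - would capture the string "Example Press" from the content attribute:)

```json
{
  "elements": {
    "publisher": {
      "selector": "//meta[@name='citation_publisher']",
      "attribute": "content"
    }
  }
}
```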

klartext commented 9 years ago

@petermr regarding the documentation: some links go to nirvana, and some pics are not available. Also, some *.md files are just empty. Regarding the need for software: what kind of software do you mean? What new software need did my comments show up? Isn't the ContentMine stuff already a working collection of tools?

BTW: I saw one of your presentations. Very interesting that the documents will also get analysed. This is reverse-engineering of PDFs. Hardcore! :-) I have some PDF material that I would like to analyze that way (the papers/archive from the BCL (Biological Computer Laboratory, https://en.wikipedia.org/wiki/Biological_Computer_Laboratory)). I thought about using tesseract-ocr, but it seems ContentMine already provides a workflow/tools that make it easier.

Probably trivial point: we expect people to use a text editor, probably starting with a generic scraper template/example. Not sure whether it's worth developing a specific tool.

Well, I already have a tool which is quite generic and can achieve what getpapers and quickscrape offer as separate tools (but I have no javascript engine in the background, at least for now). See here: https://github.com/klartext/any-dl

Regarding XPATH: yes, I was not familiar with it; I looked for this Wikipedia-article by myself, after I read Blahah's "02_creating_selectors.md" introduction. Thanks for pointing me there too. You identified the "missing link" ;-)

From the scraperJSON-example: "selector": "//meta[@name='citation_pdf_url']", "attribute": "content",

This, I think, translates into tagselect( "meta"."name"="citation_pdf_url" | data ); of any-dl syntax.

From the "02_creating_selectors.md"-doc: //dl[@class='article-license']//span[@class='license-p'] would translate in any-dl to: tagselect( "dl"."class"="article-license", "span"."class"="license-p" | ... );

where "..." must be substituted with the specification of what to pick out (e.g. data, arg("foobar"), etc.).

Hope this clarifies why I asked about the interpretation of the json-files. But even without having my own tools in mind, I would recommend not only mentioning XPATH in the scraperJSON doc (it is mentioned only once), but also adding the link to the Wikipedia article there.

The best explanation of the json-files was in the "02_creating_selectors.md" text. In the scraperJSON description there are a lot of links to tools, but not a link to "02_creating_selectors.md". At least for me, the links to the tools were a distraction, because when reading about the syntax/format I want to know more about the format itself; which tools use it does not help so much. (But as an explanation of why scraperJSON was developed it may help, so other people may find it useful.) The link to "02_creating_selectors.md", however, would (and did) really help in understanding scraperJSON!

So I recommend adding a link to "https://github.com/ContentMine/ebi_workshop_20141006/blob/master/sessions/6_scrapers/02_creating_selectors.md" to the document "https://github.com/ContentMine/scraperJSON", because that explains how scraperJSON "works". The tools are then examples of how/where scraperJSON is used, but for people who want to understand the format itself, that is secondary, I think. (And I think that holds not only from a programmer's view.)

I hope this possibly too-long answer contributes to enhancing the docs.

P.S.: Real-world examples with results (including e.g. a tar.gz of the result directory), as mentioned at the beginning of the thread, would of course help understanding too. Different people, different ways to learn... Also, such results could be used for testing purposes: a "diff -r" on the directories could check the results of different tools, or of different versions of one tool. Just an idea...
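A sketch of that testing idea, assuming a published reference archive like the one requested at the top of this thread (all file and directory names here are placeholders):

```bash
# Unpack the reference result set shipped with the example scraper,
# re-run the same scrape with the current tool version, and compare.
mkdir -p reference fresh
tar -xzf example-results.tar.gz -C reference
quickscrape --url https://example.org/some-article \
            --scraper path/to/example-scraper.json \
            --output fresh
diff -r reference fresh && echo "output matches the reference results"
```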

klartext commented 9 years ago

Another unclear point: what is done with all those elements? Not all are for download, so where does the information go? Will the scraped information be saved as a json file, as metadata? At first I thought they also had something to do with paper selection.

But now it seems to me that getpapers does paper selection (give it a search query and you get a URL list), while quickscrape uses that list and just downloads the files. As only quickscrape uses the scraperJSON definitions, the paper URLs are already known at that point. So the scraperJSON elements seem not to be used as paper selectors, but are just pieces of information that can be gathered about a paper and can - or will - be saved together with the documents?

An overview doc for orientation would be nice. A graphic could help a lot, IMHO.
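(If that reading is right, the non-download elements end up as scraped metadata stored alongside the downloaded files. A purely hypothetical sketch of such a per-article record follows; the key names and layout are invented for illustration and are not quickscrape's documented output format:)

```json
{
  "title": "An example article title",
  "publisher": "Example Press",
  "license": "CC BY 4.0",
  "fulltext_pdf": "https://example.org/articles/1234/fulltext.pdf"
}
```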

petermr commented 9 years ago

Great - discussions like this are a significant way of taking things forward...

On Mon, Aug 10, 2015 at 11:01 AM, klartext notifications@github.com wrote:

> @petermr regarding the documentation: some links are going to nirvana, and some pics are not available. Also, some *.md files are just empty.

I have copied in Chris Kittel, who oversees the documentation.

> Regarding the need of software: what kind of software do you mean? What did my comments show up as new software-need? Isn't the content-mine stuff already a working collection of tools?

The publisher formats are very variable and in theory we need a new scraper for each one. In practice much of this is normalised. So today we had a need to scrape IJSEM, which is an Ingenta journal. That might require new software, though RSU thinks his is generic enough to cover it. But there is always the chance we may need something new.

> BTW: I saw one of your presentations. Very interesting, that the documents also will get analysed. This is reverse-engineering of pdf's. Hardcore! :-)

Certainly hard work!

> I have some pdf-stuff that I would like to analyze that way (the papers/archive from the BCL (Biological Computer Laboratory, https://en.wikipedia.org/wiki/Biological_Computer_Laboratory)).

This is exciting and valuable but challenging. My guess is that much of it is PDFs of OCR scans. Some of this is probably typewritten (even carbon copy), some may be print (with hot metal). If it is scanned, the results are very variable.

> I thought about using tesseract-ocr, but it seems, ContentMine already provides a workflow/tools to do it easier.

No, we use Tesseract :-). We are about to measure the OCR error rate. For born-digital PDFs I am reckoning ca. 1% character error, BUT these do not suffer from the additional problems that scanned documents have.

So it depends what you want to get from it. I am afraid we normally warn people that this is very adventurous and will take a lot of their time.

> Regarding XPATH: yes, I was not familiar with it; I looked for this Wikipedia-article by myself, after I read Blahah's "02_creating_selectors.md" introduction. Thanks for pointing me there too. You identified the "missing link" ;-)

We have to put that right, @CK

> From the scraperJSON-example: "selector": "//meta[@name='citation_pdf_url']", "attribute": "content",

> This, I think, translates into tagselect( "meta"."name"="citation_pdf_url" | data ); of any-dl syntax.

> From the "02_creating_selectors.md"-doc: //dl[@class='article-license']//span[@class='license-p'] would translate in any-dl to: tagselect( "dl"."class"="article-license", "span"."class"="license-p" | ... );

> where "..." must be substituted with the specification of what to pick out (e.g. data, arg("foobar"), etc.).

> Hope this clarifies, why I asked for interpretation of the json-files. But even without having my own tools in mind, I would recommend, to not only mention XPATH in the scraperJSON-doc (mentioned only once), but also to add the link to the wikipedia-article there.

We need a tutorial ChrisK

> The best explanation of the json-files was in the "02_creating_selectors.md" text.

> In the scraperJSON-description there are a lot of links to tools, but not the link to "02_creating_selectors.md". At least for me, it worked as distraction to have links to the tools, because when reading about the syntax/format, I would like to know more about it; what tools it use does not help so much. (But as explanation, why scrapeJSON was developed, this may help. So other people may find it helpful.) But the link to "02_creating_selectors.md" would (and did) really help in understanding scraperJSON!

> So, I recommend adding a link to "https://github.com/ContentMine/ebi_workshop_20141006/blob/master/sessions/6_scrapers/02_creating_selectors.md" to the document "https://github.com/ContentMine/scraperJSON", because that explains how the scraperJSON "works". The tools then are examples of how/where the scrapeJSON is used. But for people, who wants to understand the format itself, that is secondary, I think. (And I think that is not only from a programmers view.)

> I hope this possibly too-long answer contributes to enhancing the docs.

It's great. We need to know what you and others want and then cater for that.

> P.S.: Realworld-examples with results (including e.g. tar.gz of result-directory), as mentioned in the beginning of the thread, of course would help understanding too. Different people, different ways to learn... ...also such results could be used for testing purposes. A "diff -r" on the directories could be used to check results of different tools, or different versions of one tool. Just as an idea...

Yes. As far as possible we try to have test-driven development, where we check against expected results. Unfortunately this is so dependent on the original source that tests become fragile.


chreman commented 9 years ago

@klartext Just now we released a new tutorial on creating scraper definitions; I hope this covers some of your questions. Feedback is highly appreciated!

petermr commented 9 years ago

Chris Kittel has just posted a draft tutorial on scrapers:

https://github.com/ContentMine/workshop-resources/blob/master/software-tutorials/journal-scrapers/journal-scrapers-tutorial.md

I think it would be very useful for him if you made comments.

P.


klartext commented 9 years ago

Hi,

this tutorial is very good. I wanted to give more detailed feedback, but I am short of time, so I have not read it completely - only up to "Followables".

It's a good starting point, addressing many questions.

When I have time to read the rest of the document, I can give more feedback and send my notes. Some things could be enhanced.

chreman commented 9 years ago

Thank you, we're looking forward to your suggestions.