ContentMine / journal-scrapers

Journal scraper definitions for the ContentMine framework

Added Psychological Science scraper #40

Closed: chartgerink closed this pull request 8 years ago

chartgerink commented 9 years ago

Hi,

I attempted to write my first scraper, following your scraperJSON template, and succeeded for the most part. I have also included test links. I tried to scrape as much information as possible, and list some of the problems I ran into below, FYI.

Kind regards, Chris

  1. The Introduction is not a defined section; it just contains numbered paragraphs (a SAGE thing...).
  2. Supplementary materials are hosted at a separate location AND bundle all files for an entire issue. I have not discovered an easy way to download these (also a SAGE thing...).
  3. I have not yet succeeded in downloading figures and tables.
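
For reference, a scraperJSON definition of the kind this PR adds has the following general shape. This is a minimal, illustrative sketch rather than the actual file in the PR, and since JSON permits no comments the caveats live here: the selectors assume pss.sagepub.com exposes the usual HighWire-style citation_* meta tags, which may not match the real page.

```json
{
  "url": "pss\\.sagepub\\.com",
  "elements": {
    "title": {
      "selector": "//meta[@name='citation_title']",
      "attribute": "content"
    },
    "fulltext_pdf": {
      "selector": "//meta[@name='citation_pdf_url']",
      "attribute": "content",
      "download": true
    }
  }
}
```

The `url` field is a regex matched against the page URL to pick the right scraper; each element pairs an XPath selector with the attribute to capture, and `"download": true` tells the scraper to fetch the resource as well.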
petermr commented 9 years ago

Many thanks Chris,

  1. Well done - it is very exciting to have people like you developing things and we can share our experience and technology.
  2. I am actively looking at paragraph numbers, but they aren't yet ready for alpha.
  3. This looks messy. We may have to download and split them.
  4. Are these downloadable manually? I.e., could we have a "follow" strategy?
  5. What format do you work with? HTML or PDF? (I assume not XML.)
chartgerink commented 9 years ago

I typically work with HTML files.

Wrt the figures and tables: these are downloadable, but I am unsure how to incorporate this. For example, if a paper has 3 figures, the link to each image is the scraping URL with `/F#.large.jpg` appended for figures or `/T#.large.jpg` for tables. So if we can find a way to identify the number of figures and tables in a paper and feed that into a download link, I think that would work. I don't know how to do this in JSON at the moment.

(note: the tables are provided as images...)
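
As far as I know, scraperJSON cannot synthesize `/F1.large.jpg` through `/Fn.large.jpg` URLs from a figure count. But if the article page itself links to the large images, one option is to select those hrefs directly and mark them for download. A hedged sketch of such an element, which would slot into the `elements` object of the definition; the XPath is a guess at the HighWire markup, not verified against a real page, and JSON allows no comments, so that caveat stays here:

```json
"figure": {
  "selector": "//a[contains(@href, '.large.jpg')]",
  "attribute": "href",
  "download": true
}
```

If the page only links the thumbnails, a "follow" step to each figure page would be needed instead.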


petermr commented 9 years ago

Thanks Chris,

> I typically work with HTML files.

The quality of HTML is very variable. It ranges from perfect (well-formed) XHTML to stuff full of Javascript and adverts. The general procedure is to scrape first and then clean/normalize the result.

> [...] So, if we can find a way to identify the number of Figs and tables in a paper and feed that into a download link, that would work I think. I don't know how to do this in JSON at the moment.

We'll need Richard to comment.

> (note: the tables are provided as images...)

ARGGH.

Things are still possible here. If you are really interested and are happy to contribute some hacking, we have an alpha framework which can help.

It will help if you can show some examples. Can you find an Open Access example? If not, let us know a typical page.

petermr commented 9 years ago

Chris, would you be happy to move this very important discussion to our DISCUS platform, since it could be seen by many more people there?

chartgerink commented 9 years ago

Yes, the cleaning/normalizing stage was next on my list; I thought I'd familiarize myself with scraping first. By the DISCUS platform, do you mean the Google Group listserv, by the way?

An OA URL for Psych. Science is http://pss.sagepub.com/content/early/2015/07/16/0956797615588467.full

I am most definitely interested in hacking something; full disclosure: my programming skills are limited. I am very willing to learn, but it might take me a while.


petermr commented 9 years ago

Chris,

The PDF itself is very tractable! The figures are drawn with vectors and can therefore, with modest work on our part, be automatically turned into tables. We can extract the points from the scatterplot. The tables use characters and so should be tractable, though probably requiring a SAGE-specific template.

This paper is a really exciting example of something that can be largely read automatically. (The additional figures and tables are actually worse than the PDF.)

> I am most definitely interested in hacking something; full disclosure: my programming skills are limited. I am very willing to learn, but it might take me a while.

Your R skills are more than enough. We have an Open platform where your contribution will be to adjust parameters and provide templates, not write hairy code.

Would you be interested in being the focus of a sub-community based around extracting data for psychology? It would be really valuable and I suspect would bring in collaborators. Since you are already well experienced in the R community, there are many principles you will already have acquired. The main attribute is, of course, commitment. We can help with the technology and general social tools. If so, I suggest you mail me and we'll continue offline.

P.

GrahamSteel commented 9 years ago

Chris,

I have set up a page on the DISCUS platform for this: http://discuss.contentmine.org/t/how-to-scrape-process-various-publishers/48

tarrow commented 8 years ago

I rewrote this into #44 with a rebase, so that it could be merged to master without bringing in the 176 commits.

chartgerink commented 8 years ago

I did some additional checking of the scrapers. I removed tf.json because it conflicted with taylorfrancis.json (which is a clearer filename, I think) and because taylorfrancis.json performed better. I incorporated the code from tf.json for the tables. TaylorFrancis really acts oddly, so we need to check that at some point.

I also checked and updated the wiley, sage, springer, and elsevier scrapers. Elsevier pages contain almost no metadata, so that scraper only uses HTML and PDF extraction. I also incorporated some changes, but they eliminated a lot of metadata scraping and renamed elements. Are we still adhering to the scraperJSON standard, or did that become a thing of the past?

Sorry for the extent of commits, I forgot about this. I can also create a new fork to make things easier and do a new PR. Let me know.

tarrow commented 8 years ago

We're still adhering to the scraperJSON standard, not least because QS and thresher are the reference implementations: basically, if it works with them, it's scraperJSON :).

If you could create a new branch from the current origin/master and cherry-pick over these changes you've just made, that would be awesome! Otherwise I can do that and make another PR. Let me know if you have problems.
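
For completeness, a sketch of that workflow; the branch name and SHAs below are placeholders, not real refs from this repo:

```sh
git fetch origin                            # get the current origin/master
git checkout -b pss-cleanup origin/master   # hypothetical branch name
git cherry-pick <sha1> <sha2>               # SHAs of the commits to carry over
git push origin pss-cleanup                 # then open a fresh PR from this branch
```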

tarrow commented 8 years ago

Great! I'll merge now! :)