CenterForOpenScience / scrapi

A data processing pipeline that schedules and runs content harvesters, normalizes their data, and outputs that normalized data to a variety of output streams. This is part of the SHARE project, and will be used to create a free and open dataset of research (meta)data. Data collected can be explored at https://osf.io/share/, and viewed at https://osf.io/api/v1/share/search/. Developer docs can be viewed at https://osf.io/wur56/wiki
Apache License 2.0

Feature/add elife #455

kms6bn closed this pull request 8 years ago

kms6bn commented 8 years ago

@fabianvf wanted to have you take a look at this before you head out tomorrow - any feedback would be appreciated!

Also, right now it only goes to the first page of commits; if you searched over all time there are a total of 3 pages, and the scraper would only read the first one. I'm not sure how to deal with that yet. I also had difficulty getting the license into the correct format.
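
For reference, here is a rough sketch (not part of this PR, names are illustrative) of how the harvester could walk every page of the commits endpoint with `requests` by incrementing the `page` query parameter until an empty page comes back:

    import requests

    COMMITS_URL = 'https://api.github.com/repos/elifesciences/elife-articles/commits'

    def fetch_all_commits(url=COMMITS_URL, per_page=100):
        """Collect commits from every page of the GitHub API response, not just the first."""
        commits, page = [], 1
        while True:
            resp = requests.get(url, params={'per_page': per_page, 'page': page})
            resp.raise_for_status()
            batch = resp.json()
            if not batch:  # an empty page means we have walked past the last one
                break
            commits.extend(batch)
            page += 1
        return commits

Without authentication a loop like this hits GitHub's 60-requests-per-hour limit quickly, which is what the later comments address.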

fabianvf commented 8 years ago

First round of review done :+1: Check out the failing test on Travis as well; it looks like it's just a string-handling error.

kms6bn commented 8 years ago

This script is now updated to read as many pages as there are results. Currently, to fetch multiple pages I have to supply my username and a personal access token (you can only make 60 requests per hour without one, which isn't sufficient for larger time windows, a 6-month query for example).

    $ curl -u username:token https://api.github.com/repos/elifesciences/elife-articles/commits
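
For completeness, a minimal sketch of the same authenticated call made from Python with `requests` basic auth; the environment variable names are placeholders, and authenticating raises GitHub's rate limit from 60 to 5,000 requests per hour:

    import os
    import requests

    # Placeholder env vars; any GitHub username / personal access token pair works here.
    auth = (os.environ['GITHUB_USERNAME'], os.environ['GITHUB_TOKEN'])

    resp = requests.get(
        'https://api.github.com/repos/elifesciences/elife-articles/commits',
        params={'per_page': 100},
        auth=auth,
    )
    resp.raise_for_status()
    print(resp.headers.get('X-RateLimit-Remaining'))  # should now reflect the authenticated quota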

@erinspace you mentioned that you can have this script run before elife is scraped; let me know if you need me to resend the API token information!

kms6bn commented 8 years ago

@erinspace just wanted to mention that you will need to incorporate the script above and the GitHub API key that I emailed you a week or so back; let me know if you have issues!

erinspace commented 8 years ago

@kms6bn yep absolutely, thank you for the reminder! EXCELLENT work on this one, super exciting that it's ready to go live!