CenterForOpenScience / scrapi

A data processing pipeline that schedules and runs content harvesters, normalizes their data, and outputs that normalized data to a variety of output streams. This is part of the SHARE project, and will be used to create a free and open dataset of research (meta)data. Data collected can be explored at https://osf.io/share/, and viewed at https://osf.io/api/v1/share/search/. Developer docs can be viewed at https://osf.io/wur56/wiki
Apache License 2.0

Feature/ucar #439

Closed: kms6bn closed 8 years ago

kms6bn commented 8 years ago

Add ucaropensky as a separate request.

fabianvf commented 8 years ago

So I saw at least one identifier listed as doi:opensky.ucar.edu:archives_amsohp. This isn't actually a DOI, as far as I can tell, but you can find the resource it points to here: https://opensky.ucar.edu/islandora/object/archives:amsohp . Might want to check whether this is a pattern.
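If it does turn out to be a pattern, something like the following rough sketch could turn such an identifier into a resolvable URL. It assumes the single example above generalizes (strip the `doi:opensky.ucar.edu:` prefix, then swap the first underscore for a colon), which has not been verified against other records. The function and constant names are hypothetical and not part of scrapi.

```python
# Assumption: pattern inferred from one example identifier only.
OPENSKY_PREFIX = 'doi:opensky.ucar.edu:'
OBJECT_BASE = 'https://opensky.ucar.edu/islandora/object/'


def opensky_identifier_to_url(identifier):
    """Return an object URL for a pseudo-DOI, or None if it does not match."""
    if not identifier.startswith(OPENSKY_PREFIX):
        return None
    local_id = identifier[len(OPENSKY_PREFIX):]
    # 'archives_amsohp' -> namespace 'archives', name 'amsohp'
    namespace, sep, name = local_id.partition('_')
    if not sep or not namespace or not name:
        return None
    return OBJECT_BASE + namespace + ':' + name


print(opensky_identifier_to_url('doi:opensky.ucar.edu:archives_amsohp'))
```

Returning None for anything that does not match keeps real DOIs and other identifier styles untouched, so the caller can decide what to do with them.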

kms6bn commented 8 years ago

Can you clarify what the problem is? Is it missing documents?

fabianvf commented 8 years ago

So, the main problem with this is that there is too much data in the VCR, so the test takes longer than 10 minutes to run and Travis kills the build. I went to look for days with less data, but noticed that on those days there were some documents with no resolvable identifiers, which crashed the harvester. This is a second, separate issue that might require you to look further into how the harvester works and update the schema to account for the alternate way URLs are presented through this API.
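For the second issue, one possible approach is to skip documents that have no resolvable identifier instead of letting them crash the harvest. The sketch below is only an illustration: it assumes each record is a dict with an `identifiers` list and treats anything starting with `http://` or `https://` as resolvable. Both the record shape and the function names are hypothetical, not the actual scrapi schema.

```python
def first_resolvable_uri(identifiers):
    """Return the first http(s) identifier, or None if there is none."""
    for identifier in identifiers:
        candidate = identifier.strip()
        if candidate.startswith(('http://', 'https://')):
            return candidate
    return None


def filter_harvestable(records):
    """Split records into (harvestable, skipped) instead of raising on bad ones."""
    harvestable, skipped = [], []
    for record in records:
        uri = first_resolvable_uri(record.get('identifiers', []))
        if uri is None:
            skipped.append(record)
        else:
            # Copy the record and attach the URI that was picked.
            harvestable.append(dict(record, canonical_uri=uri))
    return harvestable, skipped
```

Keeping the skipped records in a separate list makes it easy to log them and see how common the alternate identifier format is.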

kms6bn commented 8 years ago

There is an issue with the resumption tokens, so this is on hold for now.

fabianvf commented 8 years ago

Can you also tag the JIRA issue with ExternalBlocker?

kms6bn commented 8 years ago

Yep - I added the tag.

erinspace commented 8 years ago

Oh also @kms6bn can you write up a summary of the issue so I can write them to let them know?

kms6bn commented 8 years ago

Issue summary:

"Say I request metadata from November 30th 2015 through December 1st 2015. We are correctly taken to http://opensky.ucar.edu/oai2?verb=ListRecords&metadataPrefix=oai_dc&from=2015-11-30T00:00:00Z&until=2015-12-02T00:00:00Z , which has 31 records. However, once our harvester follows the resumption token, it is sent to http://opensky.ucar.edu/oai2?verb=ListRecords&resumptionToken=1390319813 , which holds a complete list of 28,635 records."

@fabianvf feel free to add!
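Until the tokens are fixed on their end, a client-side guard along these lines might limit the damage from the behaviour summarized above. This is a sketch under assumptions, not the harvester's actual code: `fetch_page` is a hypothetical callable that takes a resumption token (None for the first request) and returns `(records, next_token)`, each record is assumed to carry a `datestamp` string in the `YYYY-MM-DDThh:mm:ssZ` form used in the request URL, and the date bounds are treated as inclusive here.

```python
from datetime import datetime

DATE_FORMAT = '%Y-%m-%dT%H:%M:%SZ'


def in_range(datestamp, start, end):
    """True if the record datestamp falls inside the requested window."""
    stamp = datetime.strptime(datestamp, DATE_FORMAT)
    return start <= stamp <= end


def harvest(fetch_page, start, end, max_pages=50):
    """Follow resumption tokens, keeping only records inside the window.

    max_pages caps how far a token that ignores the date range can
    drag the harvester through the full record list.
    """
    kept, token = [], None
    for _ in range(max_pages):
        records, token = fetch_page(token)
        kept.extend(r for r in records if in_range(r['datestamp'], start, end))
        if not token:
            break
    return kept
```

Filtering by datestamp drops the out-of-window records that the token pulls in, and the page cap keeps a test run from walking all 28,635 records. It does not fix the underlying token problem on the provider side.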

kms6bn commented 8 years ago

@erinspace Did they ever respond to your question re: resumption tokens? I just checked today, and they are still broken.

kms6bn commented 8 years ago

Not sure why Travis is failing: "The command "invoke wheelhouse --develop" failed. Retrying, 2 of 3."