benwbrum / fromthepage

FromThePage is a wiki-like application for crowdsourcing transcription of handwritten documents.
http://fromthepage.com
GNU Affero General Public License v3.0

not importing full first page of collection manifest #4348

Open saracarl opened 1 month ago

saracarl commented 1 month ago

When TN imported the following collection: https://cdm15138.contentdm.oclc.org/iiif/p15138coll54/manifest.json

They only got around 450 items. We would have expected the first page of items -- 1000 of them.

We need to reproduce and see why they didn't get the full first page.

WillNigel23 commented 1 month ago

We have our direction for the short-term patch.

Ideas for the long-term patch:

Regardless of how much we import:

  1. Each sc_collection object will have at most 100 links.
  2. This means that if the user has selected more than 100, we break the selection into multiple sc_collections pointing to the same collection.
  3. We then kick off multiple rake tasks, one per sc_collection.
     3.1. As a sub-feature, these should have better logging.
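
The 100-link split in steps 1 and 2 could look roughly like the sketch below. It only illustrates the batching. `batch_manifest_ids`, `BATCH_SIZE`, and the sample IDs are hypothetical names, and the step that would create an sc_collection and start a rake task per batch is left as a comment.

```ruby
# Hypothetical sketch: split the selected manifest IDs into batches of at most 100.
BATCH_SIZE = 100

# Returns an array of arrays, each holding at most batch_size IDs.
def batch_manifest_ids(manifest_ids, batch_size = BATCH_SIZE)
  manifest_ids.each_slice(batch_size).to_a
end

# Made-up selection of 250 manifests, used only for illustration.
selected = (1..250).map { |i| "manifest-#{i}" }
batches = batch_manifest_ids(selected)

batches.each_with_index do |batch, index|
  # In the application, each batch would become one sc_collection and one import task.
  puts "batch #{index + 1}: #{batch.size} manifests"
end
```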

saracarl commented 1 month ago

We also need the import-all feature as part of this. I wonder if we should gather all the pages and recurse on each page to separate the items into 100-item sets.

So what are the implications of 8500 rake tasks at once?

WillNigel23 commented 1 month ago

Well, performance hits of course. That is why we need active_job so that we can queue jobs instead.
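
To illustrate what queueing buys us, here is a minimal stand-in that uses only Ruby's standard-library `Queue` and threads. It is not the active_job API. `run_queued` and `WORKER_COUNT` are hypothetical names, and the lambdas stand in for real import tasks. The point is that a fixed number of workers drain the queue, so the tasks never all run at once.

```ruby
# Stand-in for a job queue: a fixed number of workers drain a shared queue,
# so at most WORKER_COUNT imports run at any moment.
WORKER_COUNT = 4 # hypothetical limit

def run_queued(jobs, worker_count = WORKER_COUNT)
  queue = Queue.new
  jobs.each { |job| queue << job }
  queue.close # workers stop once the closed queue is empty

  results = Queue.new
  workers = Array.new(worker_count) do
    Thread.new do
      # Queue#pop returns nil once the queue is closed and empty.
      while (job = queue.pop)
        results << job.call
      end
    end
  end
  workers.each(&:join)

  # Completion order depends on thread scheduling, so results are unordered.
  Array.new(results.size) { results.pop }
end

# Twenty fake import jobs, used only for illustration.
jobs = (1..20).map { |n| -> { "imported batch #{n}" } }
finished = run_queued(jobs)
puts "#{finished.size} batches imported"
```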

WillNigel23 commented 1 month ago

I think that, as an initial feature, we should always recurse to get all pages, though.

benwbrum commented 1 month ago

This script will pull everything from a paginated manifest into a single manifest.

Note that the 85 fetches to OCLC take a few minutes, so if we productize this, it should be backgrounded and not run in a browser request.

require 'open-uri'
require 'json'

uri = 'https://cdm15138.contentdm.oclc.org/iiif/2/p15138coll54/manifest.json'

# Fetch the top-level collection; its 'first' key links to the first page.
collection = JSON.parse(URI.open(uri).read)
manifests = []
page_uri = collection['first']

# Follow the 'next' links until the last page, accumulating every manifest.
while page_uri
  p page_uri
  page = JSON.parse(URI.open(page_uri).read)
  manifests.concat(page.fetch('manifests', []))
  page_uri = page['next']
end

# Reuse the top-level collection as the consolidated manifest.
collection['label'] = 'FromThePage consolidated Tennessee Death Records'
collection['manifests'] = manifests
collection.delete('first')

# The block form closes the file automatically.
File.open('/tmp/big_oclc_manifest.json', 'w') do |f|
  f.print(collection.to_json)
end

big_oclc_manifest.json
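
If we productize the page walk above, one possible shape is a function that takes the fetcher as an argument, so it can run in a background job and be tested without hitting OCLC. This is only a sketch. `collect_manifests`, the `fetch` callable, and the fake two-page collection are hypothetical, and the visited-page guard is an addition that the script above does not have.

```ruby
require 'json'
require 'set'

# Walks a paginated collection starting at first_uri.
# fetch is any callable that takes a URI string and returns the raw JSON body.
def collect_manifests(first_uri, fetch)
  manifests = []
  seen = Set.new
  page_uri = first_uri

  # Stop at the last page, or if a page links back to one already visited.
  while page_uri && seen.add?(page_uri)
    page = JSON.parse(fetch.call(page_uri))
    manifests.concat(page.fetch('manifests', []))
    page_uri = page['next']
  end

  manifests
end

# Fake two-page collection standing in for the OCLC responses.
pages = {
  'page/1' => { 'manifests' => [{ '@id' => 'a' }, { '@id' => 'b' }], 'next' => 'page/2' }.to_json,
  'page/2' => { 'manifests' => [{ '@id' => 'c' }] }.to_json
}
all_manifests = collect_manifests('page/1', ->(uri) { pages.fetch(uri) })
puts all_manifests.size
```

Against the real server, the fetcher would be something like `->(uri) { URI.open(uri).read }` with `open-uri` loaded, as in the script above.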