Digital Bodleian (Oxford University)

Source API Endpoint / Documentation

https://digital.bodleian.ox.ac.uk

Provider description

We need to reach out to the digital collections manager (contact information is on this page) to ask about an API endpoint we could use. The advanced search has no UI for searching by rights.

The collection consists of high-definition digitisations of library collection items. Not everything is openly licensed, but plenty of items are. It covers photography, manuscripts, music scores, and more, with origins from all over the world. Item descriptions are often detailed and of very high quality.

Licenses Provided

Various CC licences

Provider API Technical info

~See above, we'd need to develop a partnership with Oxford's library to accomplish this one.~

Update: Bodleian implements IIIF, a standard format for image description APIs: https://digital.bodleian.ox.ac.uk/developer/data/. They ostensibly use version 3 of the Presentation API, but individual results link to version 2. They do follow Dublin Core for term definitions. That information is derived from the context endpoint, https://digital.bodleian.ox.ac.uk/api/1/context.json, and from the link to https://iiif.io/api/presentation/2/context.json on individual work manifests. This doesn't matter too much, other than that we can use it to get a general idea of the API shape. I've clarified some details below anyway.

We can paginate through the search results by passing a sort=published asc query parameter. There are currently 20177 items, and the page size can be 20, 40, or 100. Even with an extremely conservative throttle of one request every 4 seconds, fetching every item's manifest would take less than a day (20177 * 4 / 60 / 60 = about 22.4 hours). The search result pages themselves add little on top of that (about 202 requests at 100 items per page).

For each item on each results page, we need to fetch its manifest and check for either the attribution field or a requiredStatement whose value mentions a CC licence: Bodleian uses both to designate openly licensed media. For example, this item uses requiredStatement: https://iiif.bodleian.ox.ac.uk/iiif/manifest/76528dc3-dfe6-4c28-ae5f-d1b2e61fa638.json

Whereas this item uses attribution: https://iiif.bodleian.ox.ac.uk/iiif/manifest/60834383-7146-41ab-bfe1-48ee97bc04be.json

Here is an example of an item that is not openly licensed: https://iiif.bodleian.ox.ac.uk/iiif/manifest/441db95d-cdff-472e-bb2d-b46f043db82d.json
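To make the licence check concrete, here is a minimal sketch of how a manifest could be scanned for a CC mention in either field. It assumes the manifest has already been fetched and parsed into a dict. The function names and the regular expression are hypothetical: the exact wording Bodleian uses for licence mentions still needs to be confirmed against real manifests, so the pattern below is only a guess at shapes like "CC-BY-NC 4.0" or a creativecommons.org URL.

```python
import re
from typing import Any, Iterator, Optional

# Assumption: licence mentions look like "CC-BY-NC 4.0", "CC BY 4.0", "CC0",
# or contain a creativecommons.org licence URL. Verify against real manifests.
CC_PATTERN = re.compile(
    r"creativecommons\.org/(?:licenses|publicdomain)/[\w\-]+"
    r"|\bCC[\s\-]?(?:BY(?:[\s\-](?:NC|ND|SA))*|0)\b",
    re.IGNORECASE,
)


def iter_strings(value: Any) -> Iterator[str]:
    """Yield every string nested anywhere inside a parsed JSON value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)


def find_cc_mention(manifest: dict) -> Optional[str]:
    """Return the first CC licence mention found in requiredStatement or
    attribution, or None if neither field mentions one."""
    for field in ("requiredStatement", "attribution"):
        for text in iter_strings(manifest.get(field)):
            match = CC_PATTERN.search(text)
            if match:
                return match.group(0)
    return None
```

Walking every nested string avoids depending on whether a field is a plain string or a language map, which differs between versions 2 and 3 of the Presentation API. Mapping the matched text to a licence type and version would be a separate step.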

After the first DAG run, subsequent DAG runs could skip if the total item count hasn't changed. We can also safely skip works we've already checked (regardless of whether they had an open licence), to reduce double work and load on their API in future iterations. This DAG would probably only realistically need to run once a month, maybe even only once every 3 months. If requested we could run it on demand as well, because it would not be a dated DAG.
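The two skip rules could look roughly like the sketch below. The function names are hypothetical, and how the previous total and the set of checked identifiers would be persisted between runs is left open here (that storage is an assumption, not something settled above).

```python
from typing import Iterable, List, Optional, Set


def should_skip_run(current_total: int, previous_total: Optional[int]) -> bool:
    """Skip the whole run when the collection size has not changed.

    previous_total is None on the first run, which must never be skipped.
    """
    return previous_total is not None and current_total == previous_total


def works_to_check(page_ids: Iterable[str], checked_ids: Set[str]) -> List[str]:
    """Return the works on a results page that have not been checked before,
    regardless of whether the earlier check found an open licence."""
    return [work_id for work_id in page_ids if work_id not in checked_ids]
```

Note that comparing totals cannot detect a period in which the same number of items were added and removed, so an occasional forced full run may still be worthwhile.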

Each work has a list of metadata, much of which should be incorporated into our search, potentially via tags, but we should also save it into our regular meta_data blob for future reference. For example, the Author metadata represents zero to many known authors of the digitised work. The digitisation itself should be attributed using the attribution field or the "Terms of Use" requiredStatement, not the authors: the original author of a work (e.g., a 15th century monk) is not the rights holder for the digitisation; whoever is named in the attribution field is. Title is also found in the metadata. Additional metadata that could be useful for search includes Provenance, Binding, Collection, Holding Institution, Materials, Date Statement, and Language, among others.
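As a rough sketch of how the metadata list could be flattened for the meta_data blob, the helper below turns IIIF label/value pairs into a dict of label to list of values. It assumes the standard IIIF shapes: plain strings or "@value" objects in version 2, and language maps in version 3. The function names are hypothetical, and the labels Bodleian actually uses should be checked against real manifests.

```python
from typing import Any, Dict, List


def text_values(value: Any) -> List[str]:
    """Flatten a IIIF label or value (v2 or v3 shape) into a list of strings."""
    if isinstance(value, str):
        return [value]
    if isinstance(value, list):
        return [text for item in value for text in text_values(item)]
    if isinstance(value, dict):
        if "@value" in value:
            # v2 shape: {"@value": "...", "@language": "en"}
            return text_values(value["@value"])
        # v3 shape: language map such as {"en": ["..."]}
        return [text for item in value.values() for text in text_values(item)]
    return []


def metadata_to_dict(manifest: dict) -> Dict[str, List[str]]:
    """Map each metadata label to all of its values.

    Values are always lists because fields such as Author can hold
    zero to many entries.
    """
    result: Dict[str, List[str]] = {}
    for entry in manifest.get("metadata", []):
        for label in text_values(entry.get("label")):
            result.setdefault(label, []).extend(text_values(entry.get("value")))
    return result
```

The resulting dict could be stored as-is in meta_data, with selected keys (for example Collection or Language) promoted to tags later.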

Each work has a thumbnail.

Checklist to complete before beginning development

[x] Verify there is a way to retrieve the entire relevant portion of the provider's collection in a systematic way via their API.
[x] Verify the API provides license info (license type and version; license URL provides both, and is preferred)
[x] Verify the API provides stable direct links to individual works.
[x] Verify the API provides a stable landing page URL to individual works.
[x] Note other info the API provides, such as thumbnails, dimensions, attribution info (required if non-CC0 licenses will be kept), title, description, other meta data, tags, etc.
[x] Attach example responses to API queries that have the relevant info.

Implementation

[ ] 🙋 I would be interested in implementing this feature.
