WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License

Digital Bodleian (Oxford University) #2992

Open sarayourfriend opened 1 year ago

sarayourfriend commented 1 year ago

Source API Endpoint / Documentation

https://digital.bodleian.ox.ac.uk

Provider description

We need to reach out to the digital collections manager (contact information is on this page) to ask about an API endpoint we could use. The advanced search has no UI for searching by rights.

The collection consists of high-definition digitisations of library collection items. Not everything is openly licensed, but plenty of items are. It includes photography, manuscripts, music scores, and more, from origins all across the world. Item descriptions are often detailed and of very high quality.

Licenses Provided

Various CC licences

Provider API Technical info

~See above, we'd need to develop a partnership with Oxford's library to accomplish this one.~

Update: Bodleian implements IIIF, a standard format for image description APIs: https://digital.bodleian.ox.ac.uk/developer/data/. They ostensibly use version 3 of the presentation API, but individual results link to version 2. They definitely do follow Dublin Core for terms definitions. That information is derived from the context endpoint: https://digital.bodleian.ox.ac.uk/api/1/context.json and their link to https://iiif.io/api/presentation/2/context.json on individual work manifests. This doesn't matter too much, other than that we can use it to get a general idea of the API shape. I've gone ahead and clarified some things below anyway though.

We can paginate through the search results by passing a sort=published asc query parameter. There are currently 20177 items. Page size can be 20, 40, or 100. Even with an extremely conservative approach to request throttling, one request every 4 seconds, it would take less than a day to work through the entire collection (20177 * 4 / 60 / 60 = about 22 and a half hours). For each member of each results page, we need to fetch its manifest endpoint and check the attribution field OR the requiredStatement for a value that mentions a CC licence: Bodleian uses both to designate openly licensed media. For example, this item uses requiredStatement: https://iiif.bodleian.ox.ac.uk/iiif/manifest/76528dc3-dfe6-4c28-ae5f-d1b2e61fa638.json

Whereas this item uses attribution: https://iiif.bodleian.ox.ac.uk/iiif/manifest/60834383-7146-41ab-bfe1-48ee97bc04be.json

Here is an example of an item that is not openly licensed: https://iiif.bodleian.ox.ac.uk/iiif/manifest/441db95d-cdff-472e-bb2d-b46f043db82d.json
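To make the licence check concrete, here is a minimal sketch in Python. It assumes the manifest has already been fetched and parsed into a dict, and that the licence mention can appear as a plain string, a list, an `@value` object, or a language map under either field. The function names and the regular expression are hypothetical, and the exact wording Bodleian uses in these fields has not been verified, so the pattern would need checking against real manifests.

```python
import re
from typing import Any, Iterator, Optional

# Assumed pattern: matches CC licence URLs and common short forms such as
# "CC-BY-NC 4.0" or "CC0". Not verified against Bodleian's actual wording.
CC_PATTERN = re.compile(
    r"creativecommons\.org/(licenses|publicdomain)/"
    r"|\bCC[ -]?(BY|0)\b"
    r"|Creative Commons",
    re.IGNORECASE,
)


def iter_strings(value: Any) -> Iterator[str]:
    """Yield every string nested inside a JSON-like value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        if "@value" in value:
            # Language-tagged object: only the text matters, not "@language".
            yield from iter_strings(value["@value"])
        else:
            for nested in value.values():
                yield from iter_strings(nested)
    elif isinstance(value, list):
        for nested in value:
            yield from iter_strings(nested)


def find_licence_statement(manifest: dict) -> Optional[str]:
    """Return the first attribution/requiredStatement text mentioning a CC licence."""
    for field in ("attribution", "requiredStatement"):
        for text in iter_strings(manifest.get(field)):
            if CC_PATTERN.search(text):
                return text
    return None
```

A work would be treated as openly licensed only when `find_licence_statement` returns a string; a `None` result covers both missing fields and fields without a CC mention.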

After the first DAG run, subsequent DAG runs could skip if the total item count hasn't changed. We can also safely skip works we've already checked (regardless of whether they had an open licence), to reduce double work and load on their API in future iterations. This DAG would probably only realistically need to run once a month, maybe even only once every 3 months. If requested we could run it on demand as well, because it would not be a dated DAG.
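The skip rules described above could be sketched roughly as follows. This is only an illustration: the function names are hypothetical, and where the previous item count and the set of already-checked work identifiers would be stored (an Airflow Variable, a table, or something else) is left open.

```python
from typing import Iterable, List, Optional, Set


def should_skip_run(previous_total: Optional[int], current_total: int) -> bool:
    """Skip the run when the total item count matches the last run's count."""
    return previous_total is not None and previous_total == current_total


def unchecked_works(page_ids: Iterable[str], checked_ids: Set[str]) -> List[str]:
    """Keep only works whose manifests have not been checked yet, in page order."""
    return [work_id for work_id in page_ids if work_id not in checked_ids]
```

One thing to keep in mind with the count-based check: it would not notice a period in which one item was added and another removed, so it is a heuristic rather than a guarantee.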

Each work has a list of metadata entries, many of which should be incorporated into our search, potentially via tags, but we should also save them into our regular meta_data blob for future reference. For example, the Author metadata represents zero to many known authors of the digitised work. Note that the digitisation should be attributed using the attribution field or the "Terms of Use" requiredStatement, not the authors: the original authors of a work (e.g., a 15th century monk) are not the rights holders for the digitisation, it's whoever is in the attribution field! Title is also found in the metadata. Additional interesting metadata that could be useful for search: Provenance, Binding, Collection, Holding Institution, Materials, Date Statement, Language... and so on.
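As a rough sketch of how the metadata list could be flattened for the meta_data blob, the Python below assumes the IIIF convention of a `metadata` array of label/value pairs, where labels and values may be plain strings, lists, `@value` objects, or language maps. The function names are hypothetical and the shapes Bodleian actually returns should be confirmed against real manifests.

```python
from typing import Any, Dict, Iterator, List


def _strings(value: Any) -> Iterator[str]:
    """Yield every string nested inside a label or value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        if "@value" in value:
            # Language-tagged object: skip the "@language" code.
            yield from _strings(value["@value"])
        else:
            for nested in value.values():
                yield from _strings(nested)
    elif isinstance(value, list):
        for nested in value:
            yield from _strings(nested)


def metadata_to_dict(manifest: dict) -> Dict[str, List[str]]:
    """Map each metadata label to the list of its values."""
    result: Dict[str, List[str]] = {}
    for entry in manifest.get("metadata", []):
        labels = list(_strings(entry.get("label")))
        if not labels:
            continue  # Entries without a usable label are ignored.
        result.setdefault(labels[0], []).extend(_strings(entry.get("value")))
    return result
```

With this shape, `metadata_to_dict(manifest).get("Author", [])` would give the zero-to-many authors of the original work, while attribution for the digitisation would still come from the attribution field or requiredStatement.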

Each work has a thumbnail.

Checklist to complete before beginning development

Implementation

sarayourfriend commented 1 year ago

Added blocked status to indicate we need to reach out to the provider. To clarify: I have not yet done that and we don't necessarily have a good way to track that work. Any ideas @WordPress/openverse-maintainers for how best to track that kind of thing? I think there are other providers we need to follow up on too (like SLV). It'd be nice to have a dashboard of some kind that tracks the development of these partnerships.