Open sarayourfriend opened 1 year ago
Added blocked status to indicate we need to reach out to the provider. To clarify: I have not yet done that and we don't necessarily have a good way to track that work. Any ideas @WordPress/openverse-maintainers for how best to track that kind of thing? I think there are other providers we need to follow up on too (like SLV). It'd be nice to have a dashboard of some kind that tracks the development of these partnerships.
Source API Endpoint / Documentation
https://digital.bodleian.ox.ac.uk
Provider description
We need to reach out to the digital collections manager (contact information is on this page) to ask about an API endpoint we could use. The advanced search does not have UI for searching by rights.
The collection consists of high-definition digitisations of library collection items. Not everything is openly licensed, but plenty is. It covers everything from photography to manuscripts, music scores, and more, with origins all across the world. Item descriptions are often detailed and of very high quality.
Licenses Provided
Various CC licences
Provider API Technical info
~See above, we'd need to develop a partnership with Oxford's library to accomplish this one.~
Update: Bodleian implements IIIF, a standard format for image description APIs: https://digital.bodleian.ox.ac.uk/developer/data/. They ostensibly use version 3 of the presentation API, but individual results link to version 2. They definitely do follow Dublin Core for terms definitions. That information is derived from the context endpoint: https://digital.bodleian.ox.ac.uk/api/1/context.json and their link to https://iiif.io/api/presentation/2/context.json on individual work manifests. This doesn't matter too much, other than that we can use it to get a general idea of the API shape. I've gone ahead and clarified some things below anyway though.
We can paginate through the search results by passing a `sort=published asc` query parameter. There are currently 20177 items. Page size can be 20, 40, or 100. If we took an extremely conservative approach to request throttling and made one request every 4 seconds, it would still take less than a day to work through the entire collection (20177 * 4 / 60 / 60 = about 22.4 hours). For each member of each page, we need to check the manifest endpoint for the `attribution` field OR a `requiredStatement` where the value includes a mention of a CC licence: Bodleian uses both to designate openly licensed media.

For example, this item uses `requiredStatement`: https://iiif.bodleian.ox.ac.uk/iiif/manifest/76528dc3-dfe6-4c28-ae5f-d1b2e61fa638.json

Whereas this item uses `attribution`: https://iiif.bodleian.ox.ac.uk/iiif/manifest/60834383-7146-41ab-bfe1-48ee97bc04be.json

Here is an example of an item that is not openly licensed: https://iiif.bodleian.ox.ac.uk/iiif/manifest/441db95d-cdff-472e-bb2d-b46f043db82d.json
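
A rough sketch of what the licence check could look like, to make the "either field" rule concrete. I have not verified the exact JSON shape of either field, so this assumes the value may be a plain string, a list, or a nested language map, and just walks whatever it finds. The function names and the `CC_PATTERN` regex are hypothetical and would need tuning against real manifests.

```python
import re
from typing import Any, Iterator, Optional

# Hypothetical pattern: matches a Creative Commons URL, the spelled-out
# name, or short forms such as "CC-BY", "CC BY" and "CC0".
CC_PATTERN = re.compile(
    r"creativecommons\.org|creative\s+commons|\bCC[\s-]?(?:BY|0)\b",
    re.IGNORECASE,
)


def iter_strings(value: Any) -> Iterator[str]:
    """Yield every string nested anywhere inside a JSON-like value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for child in value.values():
            yield from iter_strings(child)
    elif isinstance(value, list):
        for child in value:
            yield from iter_strings(child)


def find_cc_statement(manifest: dict) -> Optional[str]:
    """Return the first string in `attribution` or `requiredStatement`
    that mentions a CC licence, or None if neither field does."""
    for field_name in ("attribution", "requiredStatement"):
        for text in iter_strings(manifest.get(field_name)):
            if CC_PATTERN.search(text):
                return text
    return None
```

Returning the matched text rather than a boolean leaves room for a later step that maps the statement to a specific licence and version; a work with no match would simply be skipped.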
After the first DAG run, subsequent DAG runs could skip if the total item count hasn't changed. We can also safely skip works we've already checked (regardless of whether they had an open licence), to reduce double work and load on their API in future iterations. This DAG would probably only realistically need to run once a month, maybe even only once every 3 months. If requested we could run it on demand as well, because it would not be a dated DAG.
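
The skip logic could be sketched roughly like this. `CrawlState` and the helper names are hypothetical, and where the state would actually be persisted between DAG runs is left open; the point is only to show the two checks (unchanged total, already-checked works).

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Set


@dataclass
class CrawlState:
    """State carried between runs (hypothetical; storage is not decided)."""
    last_total: Optional[int] = None
    checked_ids: Set[str] = field(default_factory=set)


def collection_changed(state: CrawlState, current_total: int) -> bool:
    """True on the first run or whenever the reported item count differs."""
    return state.last_total != current_total


def unchecked(state: CrawlState, work_ids: Iterable[str]) -> List[str]:
    """Keep only works we have not inspected before, preserving order."""
    return [w for w in work_ids if w not in state.checked_ids]


def finish_run(state: CrawlState, current_total: int, checked: Iterable[str]) -> None:
    """Record the works inspected in this run and the total we saw."""
    state.checked_ids.update(checked)
    state.last_total = current_total
```

One thing to keep in mind with the count-based check: it would miss a run where the same number of items were added and removed, so an occasional forced run might still be worthwhile.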
Each work has a list of `metadata` entries, many of which should be incorporated into our search, potentially via tags, but we should also save them into our regular `meta_data` blob for future reference. For example, the `Author` metadata represents zero to many known authors of the digitised work. The digitisation should be attributed using the `attribution` field or the "Terms of Use" `requiredStatement`, not the authors: the original authors of a work (e.g., a 15th century monk) are not the rights holders for the digitisation, it's whoever is in the attribution field! Title is also found in the metadata. Additional interesting metadata that could be useful for search: Provenance, Binding, Collection, Holding Institution, Materials, Date Statement, Language... and so on.

Each work has a thumbnail.
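
For illustration, flattening the `metadata` list into something we could store in `meta_data` might look like the sketch below. It assumes each entry is a label/value pair whose label and value can each be a plain string, a list, or a language map; that shape is a guess based on IIIF conventions, not something checked against every Bodleian manifest. `flatten_metadata` and `build_record` are hypothetical names.

```python
from typing import Any, Dict, Iterator, List


def iter_strings(value: Any) -> Iterator[str]:
    """Yield every string nested anywhere inside a JSON-like value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for child in value.values():
            yield from iter_strings(child)
    elif isinstance(value, list):
        for child in value:
            yield from iter_strings(child)


def flatten_metadata(manifest: dict) -> Dict[str, List[str]]:
    """Map each metadata label to the list of its string values."""
    flat: Dict[str, List[str]] = {}
    for entry in manifest.get("metadata", []):
        labels = list(iter_strings(entry.get("label")))
        if not labels:
            continue  # entry without a usable label
        flat.setdefault(labels[0], []).extend(iter_strings(entry.get("value")))
    return flat


def build_record(manifest: dict) -> dict:
    """Pull the title out and keep everything else in a meta_data blob.

    Authors stay inside meta_data on purpose: they describe the original
    work and are not the rights holders of the digitisation.
    """
    flat = flatten_metadata(manifest)
    titles = flat.get("Title", [])
    return {"title": titles[0] if titles else None, "meta_data": flat}
```

Keeping every label in `meta_data` means fields like Provenance or Language can be promoted to tags later without re-crawling.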
Checklist to complete before beginning development
Implementation