cc-archive / open-ledger

Prototype code and examples for work on the Creative Commons "CC Search" project
MIT License

Integrate the Europeana API #162

Closed mzeinstra closed 7 years ago

mzeinstra commented 7 years ago

Europeana has 25 million objects that are licensed under a CC license, marked with the PDM, or have had rights waived using CC0, and that all have a linkable image attached. The Europeana API makes it easy to ingest the data of these objects, and the Europeana dataset could bring dozens of organisations into your meta search engine.

Go to labs.europeana.eu for information about the API.

Contact @DavidHaskiya for more information

DavidHaskiya commented 7 years ago

Hi, our API docs are here: http://labs.europeana.eu/api/introduction

I think of special interest to CC Search is that it's possible, by faceting, to limit results by rights status, to only images of a certain resolution, and so on. It's also possible to exclude hits from specified providers (e.g. Rijksmuseum, to avoid duplicates). You can also limit the search to title and creator only.

Example of the above: a search for all records in Europeana that have images larger than 1 megapixel, are Public Domain marked/CC0/CC-BY/CC-BY-SA, have the word rembrandt in a creator or contributor field, and are not from the Rijksmuseum.

http://www.europeana.eu/api/v2/search.json?query=NOT+PROVIDER%3A+%22Rijksmuseum%22+AND+who%3Arembrandt&media=true&qf=IMAGE_SIZE%3Alarge&qf=IMAGE_SIZE%3Aextra_large&reusability=open&qf=TYPE%3AIMAGE&profile=rich&wskey=yourapikey

I think each object returned in the Rich profile has enough metadata for you to populate the CC Search display, so I don't think you'd need to make any full record calls (each object has more metadata that can only be retrieved via a specific record call). The link to the actual image is in the "edmIsShownBy" field.
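As a rough sketch of the request described above, the following Python builds the same search URL and pulls image links out of a decoded response. The function names are made up for illustration, and the response shape (an "items" list whose entries carry "edmIsShownBy" as a string or a list of strings) is an assumption; only the field name itself comes from this thread.

```python
from urllib.parse import urlencode

SEARCH_URL = "http://www.europeana.eu/api/v2/search.json"


def build_search_url(api_key, creator, excluded_provider):
    """Build a search URL equivalent to the example in this comment."""
    params = [
        # NOT/AND query: skip one provider, match creator/contributor.
        ("query", f'NOT PROVIDER: "{excluded_provider}" AND who:{creator}'),
        ("media", "true"),
        ("qf", "IMAGE_SIZE:large"),
        ("qf", "IMAGE_SIZE:extra_large"),
        ("reusability", "open"),
        ("qf", "TYPE:IMAGE"),
        ("profile", "rich"),
        ("wskey", api_key),
    ]
    return SEARCH_URL + "?" + urlencode(params)


def image_links(response):
    """Collect edmIsShownBy links from a decoded JSON response.

    Assumption: results live under "items", and edmIsShownBy is
    either a single string or a list of strings.
    """
    links = []
    for item in response.get("items", []):
        value = item.get("edmIsShownBy", [])
        if isinstance(value, str):
            value = [value]
        links.extend(value)
    return links
```

Calling build_search_url("yourapikey", "rembrandt", "Rijksmuseum") should reproduce the example URL above; fetching and JSON-decoding it is left to whatever HTTP client you prefer.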

Play around with it, and if you have any questions or suggestions, get in touch with me or Hugo, the product owner of our APIs: https://github.com/hugomanguinhas

lizadaly commented 7 years ago

Fantastic, thank you for the detailed description. We've always intended to include the Europeana collections in the project.

We ingest the metadata in batch and then do searches against our own index. Is it possible to issue a query with no search term? (Meaning, can we just politely crawl the archive for all items that are CC0/have an image?)

mzeinstra commented 7 years ago

Sure

Terms with CC licenses

RIGHTS:*zero* – for all cc0 material
RIGHTS:*mark – for all PDM material
RIGHTS:*creative* - for all CC legal tools

you see the pattern right?

Note that if you want only CC BY or only CC BY-SA, you need to use

RIGHTS:*CC-BY/*
RIGHTS:*CC-BY-SA/*

respectively, to exclude the other CC-BY-* works
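A small sketch of how these terms might be combined into one query string. The dictionary keys and the function name are hypothetical, and joining terms with OR inside parentheses is an assumption based on the NOT/AND syntax used in the example URLs in this thread, not something confirmed here.

```python
# Search terms copied from the list above; the keys are made-up labels.
RIGHTS_TERMS = {
    "cc0": "RIGHTS:*zero*",
    "pdm": "RIGHTS:*mark",
    "cc-by": "RIGHTS:*CC-BY/*",
    "cc-by-sa": "RIGHTS:*CC-BY-SA/*",
}


def rights_query(*kinds):
    """Return a query fragment matching any of the given rights kinds.

    Assumption: the API accepts OR and parentheses in the query.
    """
    terms = [RIGHTS_TERMS[kind] for kind in kinds]
    if len(terms) == 1:
        return terms[0]
    return "(" + " OR ".join(terms) + ")"
```

The resulting fragment would still need to be URL-encoded before being sent as the query parameter.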

Want all works that have a link to a media file (and not only pure metadata or only a thumbnail)? provider_aggregation_edm_isShownAt:*

Only 4 megapixel images? IMAGE_SIZE:extra_large

Be sure to check out the section http://labs.europeana.eu/api/search#profile-parameter to see how you can tweak what is returned.

Chat with me on Slack to find out more, but also look at http://labs.europeana.eu/api/data-fields

DavidHaskiya commented 7 years ago

Just a small correction to what Maarten wrote. The check for objects that have links to media files is: provider_aggregation_edm_isshownby:*

Note: this doesn't necessarily mean that the linked file resolves...

You can also make queries where you return 0 objects and also return facets. You can use that to e.g. first make a list of data providers (i.e. GLAMs who have published on Europeana) that you can then query one by one to fetch their items for indexing on your side.

Here's an example: http://www.europeana.eu/api/v2/search.json?query=NOT+PROVIDER%3A+%22Rijksmuseum%22+AND+*:*&media=true&qf=IMAGE_SIZE%3Alarge&qf=IMAGE_SIZE%3Aextra_large&reusability=open&qf=TYPE%3AIMAGE&profile=facets&rows=0&wskey=yourkey

Which in "natural language" corresponds to: search in Europeana and list all facets (profile=facets) where objects are not from the Rijksmuseum but with no other criteria (star colon star), have images larger than 1 megapixel, are PDM/CC0/CC-BY/CC-BY-SA, but list me no objects (rows=0).
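To turn that facet response into a list of data providers to crawl, something like the sketch below might do. The response layout (a "facets" list whose entries have a "name" and a "fields" list of "label"/"count" pairs) and the DATA_PROVIDER facet name are assumptions to check against a real response; the function name is made up.

```python
def providers_by_count(response, facet_name="DATA_PROVIDER"):
    """Return (provider, hit count) pairs, fewest hits first.

    Assumption: facets look like
    {"name": ..., "fields": [{"label": ..., "count": ...}, ...]}.
    """
    for facet in response.get("facets", []):
        if facet.get("name") == facet_name:
            pairs = [(f["label"], f["count"]) for f in facet.get("fields", [])]
            # Sort ascending by hit count so small providers come first.
            return sorted(pairs, key=lambda pair: pair[1])
    return []
```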

And then, search for and paginate over the data providers to actually fetch the metadata to index on your side. The maximum is 100 at a time. Here, a bit counterintuitively, I'm starting with the data provider from the previous query with the fewest hits, "Bohusläns museum": http://www.europeana.eu/api/v2/search.json?query=NOT+PROVIDER%3A+%22Rijksmuseum%22+AND+DATA_PROVIDER:%22Bohusl%C3%A4ns%20museum%22&media=true&qf=IMAGE_SIZE%3Alarge&qf=IMAGE_SIZE%3Aextra_large&reusability=open&qf=TYPE%3AIMAGE&profile=rich&rows=100&wskey=yourkey

Then rinse and repeat.
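The "rinse and repeat" loop could look roughly like this in Python. It is only a sketch: fetch_page is a hypothetical callable you would supply (it performs the HTTP request for one provider and returns the decoded JSON), and the 1-based start offset and the "totalResults" and "items" response fields are assumptions not stated in this thread.

```python
def fetch_all(fetch_page, rows=100):
    """Yield every item for one query, rows items per request.

    fetch_page(start=..., rows=...) is a caller-supplied function
    returning one decoded response page (hypothetical interface).
    """
    start = 1  # assumption: offsets are 1-based
    while True:
        page = fetch_page(start=start, rows=rows)
        items = page.get("items", [])
        if not items:
            break  # an empty page means nothing is left
        yield from items
        start += len(items)
        total = page.get("totalResults")
        if total is not None and start > total:
            break  # stop early once the reported total is covered
```

Keeping the HTTP call outside the loop makes it easy to add polite delays or retries in fetch_page without touching the pagination logic.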

lizadaly commented 7 years ago

The first release of this is done: 470,000 Europeana results are now on the site.