WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org

Getty Museum #3893

Open · tieguy opened 3 months ago

tieguy commented 3 months ago

Source API Endpoint / Documentation

https://data.getty.edu/museum/collection/docs/

Provider description

The Getty Museum is a well-known American art museum that just announced the addition of 88,000 CC0-licensed images to its collection, available through its API.

Licenses Provided

CC0 and others

Provider API Technical info

I don't know.

Checklist to complete before beginning development

Implementation

JinEnMok commented 3 months ago

They're using the IIIF Image API to access individual objects. It supports everything your checklist requires except searching; for that, there's a separate SPARQL API, and I'm not sure how to use it.

The "data" query's response for an object contains a shows field, which contains at least one link to "image" queries, whose access_points entries contain links to images proper.

zackkrida commented 3 months ago

This is excellent, thank you both! Some more details I've explored:

CC0 images can be detected like so:

If the value at subject_to[0].classified_as[0].id is "https://creativecommons.org/publicdomain/zero/1.0/", then you're free to use the image without Getty's permission.

https://data.getty.edu/museum/collection/docs/#exception-1-images
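
As a sketch, that check might look like this in Python (key path straight from the docs; anything missing at that path is treated as not CC0):

CC0_URL = "https://creativecommons.org/publicdomain/zero/1.0/"

def image_is_cc0(image_record):
    """True when the image record declares CC0 at the expected key path."""
    try:
        return image_record["subject_to"][0]["classified_as"][0]["id"] == CC0_URL
    except (LookupError, TypeError):
        # No rights statement at the expected path: assume not CC0.
        return False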

One caveat is that some of their metadata (descriptions of images, in this case) is not CC0, which can be identified via the referred_to_by[0].subject_to[0].classified_as[0].id key:

https://data.getty.edu/museum/collection/docs/#exception-2-written-descriptions

If the value is https://creativecommons.org/publicdomain/zero/1.0/, then you're free to use the text however you'd like. If it is https://creativecommons.org/licenses/by/4.0/, you can use the text as you'd like with appropriate attribution.

Fortunately we should be able to use either and simply attribute it properly.
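
A companion sketch for the description text, mapping the rights URL at that deeper key path to the attribution we'd need (the slug values are just illustrative):

TEXT_LICENSES = {
    "https://creativecommons.org/publicdomain/zero/1.0/": "cc0",  # no attribution required
    "https://creativecommons.org/licenses/by/4.0/": "by",  # attribution required
}

def description_license(object_record):
    """Return a license slug for the record's written description, or None."""
    try:
        rights_url = object_record["referred_to_by"][0]["subject_to"][0]["classified_as"][0]["id"]
    except (LookupError, TypeError):
        return None
    return TEXT_LICENSES.get(rights_url)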


Concerning reingestion or updating records...they have an API for that, conveniently:

https://data.getty.edu/museum/collection/docs/#tracking-changes

Tracking Changes

The second task is to be able to know when changes happen in records. What we've found is that many users cache our records and want the latest data—but they don't want to re-download the entire collection looking for changes—particularly since our records don't change that often!

Instead of forcing users to do that, we use the ActivityStreams protocol to publish an API that lists every record that's been created, edited, or deleted in date order. This standard emerged from social media—think of it as a Twitter feed. Each activity has information about what happened, who did it, when they did it, and what they did it to. It's like a tweet every time a record changed! For the Museum Collection, this feed is available at:

https://data.getty.edu/museum/collection/activity-stream

The API provides a list of pages of activities—you can access the first page at https://data.getty.edu/museum/collection/activity-stream/page/1, which contains the very first changes made to the API. You could also access page 11000, recording some of the changes that happened in March 2021.

Each page lists activities—for example, this activity records a change to our record for LACMA, showing that it was updated on March 1st, 2021.

You can use the ActivityStream to get a list of every record we have in our system, by starting at the first page and crawling forward, keeping track of everything that's been created and deleted. If you already have a copy of our data, though, you can start at the last page and crawl backwards, only pulling the records that have changed since the last time you scanned!
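
Here's a minimal sketch of that backwards crawl, assuming the stream's top-level response links to the final page under last.id and that orderedItems within a page run oldest-first (both visible in the responses further down):

import requests
from datetime import datetime

STREAM = "https://data.getty.edu/museum/collection/activity-stream"

def changes_since(last_scan):
    """Walk the stream from the newest page backwards, collecting every
    activity whose endTime is later than last_scan (a datetime)."""
    last_url = requests.get(STREAM, headers={"Accept": "application/json"}).json()["last"]["id"]
    page_num = int(last_url.rsplit("/", 1)[-1])
    changes = []
    while page_num >= 1:
        page = requests.get(f"{STREAM}/page/{page_num}").json()
        items = page.get("orderedItems", [])
        changes.extend(i for i in items if datetime.fromisoformat(i["endTime"]) > last_scan)
        # Pages run in date order, so stop once an entire page predates the last scan.
        if items and datetime.fromisoformat(items[0]["endTime"]) <= last_scan:
            break
        page_num -= 1
    return changes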

JinEnMok commented 3 months ago

@zackkrida I hadn't realised ActivityStream could be used to crawl the collection like that. That's very useful for a pet project of mine, thanks!

zackkrida commented 1 month ago

At lunch today I experimented a bit with the ActivityStream endpoint using a small Python script. I did use ChatGPT to clean my code up:

import aiohttp
import asyncio
import json
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

ACTIVITY_STREAM_ENDPOINT = "https://data.getty.edu/museum/collection/activity-stream"
HEADERS = {"Accept": "application/json"}
CONCURRENT_REQUESTS = 50  # Number of pages to fetch concurrently
OUTPUT_FILE = "filtered_data.json"  # Output file for filtered data

async def fetch_activity_stream_page(session, page):
    url = f"{ACTIVITY_STREAM_ENDPOINT}/page/{page}"
    logger.info(f"Fetching data from {url}...")

    async with session.get(url, headers=HEADERS) as response:
        if response.status == 200:
            logger.info("Data fetched successfully.")
            return await response.json()
        else:
            logger.error(f"Error: {response.status}")
            return None

def process_data(data):
    # Collect every activity type on the page, then keep only HumanMadeObject records.
    items = data.get('orderedItems', [])
    types = {item.get('object', {}).get('type') for item in items}
    filtered_items = [
        item for item in items
        if item.get('object', {}).get('type') == 'HumanMadeObject'
    ]
    return filtered_items, types

async def fetch_total_pages(session):
    url = ACTIVITY_STREAM_ENDPOINT
    logger.info(f"Fetching total number of pages from {url}...")

    async with session.get(url, headers=HEADERS) as response:
        if response.status == 200:
            data = await response.json()
            # The "last" link points at the final page; its URL ends with the page number.
            total_pages = data.get('last', {}).get('id', '').split('/')[-1]
            return int(total_pages) if total_pages.isdigit() else 0
        else:
            logger.error(f"Error: {response.status}")
            return 0

async def fetch_all_pages():
    all_data = []
    unique_types = set()
    async with aiohttp.ClientSession() as session:
        total_pages = await fetch_total_pages(session)
        logger.info(f"Total pages to fetch: {total_pages}")

        for i in range(0, total_pages, CONCURRENT_REQUESTS):
            tasks = [
                fetch_activity_stream_page(session, page)
                for page in range(i + 1, min(i + CONCURRENT_REQUESTS + 1, total_pages + 1))
            ]
            pages_data = await asyncio.gather(*tasks)
            for data in pages_data:
                if data:
                    filtered_data, types = process_data(data)
                    all_data.extend(filtered_data)
                    unique_types.update(types)

    return all_data, unique_types

def main():
    # asyncio.run() supersedes the deprecated get_event_loop()/run_until_complete pattern
    all_data, unique_types = asyncio.run(fetch_all_pages())

    if all_data:
        logger.info(f"Writing filtered data to {OUTPUT_FILE}")
        with open(OUTPUT_FILE, "w") as f:
            json.dump(all_data, f, indent=2)
    else:
        logger.info("No relevant data returned.")

    if unique_types:
        logger.info("Unique types found:")
        for unique_type in unique_types:
            print(unique_type)
    else:
        logger.info("No unique types found.")

if __name__ == "__main__":
    main()

This saves all of the HumanMadeObjects from Getty into a JSON file. Here's a sample of a few results:

[
  {
    "id": "https://data.getty.edu/museum/collection/activity-stream/b109894e-89b2-4a14-bfb3-a1cfc4cdb979",
    "type": "Create",
    "created": "2020-05-12T01:04:39",
    "endTime": "2020-05-12T01:04:39",
    "object": {
      "id": "https://data.getty.edu/museum/collection/object/08eaed9f-1354-4817-8aed-1db49e893a03",
      "type": "HumanMadeObject"
    }
  },
  {
    "id": "https://data.getty.edu/museum/collection/activity-stream/1d82729f-dff4-4719-a277-7effbe87c122",
    "type": "Create",
    "created": "2020-05-12T01:04:40",
    "endTime": "2020-05-12T01:04:40",
    "object": {
      "id": "https://data.getty.edu/museum/collection/object/637c32e7-f087-459e-816a-292e27fa95b0",
      "type": "HumanMadeObject"
    }
  },
  {
    "id": "https://data.getty.edu/museum/collection/activity-stream/6c87e86a-6126-46c9-92f8-3b53da92f37b",
    "type": "Create",
    "created": "2020-05-12T01:04:40",
    "endTime": "2020-05-12T01:04:40",
    "object": {
      "id": "https://data.getty.edu/museum/collection/object/948313c3-52f4-44b3-ba05-9367eb9aec84",
      "type": "HumanMadeObject"
    }
  },
  {
    "id": "https://data.getty.edu/museum/collection/activity-stream/2ede7144-cd3d-4f0c-bec3-5c37f8522c6f",
    "type": "Create",
    "created": "2020-05-12T01:04:40",
    "endTime": "2020-05-12T01:04:40",
    "object": {
      "id": "https://data.getty.edu/museum/collection/object/31a2c0d3-241c-4a5c-9e32-abdaf418e58d",
      "type": "HumanMadeObject"
    }
  }
]

Each object.id then links off to the full metadata for a work! I'm so excited to build a DAG for Openverse off of this API.

The type: Create information is also going to be very useful for reingesting and updating records. This endpoint gives us a receipt of every change to a record, so we can amend only the records that have been modified and never reingest data unnecessarily.
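
As a sketch of how that receipt could drive reingestion: collapse the date-ordered activities so the newest activity per object wins, then upsert or delete accordingly (Create, Update, and Delete are the ActivityStreams type values; the plan format here is just illustrative):

def plan_reingestion(activities):
    """Collapse oldest-first activities into one action per object id:
    the latest Create/Update wins as an upsert, a trailing Delete as a removal."""
    plan = {}
    for activity in activities:  # later entries overwrite earlier ones
        plan[activity["object"]["id"]] = (
            "delete" if activity["type"] == "Delete" else "upsert"
        )
    return plan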