tieguy opened 8 months ago
They're using the IIIF Image API to access individual objects. It supports everything your checklist requires except searching; for that, there's a separate SPARQL API, though I'm not sure yet how to use it.
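For reference, IIIF Image API URLs follow a fixed `{region}/{size}/{rotation}/{quality}.{format}` template, so once we have an image's base URI we can request it at any size. A sketch with a made-up base URI (`max` is the Image API 3.0 spelling for the size; 2.x uses `full`):

```python
# Hypothetical IIIF base URI; the real ones come from the image records
# described below.
base = "https://media.getty.edu/iiif/image/some-identifier"  # placeholder

full_size = f"{base}/full/max/0/default.jpg"       # whole image at maximum size
thumbnail = f"{base}/full/!400,400/0/default.jpg"  # scaled to fit inside 400x400
```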
The "data" query's response for an object contains a shows
field, which contains at least one link to "image" queries, whose access_points
entries contain links to images proper.
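In practice, the hop from an object to its image URLs might look like this (a sketch; the object URL reuses an id from the activity-stream sample further down, and I'm assuming each link resolves to plain JSON given an `Accept` header):

```python
# A sketch of the data -> shows -> image -> access_points traversal
# described above. Error handling is omitted for brevity.
import requests

HEADERS = {"Accept": "application/json"}
OBJECT_URL = "https://data.getty.edu/museum/collection/object/08eaed9f-1354-4817-8aed-1db49e893a03"

record = requests.get(OBJECT_URL, headers=HEADERS).json()

image_urls = []
for shown in record.get("shows", []):
    # Each `shows` entry links to an "image" record...
    image = requests.get(shown["id"], headers=HEADERS).json()
    # ...whose access_points contain the actual image URLs.
    for point in image.get("access_points", []):
        image_urls.append(point["id"])

print(image_urls)
```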
This is excellent, thank you both! Some more details I've explored:
CC0 images can be detected like so: if the value at `subject_to[0].classified_as[0].id` is `https://creativecommons.org/publicdomain/zero/1.0/`, then you're free to use the image without Getty's permission.

https://data.getty.edu/museum/collection/docs/#exception-1-images
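In code, that check might look like this (a sketch; `record` is an object's parsed JSON, and the hard-coded `[0]` indexes mirror the path above, so a more defensive version would scan every entry):

```python
# A sketch of the CC0 check described above.
CC0 = "https://creativecommons.org/publicdomain/zero/1.0/"

def image_is_cc0(record):
    try:
        # Mirrors the subject_to[0].classified_as[0].id path from the docs.
        return record["subject_to"][0]["classified_as"][0]["id"] == CC0
    except (KeyError, IndexError):
        return False
```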
One caveat is that some of their metadata (descriptions of images, in this case) is not CC0, which can be identified via the `referred_to_by[0].subject_to[0].classified_as[0].id` key:

https://data.getty.edu/museum/collection/docs/#exception-2-written-descriptions

If the value is `https://creativecommons.org/publicdomain/zero/1.0/`, then you're free to use the text however you'd like. If it is `https://creativecommons.org/licenses/by/4.0/`, you can use the text with appropriate attribution.
Fortunately we should be able to use either and simply attribute it properly.
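A similar helper could classify the description text's license (again a sketch; `record` is the parsed object JSON and the return labels are just our own shorthand):

```python
# A sketch of classifying a record's description text by the license
# path quoted above.
CC0 = "https://creativecommons.org/publicdomain/zero/1.0/"
CC_BY = "https://creativecommons.org/licenses/by/4.0/"

def description_license(record):
    try:
        license_id = record["referred_to_by"][0]["subject_to"][0]["classified_as"][0]["id"]
    except (KeyError, IndexError):
        return None
    if license_id == CC0:
        return "cc0"   # free to use however we'd like
    if license_id == CC_BY:
        return "by"    # free to use with attribution
    return None        # anything else needs a closer look
```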
Concerning reingestion and updating records, they conveniently have an API for that:
https://data.getty.edu/museum/collection/docs/#tracking-changes
> **Tracking Changes**
>
> The second task is to be able to know when changes happen in records. What we've found is that many users cache our records and want the latest data—but they don't want to re-download the entire collection looking for changes—particularly since our records don't change that often!
>
> Instead of forcing users to do that, we use the ActivityStreams protocol to publish an API that lists every record that's been created, edited, or deleted, in date order. This standard emerged from social media—think of it as a Twitter feed. Each activity has information about what happened, who did it, when they did it, and what they did it to. It's like a tweet every time a record changed! For the Museum Collection, this feed is available at:
>
> https://data.getty.edu/museum/collection/activity-stream
>
> The API provides a list of pages of activities—you can access the first page at https://data.getty.edu/museum/collection/activity-stream/page/1, which records the very first changes made to the API. You could also access page 11000, recording some of the changes that happened in March 2021.
>
> Each page lists activities—for example, this activity records a change to our record for LACMA, showing that it was updated on March 1st, 2021.
>
> You can use the ActivityStream to get a list of every record we have in our system, by starting at the first page and crawling forward, keeping track of everything that's been created and deleted. If you already have a copy of our data, though, you can start at the last page and crawl backwards, only pulling the records that have changed since the last time you scanned!
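That backwards crawl might look something like this (a rough sketch, assuming the standard ActivityStreams `last`/`prev` paging fields, with a hard-coded checkpoint standing in for our last ingestion time):

```python
# A sketch of the backwards crawl described above: start from the stream's
# `last` page and follow `prev` links until activities predate our checkpoint.
import requests
from datetime import datetime, timezone

STREAM = "https://data.getty.edu/museum/collection/activity-stream"
HEADERS = {"Accept": "application/json"}
LAST_SYNC = datetime(2021, 3, 1, tzinfo=timezone.utc)  # placeholder checkpoint

# The collection summary links to the most recent page.
page_url = requests.get(STREAM, headers=HEADERS).json()["last"]["id"]
changed_ids = set()

while page_url:
    page = requests.get(page_url, headers=HEADERS).json()
    # Walk the page's activities newest-first.
    for activity in reversed(page.get("orderedItems", [])):
        ended = datetime.fromisoformat(activity["endTime"]).replace(tzinfo=timezone.utc)
        if ended < LAST_SYNC:
            page_url = None  # everything earlier was already ingested
            break
        changed_ids.add(activity["object"]["id"])
    else:
        # No break: keep crawling backwards to the previous page.
        page_url = page.get("prev", {}).get("id")

print(f"{len(changed_ids)} records changed since {LAST_SYNC}")
```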
@zackkrida I hadn't realised ActivityStream could be used to crawl the collection like that. That's very useful for a pet project of mine, thanks!
At lunch today I experimented a bit with the ActivityStream endpoint using a small Python script (I did use ChatGPT to clean my code up a bit):
```python
import asyncio
import json
import logging

import aiohttp

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

ACTIVITY_STREAM_ENDPOINT = "https://data.getty.edu/museum/collection/activity-stream"
HEADERS = {"Accept": "application/json"}
CONCURRENT_REQUESTS = 50  # Number of pages to fetch concurrently
OUTPUT_FILE = "filtered_data.json"  # Output file for filtered data


async def fetch_activity_stream_page(session, page):
    """Fetch a single activity-stream page as JSON, or None on error."""
    url = f"{ACTIVITY_STREAM_ENDPOINT}/page/{page}"
    logger.info(f"Fetching data from {url}...")
    async with session.get(url, headers=HEADERS) as response:
        if response.status == 200:
            return await response.json()
        logger.error(f"Error fetching {url}: {response.status}")
        return None


def process_data(data):
    """Keep only the activities whose object is a HumanMadeObject."""
    return [
        item
        for item in data.get("orderedItems", [])
        if item.get("object", {}).get("type") == "HumanMadeObject"
    ]


async def fetch_total_pages(session):
    """The collection summary's `last` link ends with the final page number."""
    logger.info(f"Fetching total number of pages from {ACTIVITY_STREAM_ENDPOINT}...")
    async with session.get(ACTIVITY_STREAM_ENDPOINT, headers=HEADERS) as response:
        if response.status == 200:
            data = await response.json()
            total_pages = data.get("last", {}).get("id", "").split("/")[-1]
            return int(total_pages) if total_pages.isdigit() else 0
        logger.error(f"Error: {response.status}")
        return 0


async def fetch_all_pages():
    all_data = []
    unique_types = set()
    async with aiohttp.ClientSession() as session:
        total_pages = await fetch_total_pages(session)
        logger.info(f"Total pages to fetch: {total_pages}")
        # Pages are 1-indexed; fetch them in batches of CONCURRENT_REQUESTS.
        for i in range(0, total_pages, CONCURRENT_REQUESTS):
            tasks = [
                fetch_activity_stream_page(session, page)
                for page in range(i + 1, min(i + CONCURRENT_REQUESTS + 1, total_pages + 1))
            ]
            pages_data = await asyncio.gather(*tasks)
            for data in pages_data:
                if data:
                    filtered_data = process_data(data)
                    all_data.extend(filtered_data)
                    for item in filtered_data:
                        unique_types.add(item["object"]["type"])
    return all_data, unique_types


def main():
    # asyncio.run replaces the deprecated get_event_loop/run_until_complete pattern.
    all_data, unique_types = asyncio.run(fetch_all_pages())
    if all_data:
        logger.info(f"Writing filtered data to {OUTPUT_FILE}")
        with open(OUTPUT_FILE, "w") as f:
            json.dump(all_data, f, indent=2)
    else:
        logger.info("No relevant data returned.")
    if unique_types:
        logger.info("Unique types found:")
        for unique_type in unique_types:
            print(unique_type)
    else:
        logger.info("No unique types found.")


if __name__ == "__main__":
    main()
```
This saves all of the `HumanMadeObject`s from Getty into a JSON file. Here's a sample of a few results:
```json
[
  {
    "id": "https://data.getty.edu/museum/collection/activity-stream/b109894e-89b2-4a14-bfb3-a1cfc4cdb979",
    "type": "Create",
    "created": "2020-05-12T01:04:39",
    "endTime": "2020-05-12T01:04:39",
    "object": {
      "id": "https://data.getty.edu/museum/collection/object/08eaed9f-1354-4817-8aed-1db49e893a03",
      "type": "HumanMadeObject"
    }
  },
  {
    "id": "https://data.getty.edu/museum/collection/activity-stream/1d82729f-dff4-4719-a277-7effbe87c122",
    "type": "Create",
    "created": "2020-05-12T01:04:40",
    "endTime": "2020-05-12T01:04:40",
    "object": {
      "id": "https://data.getty.edu/museum/collection/object/637c32e7-f087-459e-816a-292e27fa95b0",
      "type": "HumanMadeObject"
    }
  },
  {
    "id": "https://data.getty.edu/museum/collection/activity-stream/6c87e86a-6126-46c9-92f8-3b53da92f37b",
    "type": "Create",
    "created": "2020-05-12T01:04:40",
    "endTime": "2020-05-12T01:04:40",
    "object": {
      "id": "https://data.getty.edu/museum/collection/object/948313c3-52f4-44b3-ba05-9367eb9aec84",
      "type": "HumanMadeObject"
    }
  },
  {
    "id": "https://data.getty.edu/museum/collection/activity-stream/2ede7144-cd3d-4f0c-bec3-5c37f8522c6f",
    "type": "Create",
    "created": "2020-05-12T01:04:40",
    "endTime": "2020-05-12T01:04:40",
    "object": {
      "id": "https://data.getty.edu/museum/collection/object/31a2c0d3-241c-4a5c-9e32-abdaf418e58d",
      "type": "HumanMadeObject"
    }
  }
]
```
Each `object.id` then links off to the full metadata for a work! I'm so excited to build a DAG for Openverse off of this API.

The `type: Create` information is also going to be so useful for reingesting and updating records. This endpoint gives us a receipt of every update to a record, so we can amend only the records that have been modified and never reingest data unnecessarily.
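Concretely, reingestion could dispatch on that type (a sketch; the three handlers are hypothetical placeholders, not existing Openverse functions):

```python
# A sketch of driving reingestion from the activity type; ingest_record,
# refresh_record, and remove_record are hypothetical placeholders.
def apply_activity(activity):
    object_id = activity["object"]["id"]
    if activity["type"] == "Create":
        ingest_record(object_id)   # fetch and add the new record
    elif activity["type"] == "Update":
        refresh_record(object_id)  # re-fetch and amend the cached record
    elif activity["type"] == "Delete":
        remove_record(object_id)   # drop the record from our index
```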
## Source API Endpoint / Documentation

https://data.getty.edu/museum/collection/docs/

## Provider description

The Getty Museum is a well-known American art museum, which just announced the addition of 88,000 CC0-licensed images to the collection, available through their API.

## Licenses Provided

CC0 and others

## Provider API Technical info

I don't know.

## Checklist to complete before beginning development

## Implementation