cc-archive / cccatalog-api

[PROJECT TRANSFERRED] The Creative Commons Catalog API allows programmatic access to search for CC-licensed and public domain digital media.
https://github.com/WordPress/openverse-api
MIT License

Expose image type in API #340

Closed: kgodey closed this issue 5 years ago

kgodey commented 5 years ago

We'd like to expose, in the API database, the type(s) of image for each image record in the catalog. This will be added to the frontend as a filter that users can apply to their searches.

Note that one record can have multiple image types, e.g. an SVG can be both a vector and an illustration. On the frontend, if a user selects more than one filter, the filters are combined with "or", i.e. we show them results for images that match any of the selected filters.
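A minimal sketch of the "or" behaviour described above, assuming each record carries a list of categories. The `Category` names follow those used later in this thread; the enum values, the `matches_any` helper, and the sample records are hypothetical and only illustrate the matching rule.

```python
from enum import Enum


class Category(Enum):
    # Names mirror the categories discussed in this thread; values are assumed.
    PHOTOGRAPH = "photograph"
    ILLUSTRATION = "illustration"
    VECTOR = "vector"


def matches_any(record_categories, selected):
    """Return True when the record has at least one selected category ("or")."""
    if not selected:
        # No filter selected: every record matches.
        return True
    return bool(set(record_categories) & set(selected))


# Hypothetical records: an SVG tagged as both vector and illustration, and a photo.
records = {
    "logo.svg": [Category.VECTOR, Category.ILLUSTRATION],
    "beach.jpg": [Category.PHOTOGRAPH],
}

selected = [Category.ILLUSTRATION, Category.PHOTOGRAPH]
results = [name for name, cats in records.items() if matches_any(cats, selected)]
```

Here both records match, because each one carries at least one of the two selected categories.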

We'd like to expose the following image types:

aldenstpage commented 5 years ago

Assigned to Kriti to flesh this out some more - agree on final medium tags and decide which provider fits which tag

aldenstpage commented 5 years ago

Based on our conversations in Slack, I've gathered that the intent is to use the image provider and file extension of an image to try to make a reasonable guess at what medium the image might be.
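To make the idea concrete, here is a rough sketch of what such a guess could look like. It is not the code from the PR: the rule tables, the `guess_categories` helper, and the enum values are hypothetical, and the two rules shown are only the examples mentioned in this thread (SVG files, Met images).

```python
import os
from enum import Enum
from urllib.parse import urlparse


class Category(Enum):
    # Values are assumed for illustration.
    DIGITIZED_ARTWORK = "digitized_artwork"
    VECTOR = "vector"
    ILLUSTRATION = "illustration"


# Hypothetical rule tables, not the project's actual configuration.
EXTENSION_CATEGORIES = {".svg": [Category.VECTOR, Category.ILLUSTRATION]}
PROVIDER_CATEGORIES = {"met": [Category.DIGITIZED_ARTWORK]}


def guess_categories(provider, url):
    """Combine provider-based and extension-based guesses, without duplicates."""
    # Take the extension from the URL path so query strings are ignored.
    extension = os.path.splitext(urlparse(url).path)[1].lower()
    guesses = (
        PROVIDER_CATEGORIES.get(provider, [])
        + EXTENSION_CATEGORIES.get(extension, [])
    )
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    return list(dict.fromkeys(guesses))
```

An image whose provider and extension match no rule gets an empty list, i.e. no guess at all.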

Unfortunately, this approach has some pitfalls that may limit its usefulness.

We might be able to use machine vision to determine mediums; until we're in a position to do that, I suggest that we abandon this feature. Adding "painting", "photograph", "artwork", etc. as a keyword in your searches will give better results.

I've created a proof-of-concept implementation of the proposed approach for feedback purposes; let me know what you think - https://github.com/creativecommons/cccatalog-api/pull/386

kgodey commented 5 years ago

@aldenstpage I didn't think we would categorize all providers; just the ones that we're sure about. For example, I think we could categorize all Met images as digitized artworks, but I don't think we could categorize them as photographs or illustrations because we don't know that for sure.

So if someone filters by "photograph", they should only see images that we're sure are photos, even if it filters out a bunch of providers that may provide photos but we're not sure.
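Roughly, the behaviour I have in mind could be sketched like this. The helper names, the enum values, and the `mixedprovider` entry are made up for illustration; only the Met example comes from the comment above.

```python
from enum import Enum


class Category(Enum):
    # Values are assumed for illustration.
    PHOTOGRAPH = "photograph"
    DIGITIZED_ARTWORK = "digitized_artwork"


# Only providers whose medium we are sure about are listed.
CONFIDENT_CATEGORIES = {"met": [Category.DIGITIZED_ARTWORK]}


def categories_for(provider):
    """Unlisted providers get no category at all."""
    return CONFIDENT_CATEGORIES.get(provider, [])


def filter_by_category(images, category):
    """Keep only images whose provider is confidently tagged with `category`."""
    return [img for img in images if category in categories_for(img["provider"])]


images = [
    {"id": 1, "provider": "met"},
    {"id": 2, "provider": "mixedprovider"},  # hypothetical mixed-medium provider
]
photos = filter_by_category(images, Category.PHOTOGRAPH)
artworks = filter_by_category(images, Category.DIGITIZED_ARTWORK)
```

With this data, the photograph filter returns nothing and the digitized artwork filter returns only the Met image; the mixed-medium provider is excluded from both.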

@annatuma thoughts?

aldenstpage commented 5 years ago

That's a valid potential approach, although it's going to be a short list since most providers are mixed-medium. I think it would look like this:

    'met': [Category.DIGITIZED_ARTWORK],
    'svgsilh': [Category.VECTOR],
    'animaldiversity': [Category.PHOTOGRAPH],
    'WoRMS': [Category.PHOTOGRAPH],
    'CAPL': [Category.PHOTOGRAPH]

kgodey commented 5 years ago

Wouldn't other museums (e.g. CMA) also have the DIGITIZED_ARTWORK category? Also, PhyloPic would be VECTOR and ILLUSTRATION.

aldenstpage commented 5 years ago

I updated the category list in PR #386; feel free to tweak it if I missed anything else. The changes are being indexed in dev now, so you'll be able to try this feature out in the morning.

annatuma commented 5 years ago

I want to make sure I understand properly.

As an example:

'deviantart': [
    Category.PHOTOGRAPH, Category.DIGITIZED_ARTWORK, Category.ILLUSTRATION,
    Category.VECTOR
]

This means that if someone filters by photograph OR by digitized artwork OR by illustration OR by vector, they'll always be searching the entire DeviantArt collection? And since DeviantArt contains all of these, they might filter by vector but end up seeing photographs?

In that case, I think you were on the right path earlier, Alden, where it would be your short list, plus the couple that Kriti pointed out (other museums + PhyloPic). If we aren't sure what medium a file is going to be, then it makes more sense to me to not show it at all, versus potentially returning a result in the wrong medium.

What say you?

aldenstpage commented 5 years ago

Yes @annatuma that's right. Here's the updated list:

provider_category = {
    '__default': [],
    'thorvaldsenmuseum': [
        Category.DIGITIZED_ARTWORK
    ],
    'svgsilh': [Category.VECTOR, Category.ILLUSTRATION],
    'phylopic': [Category.VECTOR, Category.ILLUSTRATION],
    'floraon': [Category.PHOTOGRAPH],
    'animaldiversity': [Category.PHOTOGRAPH],
    'WoRMS': [Category.PHOTOGRAPH],
    'clevelandmuseum': [Category.DIGITIZED_ARTWORK],
    'CAPL': [Category.PHOTOGRAPH],
    'sciencemuseum': [Category.PHOTOGRAPH],
    'rijksmuseum': [Category.DIGITIZED_ARTWORK],
    'museumsvictoria': [Category.DIGITIZED_ARTWORK],
    'met': [Category.DIGITIZED_ARTWORK],
    'mccordmuseum': [Category.DIGITIZED_ARTWORK],
    'digitaltmuseum': [Category.DIGITIZED_ARTWORK],
    'deviantart': [Category.DIGITIZED_ARTWORK],
    'brooklynmuseum': [Category.DIGITIZED_ARTWORK]
}

There are still some design issues here. With this list, illustration is exactly equivalent to vector (both apply only to svgsilh and phylopic), so we might as well remove it as an option. I don't think there are any data sources that we could tag as entirely illustrations. It would probably be better if this were data driven by the catalog instead of manually encoded into our search indexer, in the interest of keeping everything up to date.
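The vector/illustration overlap can be checked by inverting the mapping above. This is a quick sketch: the `Category` enum definition and the `providers_by_category` helper are assumptions (the enum itself is not shown in this thread); the mapping is copied from the list.

```python
from collections import defaultdict
from enum import Enum


class Category(Enum):
    # Assumed definition; only the member names appear in this thread.
    PHOTOGRAPH = "photograph"
    DIGITIZED_ARTWORK = "digitized_artwork"
    ILLUSTRATION = "illustration"
    VECTOR = "vector"


provider_category = {
    '__default': [],
    'thorvaldsenmuseum': [Category.DIGITIZED_ARTWORK],
    'svgsilh': [Category.VECTOR, Category.ILLUSTRATION],
    'phylopic': [Category.VECTOR, Category.ILLUSTRATION],
    'floraon': [Category.PHOTOGRAPH],
    'animaldiversity': [Category.PHOTOGRAPH],
    'WoRMS': [Category.PHOTOGRAPH],
    'clevelandmuseum': [Category.DIGITIZED_ARTWORK],
    'CAPL': [Category.PHOTOGRAPH],
    'sciencemuseum': [Category.PHOTOGRAPH],
    'rijksmuseum': [Category.DIGITIZED_ARTWORK],
    'museumsvictoria': [Category.DIGITIZED_ARTWORK],
    'met': [Category.DIGITIZED_ARTWORK],
    'mccordmuseum': [Category.DIGITIZED_ARTWORK],
    'digitaltmuseum': [Category.DIGITIZED_ARTWORK],
    'deviantart': [Category.DIGITIZED_ARTWORK],
    'brooklynmuseum': [Category.DIGITIZED_ARTWORK],
}


def providers_by_category(mapping):
    """Invert provider -> categories into category -> set of providers."""
    inverted = defaultdict(set)
    for provider, categories in mapping.items():
        for category in categories:
            inverted[category].add(provider)
    return inverted


by_category = providers_by_category(provider_category)
# True when both filters would select exactly the same providers.
same_providers = by_category[Category.VECTOR] == by_category[Category.ILLUSTRATION]
```

Since both categories map to the same two providers, filtering by either one selects the same set of sources.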

Try it here - https://api-dev.creativecommons.engineering/image/search?q=test&categories=digitized_artwork
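For scripted testing, the same query string can be built programmatically. This sketch only constructs the URL and makes no request; `build_search_url` is a hypothetical helper, and sending several categories as one comma-separated value is an assumption that this thread does not confirm.

```python
from urllib.parse import urlencode

BASE_URL = "https://api-dev.creativecommons.engineering/image/search"


def build_search_url(query, categories):
    """Build a search URL with a `categories` filter."""
    # Assumption: multiple categories are joined with commas in one parameter.
    params = {"q": query, "categories": ",".join(categories)}
    # safe="," keeps the commas readable instead of percent-encoding them.
    return f"{BASE_URL}?{urlencode(params, safe=',')}"


url = build_search_url("test", ["digitized_artwork"])
```

With a single category, `url` is identical to the link above.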

aldenstpage commented 5 years ago

@brenoferreira fyi

brenoferreira commented 5 years ago

Cool. I'll add this to this PR https://github.com/creativecommons/cccatalog-frontend/pull/533

aldenstpage commented 5 years ago

Pending indexing and deployment to prod