Expose image type in API

kgodey commented 5 years ago

We'd like the API to expose, for each image record in the catalog, what type(s) of image it is. The frontend will then offer this as a search filter.

Note that one record can have multiple image types, e.g. an SVG can be both a vector and an illustration. On the frontend, if a user selects more than one filter, the filters combine with "or", i.e. we show results for images that match any of the selected filters.

We'd like to expose the following image types:

Illustration (includes Clip Art for now)
Vector
Photograph
Digitized Artwork
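
A minimal sketch of the "or" behaviour described above, assuming each record carries a set of categories. The `Category` enum mirrors the four types listed; its string values and the `matches_any` helper are illustrative assumptions, not the actual API code.

```python
from enum import Enum


class Category(Enum):
    # String values are assumed for illustration.
    ILLUSTRATION = "illustration"
    VECTOR = "vector"
    PHOTOGRAPH = "photograph"
    DIGITIZED_ARTWORK = "digitized_artwork"


def matches_any(record_categories, selected_filters):
    """Return True if the record has at least one of the selected categories.

    An empty selection is treated as "no filter", so every record matches.
    """
    if not selected_filters:
        return True
    return bool(set(record_categories) & set(selected_filters))


# An SVG record can be both a vector and an illustration.
svg_record = {Category.VECTOR, Category.ILLUSTRATION}

print(matches_any(svg_record, {Category.VECTOR, Category.PHOTOGRAPH}))  # True
print(matches_any(svg_record, {Category.PHOTOGRAPH}))  # False
```

Selecting more filters can only widen the result set under this rule, never narrow it.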

aldenstpage commented 5 years ago

Assigned to Kriti to flesh this out some more: agree on the final medium tags and decide which provider fits which tag.

aldenstpage commented 5 years ago

Based on our conversations in Slack, I've gathered that the intent is to use the image provider and file extension of an image to try to make a reasonable guess at what medium the image might be.

Unfortunately, this approach has some pitfalls that may limit its usefulness.

A provider often has many types of images, meaning most providers will end up with very similar sets of categories.
Some providers do not fit neatly into these categories. Which category does Thingiverse fit into?
Pretty much anything can be considered an illustration.
The file extension can tell us whether an image is a vector and nothing else.
If a provider has a single .svg image in its collection, does that mean the provider should also get the vector tag?
Every time a new data source is added to CC Search, someone needs to update categorize.py and add tags for it.
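
To make these pitfalls concrete, here is a rough sketch of what a provider-plus-extension guess could look like. It is not the code in categorize.py; `PROVIDER_HINTS` and `guess_categories` are hypothetical names, and the example URLs are placeholders.

```python
import os
from urllib.parse import urlparse

# Hypothetical, hand-maintained hints. Every new data source would need
# an entry here, which is the maintenance burden mentioned above.
PROVIDER_HINTS = {
    "met": {"digitized_artwork"},
    "animaldiversity": {"photograph"},
}


def guess_categories(provider, url):
    """Guess categories from the provider name and the file extension."""
    # Copy the hint set so the shared mapping is never mutated.
    categories = set(PROVIDER_HINTS.get(provider, set()))
    # The extension can only tell us "vector" and nothing else.
    extension = os.path.splitext(urlparse(url).path)[1].lower()
    if extension == ".svg":
        categories.add("vector")
    return categories


print(guess_categories("met", "https://example.org/a/b.jpg"))
print(guess_categories("thingiverse", "https://example.org/x.SVG?size=1"))
print(guess_categories("thingiverse", "https://example.org/x.png"))
```

A provider with no hint and a non-SVG file ends up with an empty set, so a mixed-medium source such as Thingiverse gets no useful tag from this approach.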

We might be able to use machine vision to determine mediums; until we're in a position to do that, I suggest that we abandon this feature. Adding "painting", "photograph", "artwork", etc. as a keyword in your searches will give better results.

I've created a proof-of-concept implementation of the proposed approach for feedback purposes; let me know what you think - https://github.com/creativecommons/cccatalog-api/pull/386

kgodey commented 5 years ago

@aldenstpage I didn't think we would categorize all providers; just the ones that we're sure about. For example, I think we could categorize all Met images as digitized artworks, but I don't think we could categorize them as photographs or illustrations because we don't know that for sure.

So if someone filters by "photograph", they should only see images that we're sure are photos, even if that filters out a bunch of providers that may have photos but that we're not sure about.

@annatuma thoughts?

aldenstpage commented 5 years ago

That's a valid potential approach, although it's going to be a short list since most providers are mixed-medium. I think it would look like this:

    'met': [Category.DIGITIZED_ARTWORK],
    'svgsilh': [Category.VECTOR],
    'animaldiversity': [Category.PHOTOGRAPH],
    'WoRMS': [Category.PHOTOGRAPH],
    'CAPL': [Category.PHOTOGRAPH]

kgodey commented 5 years ago

Wouldn't other museums (e.g. CMA) also have the DIGITIZED_ARTWORK category? Also, PhyloPic would be VECTOR and ILLUSTRATION.

aldenstpage commented 5 years ago

I updated the category list in PR #386; feel free to tweak it if I missed anything else. The changes are being indexed in dev now, so you'll be able to try this feature out in the morning.

annatuma commented 5 years ago

I want to make sure I understand properly.

As an example:

'deviantart': [
    Category.PHOTOGRAPH, Category.DIGITIZED_ARTWORK, Category.ILLUSTRATION,
    Category.VECTOR
],
This means that if someone filters by photograph OR by digitized artwork OR by illustration OR by vector, they'll always be searching the entire Deviant Art collection? And since Deviant Art contains all of these, they might filter by vector, but end up seeing photographs?

In that case, I think you were on the right path earlier, Alden, where it would be your short list, plus the couple that Kriti pointed out (other museums + PhyloPic). If we aren't sure what medium a file is going to be, then it makes more sense to me to not show it at all, versus potentially returning a result in the wrong medium.

What say you?

aldenstpage commented 5 years ago

Yes @annatuma that's right. Here's the updated list:

provider_category = {
    '__default': [],
    'thorvaldsenmuseum': [
        Category.DIGITIZED_ARTWORK
    ],
    'svgsilh': [Category.VECTOR, Category.ILLUSTRATION],
    'phylopic': [Category.VECTOR, Category.ILLUSTRATION],
    'floraon': [Category.PHOTOGRAPH],
    'animaldiversity': [Category.PHOTOGRAPH],
    'WoRMS': [Category.PHOTOGRAPH],
    'clevelandmuseum': [Category.DIGITIZED_ARTWORK],
    'CAPL': [Category.PHOTOGRAPH],
    'sciencemuseum': [Category.PHOTOGRAPH],
    'rijksmuseum': [Category.DIGITIZED_ARTWORK],
    'museumsvictoria': [Category.DIGITIZED_ARTWORK],
    'met': [Category.DIGITIZED_ARTWORK],
    'mccordmuseum': [Category.DIGITIZED_ARTWORK],
    'digitaltmuseum': [Category.DIGITIZED_ARTWORK],
    'deviantart': [Category.DIGITIZED_ARTWORK],
    'brooklynmuseum': [Category.DIGITIZED_ARTWORK]
}

There are still some design issues here. In this list, illustration is applied to exactly the same providers as vector, so we might as well remove it as an option. I don't think there are any data sources that we could tag as entirely illustrations. It would probably be better if this were driven by data from the catalog instead of being manually encoded into our search indexer, so that everything stays up to date.
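
For reference, a small sketch of how a mapping like this behaves at filter time, assuming lookups fall back to the `'__default'` entry. The `categories_for` and `search` helpers, the enum's string values, and the `mixed_medium_provider` name are hypothetical; only an abbreviated copy of the mapping is shown.

```python
from enum import Enum


class Category(Enum):
    # String values are assumed for illustration.
    ILLUSTRATION = "illustration"
    VECTOR = "vector"
    PHOTOGRAPH = "photograph"
    DIGITIZED_ARTWORK = "digitized_artwork"


# Abbreviated copy of the mapping above; only a few providers are shown.
provider_category = {
    "__default": [],
    "phylopic": [Category.VECTOR, Category.ILLUSTRATION],
    "floraon": [Category.PHOTOGRAPH],
    "met": [Category.DIGITIZED_ARTWORK],
}


def categories_for(provider):
    """Look up a provider's categories, falling back to the empty default."""
    return provider_category.get(provider, provider_category["__default"])


def search(images, category):
    """Hypothetical in-memory filter: keep images whose provider has the category."""
    return [img for img in images if category in categories_for(img["provider"])]


images = [
    {"id": 1, "provider": "met"},
    {"id": 2, "provider": "mixed_medium_provider"},  # not in the mapping
    {"id": 3, "provider": "phylopic"},
]

print([img["id"] for img in search(images, Category.DIGITIZED_ARTWORK)])  # [1]
print([img["id"] for img in search(images, Category.VECTOR)])  # [3]
```

Because the default is an empty list, images from unlisted providers never appear in a category-filtered search, which matches the "only show what we're sure about" decision above.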

Try it here - https://api-dev.creativecommons.engineering/image/search?q=test&categories=digitized_artwork
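
If you'd rather build the request URL in code, a minimal sketch using only the standard library follows. The `q` and `categories` parameters are taken from the link above; the `build_search_url` helper is hypothetical, and how multiple categories should be passed is not covered here.

```python
from urllib.parse import urlencode

# Dev endpoint from the link above.
BASE_URL = "https://api-dev.creativecommons.engineering/image/search"


def build_search_url(query, category):
    """Build a search URL with a single category filter."""
    return BASE_URL + "?" + urlencode({"q": query, "categories": category})


url = build_search_url("test", "digitized_artwork")
print(url)
```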

aldenstpage commented 5 years ago

@brenoferreira fyi

brenoferreira commented 5 years ago

Cool. I'll add this to this PR https://github.com/creativecommons/cccatalog-frontend/pull/533

aldenstpage commented 5 years ago

Pending indexing and deployment to prod

cc-archive / cccatalog-api

Expose image type in API #340