kgodey closed 5 years ago
Assigned to Kriti to flesh this out some more - agree on final medium tags and decide which provider fits which tag
Based on our conversations in Slack, I've gathered that the intent is to use the image provider and file extension of an image to try to make a reasonable guess at what medium the image might be.
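To make the idea concrete, here is a minimal sketch of guessing mediums from provider and file extension. It is not the code from any PR; the function name, the two lookup tables, and their entries are assumptions chosen for illustration.

```python
from os.path import splitext
from urllib.parse import urlparse

# Hypothetical lookup tables; the real lists would need to be agreed on.
PROVIDER_GUESS = {"met": {"digitized_artwork"}}
EXTENSION_GUESS = {".svg": {"vector"}}


def guess_mediums(provider, url):
    """Return the set of mediums suggested by the provider and file extension."""
    # Take the extension from the URL path only, ignoring any query string.
    extension = splitext(urlparse(url).path)[1].lower()
    by_provider = PROVIDER_GUESS.get(provider, set())
    by_extension = EXTENSION_GUESS.get(extension, set())
    return by_provider | by_extension
```

An image from an unknown provider with an unrecognized extension gets an empty set, i.e. no guess at all.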
Unfortunately, this approach has some pitfalls that may limit its usefulness.
categorize.py
and add tags to it. We might be able to use machine vision to determine mediums; until we're in a position to do that, I suggest that we abandon this feature. Adding "painting", "photograph", "artwork", etc. as a keyword in your searches will give better results.
I've created a proof-of-concept implementation of the proposed approach for feedback purposes; let me know what you think - https://github.com/creativecommons/cccatalog-api/pull/386
@aldenstpage I didn't think we would categorize all providers; just the ones that we're sure about. For example, I think we could categorize all Met images as digitized artworks, but I don't think we could categorize them as photographs or illustrations because we don't know that for sure.
So if someone filters by "photograph", they should only see images that we're sure are photos, even if it filters out a bunch of providers that may provide photos but we're not sure.
@annatuma thoughts?
That's a valid potential approach, although it's going to be a short list since most providers are mixed-medium. I think it would look like this:
'met': [Category.DIGITIZED_ARTWORK],
'svgsilh': [Category.VECTOR],
'animaldiversity': [Category.PHOTOGRAPH],
'WoRMS': [Category.PHOTOGRAPH],
'CAPL': [Category.PHOTOGRAPH]
Wouldn't other museums (e.g. CMA) also have the DIGITIZED_ARTWORK category? Also, PhyloPic would be VECTOR and ILLUSTRATION.
I updated the category list in PR #386; feel free to tweak it if I missed anything else. The changes are being indexed in dev now, so you'll be able to try this feature out in the morning.
I want to make sure I understand properly.
As an example:
'deviantart': [
    Category.PHOTOGRAPH, Category.DIGITIZED_ARTWORK, Category.ILLUSTRATION,
    Category.VECTOR
],
Does this mean that if someone filters by photograph OR by digitized artwork OR by illustration OR by vector, they'll always be searching the entire DeviantArt collection? And since DeviantArt contains all of these, might they filter by vector but end up seeing photographs?
In that case, I think you were on the right track earlier, Alden: it would be your short list, plus the couple that Kriti pointed out (other museums + PhyloPic). If we aren't sure what medium a file is going to be, then it makes more sense to me to not show it at all than to risk returning a result in the wrong medium.
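The concern can be shown with a small sketch. The sample records and the `filter_by` helper are hypothetical; the "actual medium" field is something the indexer would not know and is included only to make the mismatch visible.

```python
# Hypothetical sample records: (provider, actual medium).
images = [
    ("deviantart", "photograph"),
    ("deviantart", "vector"),
    ("svgsilh", "vector"),
]

# Tagging a mixed-medium provider with every medium it might contain.
mixed_tags = {
    "deviantart": {"photograph", "digitized_artwork", "illustration", "vector"},
    "svgsilh": {"vector"},
}

# Tagging only the providers we are sure about.
strict_tags = {"svgsilh": {"vector"}}


def filter_by(category, provider_tags):
    """Return the images whose provider carries the given category tag."""
    return [img for img in images if category in provider_tags.get(img[0], set())]


mixed_results = filter_by("vector", mixed_tags)
strict_results = filter_by("vector", strict_tags)
```

With `mixed_tags`, filtering by vector also returns the DeviantArt photograph; with `strict_tags`, it returns only the svgsilh image, at the cost of hiding the DeviantArt vector.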
What say you?
Yes @annatuma that's right. Here's the updated list:
provider_category = {
    '__default': [],
    'thorvaldsenmuseum': [
        Category.DIGITIZED_ARTWORK
    ],
    'svgsilh': [Category.VECTOR, Category.ILLUSTRATION],
    'phylopic': [Category.VECTOR, Category.ILLUSTRATION],
    'floraon': [Category.PHOTOGRAPH],
    'animaldiversity': [Category.PHOTOGRAPH],
    'WoRMS': [Category.PHOTOGRAPH],
    'clevelandmuseum': [Category.DIGITIZED_ARTWORK],
    'CAPL': [Category.PHOTOGRAPH],
    'sciencemuseum': [Category.PHOTOGRAPH],
    'rijksmuseum': [Category.DIGITIZED_ARTWORK],
    'museumsvictoria': [Category.DIGITIZED_ARTWORK],
    'met': [Category.DIGITIZED_ARTWORK],
    'mccordmuseum': [Category.DIGITIZED_ARTWORK],
    'digitaltmuseum': [Category.DIGITIZED_ARTWORK],
    'deviantart': [Category.DIGITIZED_ARTWORK],
    'brooklynmuseum': [Category.DIGITIZED_ARTWORK]
}
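For anyone reading along, here is a minimal, self-contained sketch of how a mapping like this might be consulted. The `Category` enum definition, its string values (only `digitized_artwork` appears in the dev URL; the rest are guesses), and the `categories_for` helper are assumptions, and the mapping is abbreviated to three providers.

```python
from enum import Enum


class Category(Enum):
    # String values are assumed, apart from "digitized_artwork".
    PHOTOGRAPH = "photograph"
    DIGITIZED_ARTWORK = "digitized_artwork"
    ILLUSTRATION = "illustration"
    VECTOR = "vector"


# Abbreviated copy of the mapping, for illustration only.
provider_category = {
    "__default": [],
    "met": [Category.DIGITIZED_ARTWORK],
    "phylopic": [Category.VECTOR, Category.ILLUSTRATION],
    "floraon": [Category.PHOTOGRAPH],
}


def categories_for(provider):
    """Look up a provider's categories, falling back to the '__default' entry."""
    return provider_category.get(provider, provider_category["__default"])
```

Providers that are not listed fall back to the empty default, so their images would never match a category filter.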
There are still some design issues here. In this mapping, illustration is exactly equivalent to vector (the only providers tagged as illustration, svgsilh and phylopic, are also tagged as vector), so we might as well remove it as an option. I don't think there are any data sources that we could tag as entirely illustrations. It would probably be better if this were data-driven by the catalog instead of manually encoded into our search indexer, in the interest of keeping everything up to date.
Try it here - https://api-dev.creativecommons.engineering/image/search?q=test&categories=digitized_artwork
@brenoferreira fyi
Cool. I'll add this to this PR https://github.com/creativecommons/cccatalog-frontend/pull/533
Pending indexing and deployment to prod
For each image record in the catalog, we'd like the API database to expose what type(s) of image it is. This will be added to the frontend as a filter for users.
Note that one record can have multiple image types, e.g. an SVG can be both a vector and an illustration. On the frontend, if a user selects more than one filter, it is "or", i.e. we show them results for images that match any of the filters selected.
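The "or" behaviour can be sketched as a set intersection. The `matches` helper is hypothetical, and treating an empty selection as "no filtering" is an assumption rather than something specified here.

```python
def matches(image_types, selected_filters):
    """Return True if the image has at least one of the selected types."""
    if not selected_filters:
        # Assumption: with no filter selected, every image is shown.
        return True
    return bool(set(image_types) & set(selected_filters))


# An SVG record tagged as both a vector and an illustration.
svg_record_types = ["vector", "illustration"]

shown_for_vector_or_photo = matches(svg_record_types, ["vector", "photograph"])
shown_for_photo_only = matches(svg_record_types, ["photograph"])
```

The SVG record is shown when the user selects vector and photograph together, because it matches one of them, but not when only photograph is selected.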
We'd like to expose the following image types: