Open zackkrida opened 2 years ago
Just to be clear - are these descriptions that would be made available by a provider's API, or something we would need to generate?
made available by a provider's API
This one! For example:
I was recording into my Roland R26, via a mixer board, several sound effects from an old vynil LP of BBC Sound effects. At the time I did not notice that I had recorded this sound, when I put the needle down in an area between two tracks. Inspecting the surface afterwards, one could see some slight scratching in that area of the LP.
Early morning surf
It seems more relevant to audio, but applicable to all media types.
Perfect, thank you for those examples! I think it's probably worth keeping for all media types (I could imagine images, books, 3D models, all having descriptions). I would be concerned about the performance impact of making that field searchable though. Perhaps it's not a huge issue if we're indexing it in Elasticsearch.
I would be concerned about the performance impact of making that field searchable though.
I agree! I'm cool with treating it like single-result metadata only, not a searchable field.
I'm not sure either, and would love to learn more, but here are some things I have noticed:
There is no image.description in the image DDL or audio.description in the audio DDL, at least in the local staging DB.
Does this mean that whenever the data is ready, we're ready to index it in Elasticsearch to make it searchable, but we would have to either fully refresh the entire catalog database or accept that it would only be available on a limited set of records at first? Are there any performance concerns on the Elasticsearch side if we go with the former approach? Any bias or relevance concerns on the Elasticsearch side if we go with the incremental approach?
Thanks for doing that dive Rebecca! By the looks of it, the Elasticsearch mapping is looking for that information within the media's meta_data JSON blob, so specifically meta_data["description"]. The implication for me here is that description can be added where reasonable/possible in provider scripts, but we don't have a column for it on any of the media tables. If anything, I think if we start to add descriptions, we might get more relevant results!
The question for me is, do we want to surface it as a nullable column or leave it as an optional meta_data field? I'm inclined to use the latter, since modifying the schema of the catalog can be an ordeal and we already have the machinery set up to parse it from meta_data. @stacimc @obulat do you have any additional thoughts here?
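To make the "optional meta_data field" option concrete, here is a minimal sketch of reading a description out of the JSON blob. The function name and the normalisation rules (trimming whitespace, treating empty strings as missing) are assumptions for illustration, not the catalog's actual code.

```python
from typing import Any, Optional


def description_from_meta_data(meta_data: Optional[dict[str, Any]]) -> Optional[str]:
    """Return meta_data["description"] if it is a non-empty string, else None.

    Hypothetical helper: the catalog may normalise this field differently.
    """
    if not meta_data:
        return None
    description = meta_data.get("description")
    if not isinstance(description, str):
        return None
    # Treat whitespace-only descriptions as missing.
    description = description.strip()
    return description or None
```

With this shape, records from providers that never supply a description simply yield None, so no schema change or backfill is needed.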
Thanks @AetherUnbound, that makes a ton of sense. I just looked through some of the provider scripts and I see that some do rename similar fields to be description in the metadata. And I now see there is this note in the image media store.
Perfect, thank you for those examples! I think it's probably worth keeping for all media types (I could imagine images, books, 3D models, all having descriptions). I would be concerned about the performance impact of making that field searchable though. Perhaps it's not a huge issue if we're indexing it in Elasticsearch.
The description for results is part of the searched fields in the API. We search title, tags, and description: https://github.com/WordPress/openverse-api/blob/main/api/catalog/api/controllers/search_controller.py#L340
This is effective, by the way, in that you will sometimes get results for which the only field that produced a hit was the description. We don't surface the description on the frontend, however. Currently this feature is pretty obscure.
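As a rough illustration of how a result can match on description alone, the sketch below builds an Elasticsearch-style simple_query_string body over the three fields and uses a naive substring check to show which fields a term hits. The field names, the document shape, and both helper functions are assumptions; the linked search_controller.py is the authoritative version.

```python
# Assumed field list, based on "title, tags, and description" above.
SEARCH_FIELDS = ["title", "description", "tags.name"]


def build_text_query(q: str) -> dict:
    """Build a query body that searches all three text fields at once."""
    return {"simple_query_string": {"query": q, "fields": SEARCH_FIELDS}}


def fields_matching(doc: dict, term: str) -> list[str]:
    """Naive case-insensitive substring match, for illustration only.

    Elasticsearch analyses and scores text; this only shows which
    fields contain the term.
    """
    needle = term.lower()
    texts = {
        "title": doc.get("title") or "",
        "description": doc.get("description") or "",
        "tags.name": " ".join(t.get("name", "") for t in doc.get("tags") or []),
    }
    return [field for field in SEARCH_FIELDS if needle in texts[field].lower()]


doc = {
    "title": "Early morning surf",
    "description": "Recorded from an old LP of sound effects",
    "tags": [{"name": "ocean"}],
}
description_only_hit = fields_matching(doc, "sound effects")
```

Here the only matching field is description, which is exactly the case where a user sees a result with no visible reason for the match.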
We don't surface the description on the frontend, however. Currently this feature is pretty obscure.
Yes, and this has proven quite confusing for users. We've received a few reports and comments asking "Why am I being shown this result?" because the matched text was in the description field, which is not shown in our UI.
@zackkrida, do you still think we should surface the descriptions in the API? This could probably be achieved by adding a description property to the media serializer using the meta_data.description property from the database.
We have a rather significant issue related to cultural sensitivity that would depend on that: #2594
What's the reason not to include descriptions, at least in the single-results view for a work? I could see omitting them from search if we are worried about the potential for them to cause significantly increased data transfer for search (some image descriptions could be longer than all the other metadata for a given work combined, especially from Flickr blog-post style images).
Even if something doesn't have a description, it can just be null in that case.
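A minimal sketch of that split, assuming a plain-dict record and a hypothetical serialize_media function rather than the API's real serializer classes: the single-result view includes description (null when absent), while search results leave it out to keep payloads small.

```python
from typing import Any


def serialize_media(record: dict[str, Any], include_description: bool) -> dict[str, Any]:
    """Serialize a media record; the record shape here is an assumption.

    include_description=True models the single-result view,
    include_description=False models a search result entry.
    """
    output: dict[str, Any] = {
        "id": record.get("identifier"),
        "title": record.get("title"),
    }
    if include_description:
        meta_data = record.get("meta_data") or {}
        # None serializes to JSON null when there is no description.
        output["description"] = meta_data.get("description")
    return output


record = {
    "identifier": "abc",
    "title": "Early morning surf",
    "meta_data": {"description": "Waves at dawn"},
}
detail_view = serialize_media(record, include_description=True)
search_view = serialize_media(record, include_description=False)
```

Keeping the flag on the serializer means long, blog-post style descriptions only cost bandwidth on the detail endpoint.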
That's a separate question from the title of this issue, "Not sure if we have descriptions for each media item". Can we move the discussion to the other issue, or to a new issue specifically about surfacing the description in some part of the API, rather than this issue?
Description
Relates to https://github.com/WordPress/openverse/issues/1667 in the main Openverse repository. I'm currently unsure whether we collect a textual description of each media item in the catalog. This would be a nice thing to expose via the API and display in the frontend.