WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License

Not sure if we have `descriptions` for each media item #1656

Open zackkrida opened 2 years ago

zackkrida commented 2 years ago

Description

Relates to https://github.com/WordPress/openverse/issues/1667 in the main Openverse repository. I'm currently unsure whether we collect a textual description of each media item in the catalog. This would be a nice thing to expose via the API and display in the frontend.

AetherUnbound commented 2 years ago

Just to be clear - are these descriptions that would be made available by a provider's API, or something we would need to generate?

zackkrida commented 2 years ago

> made available by a provider's API

This one! For example:

Freesound

I was recording into my Roland R26, via a mixer board, several sound effects from an old vynil LP of BBC Sound effects. At the time I did not notice that I had recorded this sound, when I put the needle down in an area between two tracks. Inspecting the surface afterwards, one could see some slight scratching in that area of the LP.

Flickr

Early morning surf

It seems more relevant to audio, but applicable to all media types.

AetherUnbound commented 2 years ago

Perfect, thank you for those examples! I think it's probably worth keeping for all media types (I could imagine images, books, 3D models, all having descriptions). I would be concerned about the performance impact of making that field searchable though. Perhaps it's not a huge issue if we're indexing it on Elasticsearch.

zackkrida commented 2 years ago

> I would be concerned about the performance impact of making that field searchable though.

I agree! I'm cool with treating it like single-result metadata only, not a searchable field.

rwidom commented 2 years ago

I'm not sure either, and would love to learn more, but here are some things I have noticed:

Does this mean that whenever the data is ready, we're ready to index it on Elasticsearch to make it searchable, but we would have to either fully refresh the entire catalog database or accept that it would be available only on a limited set of records at first? Are there any performance concerns on the Elasticsearch side if we go with the former approach? Any bias / relevance concerns on the Elasticsearch side if we go with the incremental approach?

AetherUnbound commented 2 years ago

Thanks for doing that dive Rebecca! By the looks of it, the Elasticsearch mapping is looking for that information within the media's meta_data JSON blob, so specifically meta_data["description"]. The implication for me here is that description can be added where reasonable/possible in provider scripts, but we don't have a column for it on any of the media tables. If anything, I think if we start to add descriptions, we might get more relevant results!

The question for me is: do we want to surface it as a nullable column or leave it as an optional meta_data field? I'm inclined to use the latter, since modifying the schema of the catalog can be an ordeal and we already have the machinery set up to parse it from meta_data. @stacimc @obulat do you have any additional thoughts here?
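To make the meta_data approach concrete, here is a minimal sketch of how a provider script might store a description and how a consumer might read it back. It is only an illustration: the function names and the provider-side field names (`caption`, `summary`) are hypothetical and are not taken from the Openverse codebase. The only detail drawn from the discussion is that the description lives under `meta_data["description"]`.

```python
from __future__ import annotations

from typing import Any, Optional

# Hypothetical provider-side field names that might hold a description.
CANDIDATE_FIELDS = ("description", "caption", "summary")


def build_meta_data(raw_item: dict[str, Any]) -> dict[str, Any]:
    """Copy the first non-empty description-like field into meta_data."""
    meta_data: dict[str, Any] = {}
    for field in CANDIDATE_FIELDS:
        value = raw_item.get(field)
        if isinstance(value, str) and value.strip():
            meta_data["description"] = value.strip()
            break
    return meta_data


def get_description(meta_data: Optional[dict[str, Any]]) -> Optional[str]:
    """Read the optional description back out of a meta_data blob."""
    if not meta_data:
        return None
    value = meta_data.get("description")
    if isinstance(value, str) and value.strip():
        return value.strip()
    return None
```

Because the field is optional, records ingested before a provider script starts capturing descriptions simply return `None`, with no schema change required.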

rwidom commented 2 years ago

Thanks @AetherUnbound, that makes a ton of sense. I just looked through some of the provider scripts and I see that some do rename similar fields to be description in the metadata. And I now see there is this note in the image media store.

sarayourfriend commented 1 year ago

> Perfect, thank you for those examples! I think it's probably worth keeping for all media types (I could imagine images, books, 3D models, all having descriptions). I would be concerned about the performance impact of making that field searchable though. Perhaps it's not a huge issue if we're indexing it on Elasticsearch.

The description for results is part of the searched fields in the API. We search title, tags, and description: https://github.com/WordPress/openverse-api/blob/main/api/catalog/api/controllers/search_controller.py#L340

This is effective, by the way, in that you will sometimes get results for which the only field that produced a hit was the description. We don't surface the description on the frontend, however. Currently this feature is pretty obscure.
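As a rough illustration of what searching across those three fields can look like, the sketch below builds an Elasticsearch request body using the `simple_query_string` query. It is not the contents of `search_controller.py`: the function name is hypothetical, and the index field names (`title`, `description`, `tags.name`) are assumptions made for the example.

```python
from __future__ import annotations

from typing import Any

# Assumed index field names; the real mapping may differ.
SEARCH_FIELDS = ["title", "description", "tags.name"]


def build_search_body(query: str) -> dict[str, Any]:
    """Build a request body that matches the query against several fields."""
    return {
        "query": {
            "simple_query_string": {
                "query": query,
                "fields": SEARCH_FIELDS,
            }
        }
    }
```

With a body like this, a document can match on `description` alone, which is how a result can appear even though nothing visible in the UI contains the search term.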

zackkrida commented 1 year ago

> We don't surface the description on the frontend, however. Currently this feature is pretty obscure.

Yes, and this has proven quite confusing for users. We've received a few reports and comments asking "Why am I being shown this result?" because the matched text was in the description field, which is not shown in our UI.

obulat commented 6 months ago

@zackkrida, do you still think we should surface the descriptions in the API? This could probably be achieved by adding a description property to the media serializer, populated from the meta_data.description property in the database.

sarayourfriend commented 6 months ago

We have a rather significant issue related to cultural sensitivity that would depend on that: #2594

What's the reason not to include descriptions, at least in the single-results view for a work? I could see omitting them from search if we are worried about the potential for them to cause significantly increased data transfer for search (some image descriptions could be longer than all the other metadata for a given work combined, especially from Flickr blog-post style images).

Even if something doesn't have a description, it can just be null in that case.
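The idea of exposing the description only in the single-result view, and as null when absent, could be sketched roughly as follows. This uses plain dictionaries rather than the project's actual serializer classes; `serialize_media` and the record shape are hypothetical, and only the `meta_data` description lookup comes from the discussion above.

```python
from __future__ import annotations

from typing import Any


def serialize_media(record: dict[str, Any], *, detail: bool) -> dict[str, Any]:
    """Serialize a media record, adding the description only in detail views."""
    payload: dict[str, Any] = {
        "id": record["id"],
        "title": record.get("title"),
    }
    if detail:
        # meta_data may be missing or None on older records.
        meta_data = record.get("meta_data") or {}
        # A missing description becomes None, which JSON-encodes as null.
        payload["description"] = meta_data.get("description")
    return payload
```

Search responses would call this with `detail=False`, so long descriptions never inflate result pages, while the single-result endpoint would pass `detail=True`.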

That's a separate question from the title of this issue, "Not sure if we have descriptions for each media item". Can we move the discussion to the other issue, or to a new issue specifically about surfacing the description in some part of the API, rather than continuing it here?