WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License

Create `ContentProvider` API model for providers and `providers` endpoint #4825

Open obulat opened 3 weeks ago

obulat commented 3 weeks ago

Problem

We currently surface the media sources in the API at https://api.openverse.org/v1/images/stats (and a similar URL for audio). However, we don't surface the providers. As a result, we cannot show links correctly in the frontend or get a good overview of the media data providers.

Description

Create a new ContentProvider model similar to ContentSource. Since the ContentSource uses the content_provider database table, the new table can be called provider. Create a new /providers endpoint, the corresponding serializer and view. Update the elasticsearch function to aggregate the providers.
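As a rough illustration of the last step, here is a minimal sketch of an Elasticsearch terms aggregation that counts documents per distinct value of a field. The helper names (`build_stats_aggregation`, `parse_buckets`), the bucket limit, and the assumption that the index exposes `provider` and `source` as aggregatable fields are all hypothetical; this is not the existing Openverse function.

```python
def aggregation_name(field: str) -> str:
    """Name of the aggregation for a field, e.g. 'unique_providers'."""
    return f"unique_{field}s"


def build_stats_aggregation(field: str, max_buckets: int = 1000) -> dict:
    """Build a search body that counts documents per distinct value of `field`."""
    return {
        "size": 0,  # only aggregation buckets are needed, not search hits
        "aggs": {
            aggregation_name(field): {
                "terms": {"field": field, "size": max_buckets},
            }
        },
    }


def parse_buckets(response: dict, field: str) -> dict:
    """Turn the aggregation buckets of a response into {value: doc_count}."""
    buckets = response["aggregations"][aggregation_name(field)]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}
```

Under these assumptions, the same helper would serve both the existing per-source stats (`field="source"`) and a per-provider variant (`field="provider"`).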

Additional context

Originally suggested in a PR comment. This feature will help fix "Unexpected url from Provider link in Single result view".

We should not add any relations to provider/source models yet (as was suggested by @sarayourfriend in one of the comments to #4238) because they could be complicated. Some providers could be sources on different providers (e.g., if we add a Wellcome Collection API script, it will be its own provider, while now it's also a source at some other provider).

sarayourfriend commented 3 weeks ago

> Create a new ContentProvider model similar to ContentSource. Since the ContentSource uses the content_provider database table, the new table can be called provider.

Oh dear, these names are getting messy :sweat_smile: Luckily, the existing content_provider table should be very easy to move to a new name with zero downtime, because of its small size. @obulat are you willing to consider that approach for this issue?

> Some providers could be sources on different providers (e.g., if we add a Wellcome Collection API script, it will be its own provider, while now it's also a source at some other provider).

This is true for Auckland Museum already, to give another example. We ingest it directly but its works are also in Wikimedia. I don't think we have a source for them, though, so it might be slightly different than what's going on with Wellcome.

sarayourfriend commented 3 weeks ago

@obulat can you clarify what particular problem this solves for the frontend, and which frontend links are affected? My impression was that all the source/provider links on single result pages now link to the source collection. The source collection page links to the source, not the provider, so I couldn't think of a place in the frontend where we needed to link to the provider specifically, rather than the source.

> This does not allow us to show links correctly in the frontend, or have a good overview of the media data providers.

I thought all providers are also themselves sources, so would be present in the stats endpoints. Is that not the case for all providers?

obulat commented 3 weeks ago

> Oh dear, these names are getting messy 😅 Luckily, the existing content_provider table should be very easy to move to a new name with zero downtime, because of its small size. @obulat are you willing to consider that approach for this issue?

I would definitely prefer this approach! Would it be zero downtime because the table rename would be almost instant? I would also love to rename that table's fields because they still have provider in them.

obulat commented 3 weeks ago

> I thought all providers are also themselves sources, so would be present in the stats endpoints. Is that not the case for all providers?

The museum consortiums can be providers, but not sources. Examples are the Finnish museums and the Smithsonian museum. Here, you see that there is a link to the source but no link to the provider: from Smithsonian and from Finnish Museums. I think Europeana should also be such a provider, because all of its items have distinct sources, and no media is hosted on Europeana itself.

We could, of course, hard-code the provider URL on the frontend since there are only 3 such examples, as a hotfix for the frontend.

> I thought all providers are also themselves sources, so would be present in the stats endpoints. Is that not the case for all providers?

By the way, this also slightly skews our view of the stats. For example, the Flickr stats refer only to the numbers from Flickr itself; NASA images (which also have Flickr as their provider) are not counted as Flickr items.

sarayourfriend commented 3 weeks ago

> Would it be zero downtime because the table rename would be almost instant?

To make it zero downtime we need to follow the approach for zero-downtime database migrations laid out in this documentation. While you are correct that a table rename would happen very quickly, the issue for zero-downtime is not necessarily about the length of time it takes for the data migration to run. Instead, it is that every deployment has two versions of the application running. If we renamed the table without following the process above, the previous version of the application would still try to reference the table by the old name. It would do so for any request it handles between the new canary starting and running the migration, and the old task actually being replaced by the new tasks. Presently, that delay can be as long as 20 minutes for any individual task, but even if it were 1 second, it would still be a problem.

It will take multiple PRs to do a table rename process. The documentation I linked uses a column rename as the example, but a table rename follows a nearly identical process.
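To make the two-versions problem concrete, here is a small self-contained sketch using an in-memory SQLite database as a stand-in for the production database. It shows that code still using the old table name fails immediately after a bare rename, and that a read-only compatibility view under the old name is one general way to keep old readers working. This is an illustration only and not necessarily the process the linked documentation prescribes; the new table name `content_source` and the column names are assumptions.

```python
import sqlite3

OLD_TABLE = "content_provider"  # current table name, as mentioned in this thread
NEW_TABLE = "content_source"    # hypothetical new name, for illustration only


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the table under its old name with one example row."""
    conn.execute(
        f"CREATE TABLE {OLD_TABLE} "
        "(provider_identifier TEXT PRIMARY KEY, provider_name TEXT)"
    )
    conn.execute(f"INSERT INTO {OLD_TABLE} VALUES ('flickr', 'Flickr')")


def read_identifiers(conn: sqlite3.Connection, table: str) -> list:
    """Read all identifiers the way an application version bound to `table` would."""
    rows = conn.execute(f"SELECT provider_identifier FROM {table}").fetchall()
    return [row[0] for row in rows]


def old_version_can_read(conn: sqlite3.Connection) -> bool:
    """Simulate a request handled by the previous application version."""
    try:
        read_identifiers(conn, OLD_TABLE)
        return True
    except sqlite3.OperationalError:  # raised when the table no longer exists
        return False


def rename_table(conn: sqlite3.Connection) -> None:
    """The naive single-step migration: rename and nothing else."""
    conn.execute(f"ALTER TABLE {OLD_TABLE} RENAME TO {NEW_TABLE}")


def add_compat_view(conn: sqlite3.Connection) -> None:
    """Expose the renamed table under the old name for read-only access."""
    conn.execute(f"CREATE VIEW {OLD_TABLE} AS SELECT * FROM {NEW_TABLE}")
```

Calling `create_schema`, then `rename_table`, makes `old_version_can_read` return `False` until `add_compat_view` runs. The view in this sketch only covers reads; writes from the old version would need additional handling.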

Renaming the fields would add complexity. I'm not confident about this guess, but it might be possible to do it simultaneously with the table rename, albeit with more complex individual PRs.

> The museum consortiums can be providers, but not sources. Examples are the Finnish museums and the Smithsonian museum. Here, you see that there is a link to the source but no link to the provider: from Smithsonian and from Finnish Museums.

I see. On those pages, what would the change be if we had provider-specific information? Is the goal to link to the provider in the "Image information" section? Just trying to make sure the reason for a new provider API is clear.

If we add this new Provider table, would each instance be per-media type like ContentSource? It struck me as interesting that we treat Wikimedia the way we do, creating a new provider for additional media types by appending the media type (as in wikimedia_audio). Curiously, the ContentSource being per-media-type would theoretically not require that distinction, but should we carry the media type column forward to ProviderSource for consistency?

> By the way, this also slightly skews our view of the stats. For example, the Flickr stats refer only to the numbers from Flickr itself; NASA images (which also have Flickr as their provider) are not counted as Flickr items.

Can you clarify in what sense it skews the stats? Wouldn't it theoretically "double up" the count of works like those from NASA, if they were included in the Flickr source? Does this come from the continued difficulty of communicating the difference between source and provider? For the new provider API endpoint, would we display per-provider media counts, like we do for sources, or just simple provider metadata?
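The difference between the two ways of counting can be sketched with a few made-up records. The data below is invented purely for illustration (only the Flickr/NASA relationship comes from this thread, and `example_museum` is a placeholder), and `count_by` is a hypothetical helper.

```python
from collections import Counter

# Invented toy records; each work has exactly one provider and one source.
RECORDS = [
    {"provider": "flickr", "source": "flickr"},
    {"provider": "flickr", "source": "flickr"},
    {"provider": "flickr", "source": "nasa"},
    {"provider": "smithsonian", "source": "example_museum"},
]


def count_by(records: list, key: str) -> Counter:
    """Count records grouped by the given key ('provider' or 'source')."""
    return Counter(record[key] for record in records)


by_source = count_by(RECORDS, "source")      # flickr: 2, nasa: 1, example_museum: 1
by_provider = count_by(RECORDS, "provider")  # flickr: 3, smithsonian: 1
```

Each grouping sums to the total number of works on its own, so nothing is double counted as long as the per-source and per-provider numbers are reported separately. A work would only be counted twice if the NASA record were also added to the Flickr source count.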

If it's just metadata like links and names, another idea could be to have that as a JSON file in the repository, and make it importable via a combined Python and JS package, instead of needing to add new APIs (which we'd have to support). If there continues to be difficulty in communicating the precise difference between provider and source, it might make things even more confusing if we had a public API to retrieve a directory of the providers. If we only need it for the frontend, then we could even just store it as JSON there, and not involve the API at all? Just thinking of alternatives, in case a provider-specific API endpoint only has a single intended use case.
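A minimal sketch of what the Python side of such a JSON directory might look like follows. The identifiers, display names, and `example.org` URLs are placeholders, and `load_providers` and `provider_url` are hypothetical helpers; in practice the JSON would live in a shared file rather than an inline string.

```python
import json
from typing import Optional

# Placeholder content standing in for a shared providers.json file.
PROVIDERS_JSON = """
{
  "smithsonian": {"name": "Smithsonian", "url": "https://example.org/smithsonian"},
  "finnish_museums": {"name": "Finnish Museums", "url": "https://example.org/finnish"},
  "europeana": {"name": "Europeana", "url": "https://example.org/europeana"}
}
"""


def load_providers(raw: str = PROVIDERS_JSON) -> dict:
    """Parse the provider directory into {identifier: metadata}."""
    return json.loads(raw)


def provider_url(identifier: str, providers: Optional[dict] = None) -> Optional[str]:
    """Return the provider's landing page URL, or None if it is not listed."""
    directory = providers if providers is not None else load_providers()
    entry = directory.get(identifier)
    return entry["url"] if entry else None
```

A JS consumer could import the same JSON file directly, so both sides would read from one source of truth without a new API endpoint.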