WordPress / openverse-api

The Openverse API allows programmatic access to search for CC-licensed and public domain digital media.
https://api.openverse.engineering/v1
MIT License
77 stars 51 forks source link

Count provider occurrences in executed query results #1084

Closed sarayourfriend closed 1 year ago

sarayourfriend commented 1 year ago

Problem

We do not currently have any visibility into the number of times a provider shows up in search queries or how many queries they show up in period.

When considering the concerns @AetherUnbound brought up in this Make post about iNaturalist, it became clear that we need to have visibility into these metrics so that we know how dramatic changes to existing providers and the addition of new providers is actually reflected in real user searches.

Description

In the search_controller.search method, before returning the search results, create a tally of the providers for each result present in the page being returned:

provider_occurrences = defaultdict(int)
for result in results:
    provider_occurences[result["provider"]] += 1

for provider, occurrences in provider_occurrences:
    redis.incr(f"provider_occurrences:{week_timestamp}:{provider}", occurrences)
    redis.incr(f"provider_appeared_in_searches:{week_timestamp}:{provider}", 1)

This will allow us to see (a) how many searches each provider actually shows up in over the course of a week and (b) how many results from each provider were shown in search. If we can collect this data before iNaturalist data is fully available in the API, then we'll be able to directly see the effect of adding iNaturalist, both in the count of how many iNaturalist results start to show up and in and potential decreases in result counts for other providers.

Alternatives

Store the entire list of result ids returned for all queries in a time series. This would allow us to do more different kinds of analysis long term but would take way longer for us to get usable results from for the purposes of measuring the impact of iNaturalist on the actual user query results.

rwidom commented 1 year ago

This may be obvious / already assumed, but it would be great to save the search terms and maybe the page number of the results here as well. Yes? And where would the info get stored? I could imagine stashing docs on s3 with that info as a quick solution, or setting up a new postgres table? I'm totally ignorant about the API, so please forgive me if these questions are silly in any way.

sarayourfriend commented 1 year ago

My plan was to store the tallies naively in Redis as described in the issue description, mostly because it is fast and simple to implement. Both seem important to me to avoid delaying turning on the iNaturalist DAG for very long.

I thought I remembered an issue existing in this repository to store queries and their results in Redis, but I can't find it now. It might have been #19, but I recall one with more details about what to store.

That being said, I'm happy to store whatever the folks with more data analysis skills than me think would be useful. I can definitely see how raw result position and page number would be useful information. If iNaturalist is dominating pages 10 – 200 maybe we care less as long as the first 3 or 4 pages are still as diverse.

@rwidom, can you suggest a data model for storing this in Postgres? Should we store individual searches with queries as a JSON blob and the results as a Postgres array of IDs, along with the page number? Page size and result position can be derived from the result's position in the array plus the length of the array multiplied by the page count. It also occurred to me that we could store each search result as a separate row with the overall position with a foreign key to the query stored in a separate table. Not sure which is more useful or easy to query.

One more complication, even if we store the individual searches, the API is cached pretty heavily in Cloudflare and repeated queries for the same terms will just used the cached response, so we wouldn't necessarily see the total frequency at which people are seeing iNaturalist, just how often it shows in queries overall. i.e., assume that for any unique query, there'd likely only be a single entry in a table storing the executed queries.

My primary goal though is to find a solution that we can implement quickly and easily throw out or iterate on. A table of queries and results would grow pretty quickly and be hard to change in the future, so I'd consider it something worth spending more time on up-front to get right.

If you can share more details about how you envision using S3 for this for a fast solution, I'd be grateful. I hadn't considered how we could use it for this.

AetherUnbound commented 1 year ago

Thank you for jumping on this @sarayourfriend, and for your thoughts @rwidom! I do think storing the preliminary information suggested in the issue description is a great start, and will at least give us a sense for how frequently iNaturalist shows up before & after the ingestion occurs. For now, I think this is sufficient - we could even avoid adding these statistics beyond page 3 or 4, as you mention @sarayourfriend. I do think having more information about our searches & provider diversity would be helpful, but for ensuring that iNaturalist doesn't flood all search results after we ingest it I believe that storing this information in Redis and checking it before/after ingestion will give us the impression we need!

I think we might want an RFC similar to the frontend event tracking for deciding which metrics and how to store the data related to searches themselves, and I don't think we'd want to block the iNaturalist ingestion for that effort. Looking at #1088, I think we'd want to avoid tracking searches in Postgres at this time so we don't have to worry about a migration for the API deployment.

sarayourfriend commented 1 year ago

I like the suggestion to restrict this data gathering to the top few pages. That will probably target the most relevant cases for this particular study.

Agreed to not block on a wider search metrics project.

Thanks for the input!

sarayourfriend commented 1 year ago

I've reverted the Postgres approach in the PR and switched it back to the Redis approach.

I'll add unit tests and then undraft the PR so we can get this in ASAP and avoid too many delays to iNaturalist getting into the API fully.