WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License

Cache oEmbed endpoint's derived image metadata (image dimensions) #2459

Open sarayourfriend opened 1 year ago

sarayourfriend commented 1 year ago

Problem

The embed endpoint retrieves image height and width if they do not exist on the record:

https://github.com/WordPress/openverse/blob/HEAD/api/api/views/image_views.py#L91-L97

There are two problems:

  1. The endpoint does not cache the result, so it must re-fetch the image on every request
  2. This is valuable data that should absolutely be backfilled to the catalog when available for a given result

Description

Luckily, we can address both of these with one fix: cache the result in Redis. We'll need further work to backfill the data into the catalog (probably as part of https://github.com/WordPress/openverse/issues/420), but if we cache this data in Redis with a stable key, we can prevent unnecessary duplicate requests and set that other project up for success.

Cache these with a key along the following lines:

`imgdims:{identifier.replace("-", "")}`

Ensure the identifier has hyphens removed to optimise the Redis key size. See #2400 for more details.

Save the image dimensions to this cache as a comma-separated string: `{width},{height}`. We don't need a hashmap here, and a string will be slightly smaller long-term.

Before requesting the image in the endpoint, check the cache first and reuse the dimensions from there if they exist.
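The flow described above could be sketched roughly like this (not the actual Openverse implementation; `fetch_dimensions` is a hypothetical stand-in for the image request the endpoint currently makes, and `cache` is any Redis-like client with `get`/`set`):

```python
def dimensions_cache_key(identifier: str) -> str:
    # Strip hyphens from the UUID to keep the Redis key small (see #2400).
    return f"imgdims:{identifier.replace('-', '')}"


def get_cached_dimensions(cache, identifier: str, fetch_dimensions):
    """Return (width, height), consulting the cache before fetching.

    `cache` is any client exposing get/set (e.g. a redis-py instance);
    `fetch_dimensions` performs the real image request and returns
    a (width, height) tuple.
    """
    key = dimensions_cache_key(identifier)
    cached = cache.get(key)
    if cached:
        # Stored as a comma-separated string: "{width},{height}".
        # redis-py returns bytes by default, so decode if needed.
        if isinstance(cached, bytes):
            cached = cached.decode()
        width, height = cached.split(",")
        return int(width), int(height)

    width, height = fetch_dimensions(identifier)
    cache.set(key, f"{width},{height}")
    return width, height
```

With this shape, the second lookup for the same identifier never re-requests the image, and the cached `{width},{height}` strings are trivially readable by a later catalog-backfill job.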

Additional context

Related to https://github.com/WordPress/openverse/issues/1486 and https://github.com/WordPress/openverse/issues/420

@stacimc Do you know if this would be fundamentally unnecessary if #1486 is implemented? Would the result of that work result in all images having dimensions in the catalog immediately upon ingestion regardless of provider?

stacimc commented 1 year ago

> @stacimc Do you know if this would be fundamentally unnecessary if https://github.com/WordPress/openverse/issues/1486 is implemented? Would the result of that work result in all images having dimensions in the catalog immediately upon ingestion regardless of provider?

For #1486 we would populate dimensions in the catalog at ingestion for all providers, yes. Once implemented, we'd immediately get image dimensions for newly ingested data and for non-dated DAGs, but we'd need to backfill records for dated providers. Luckily, I think there's only Metropolitan to worry about so that ought to be very quick with the new batched_update DAG :)