WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License

Audit provider scripts to collect `filetype` and `filesize` #1545

Closed obulat closed 1 year ago

obulat commented 2 years ago

Current Situation

Currently, many image provider scripts do not collect the `filetype` and `filesize` information. Collecting it can improve frontend performance and make Openverse friendlier to providers, because no HEAD request is needed for each image that lacks this information.
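As a rough illustration of what collecting this data in a provider script could look like, here is a minimal sketch. It assumes the provider API returns an image URL and, optionally, a size in bytes; the function names, the `size` key, and the list of extensions are hypothetical and are not taken from the Openverse catalog code.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical set of extensions; not Openverse's actual list.
KNOWN_IMAGE_TYPES = {"jpg", "jpeg", "png", "gif", "svg", "webp", "tif", "tiff"}
# Normalize common spelling variants to one canonical value.
ALIASES = {"jpeg": "jpg", "tif": "tiff"}


def filetype_from_url(url):
    """Guess the filetype from the URL path, ignoring any query string."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower().lstrip(".")
    if suffix not in KNOWN_IMAGE_TYPES:
        return None
    return ALIASES.get(suffix, suffix)


def build_record(api_item):
    """Build a record from a provider API item (assumed keys: 'url', 'size')."""
    url = api_item["url"]
    size = api_item.get("size")
    return {
        "url": url,
        "filetype": filetype_from_url(url),
        # Keep None when the provider does not report a size.
        "filesize": int(size) if size is not None else None,
    }
```

When the provider response already includes the size, no extra request per image is needed; records whose `filetype` or `filesize` stays `None` are the ones that would still need another source.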

Suggested Improvement

Image DAGs and what data they collect (the list is updated when the PRs are created or merged):

Temporarily disabled DAGs that will need to be fixed later:

Scripts that were already collecting filetype and filesize data:

Then, separately, we'd need to write a script to backfill all existing records. Finally, we would need a solution to collect the filetype and filesize for images whose provider scripts do not provide the data.
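One possible fallback for images whose provider does not supply the data is a HEAD request that reads the `Content-Type` and `Content-Length` headers. The sketch below uses only the Python standard library; the function names are hypothetical, and this is not necessarily the approach Openverse adopted.

```python
import urllib.request


def parse_head_headers(headers):
    """Extract (filetype, filesize) from a mapping of HTTP response headers."""
    normalized = {key.lower(): value for key, value in headers.items()}

    # "image/jpeg; charset=binary" -> "image/jpeg"
    mime = normalized.get("content-type", "").split(";")[0].strip().lower()
    # Only image MIME types yield a filetype; the subtype is returned as-is,
    # so "image/svg+xml" gives "svg+xml" and may need further normalization.
    filetype = (mime.split("/", 1)[1] or None) if mime.startswith("image/") else None

    length = normalized.get("content-length")
    filesize = int(length) if length is not None and length.strip().isdigit() else None
    return filetype, filesize


def fetch_filetype_and_filesize(url, timeout=10):
    """Send a HEAD request and parse the headers; raises on network errors."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_head_headers(dict(response.headers.items()))
```

Servers are not required to send `Content-Length`, so either value can still come back as `None`, and a backfill built on this would need rate limiting to stay friendly to providers.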

obulat commented 1 year ago

Auditing the provider scripts is done. Some scripts have been updated to collect available information.

For the providers that cannot be updated right away, I added comments to the follow-up issues that track the related image dimension problems: https://github.com/WordPress/openverse/issues/1486 https://github.com/WordPress/openverse-catalog/issues/647#issuecomment-1224300654 WordPress/openverse#1484