Open obulat opened 2 years ago
A note on data refreshes & normalization that @obulat brought up: We should continue performing full data refreshes in dev until we are confident in our data normalization. Until we get everything normalized, we may continue to find issues in production that can't be replicated in staging unless we refresh the catalog in its entirety.
I've also made https://github.com/WordPress/openverse-infrastructure/issues/120 to track this.
Problem
This is a meta issue to track all the data model normalization work across all the repositories. See all open issues from this meta issue; you can also track the progress using the GitHub Project view.
Some data in the database was ingested a long time ago, when we had a different set of required fields. This makes consuming the data difficult because fields that are marked as required can be missing in the database. We need to make sure that we have up-to-date data models across the stack, and that the data in the database conforms to them.
Description
To establish trust in our data, we need to make sure that we clearly describe what data we have, and to check that the database actually has all the data outlined. Also, we should remove the duplication of data classification/data cleaning between the Catalog and the API layers.
Here are the specific fields we should normalize:
All media
These fields are common to all media; however, some fields have NULL values only in images, not in audio.
URL
License URL
Add license_url to the meta_data JSONB field in the database in the catalog. This field can be computed based on the license and license_version fields. We can run a SQL query or a one-off Python script to backfill it.
license_url after the data has been backfilled in the Catalog database.
Set license_url as a required field in the frontend types.
Watermarked
Set watermarked to false in all images where it's NULL (in images only).
Last synced with source
Set last_synced_with_source to the value of updated_on, if available, or to created_on (in images only).
Mature (new column)
Description (new column)
Image
Thumbnail
Filetype
563 004 660 images
Category
563 622 992 images
Width & height
12 571 694 images
width and height and update them.
Backfill width and height values for images that don't have them (probably in the same process as the filesize update).
width and height values are returned. This will also improve the size and aspect ratio filters.
Filesize
561 894 897 images
filesize and update them.
Backfill filesize values for images that don't have them (probably in the same process as the width and height update).
Tags
Improvements
More investigation needed
Additional context
Updates that can be done with existing data:
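As an illustration of updates that need only existing data, here is a minimal sketch of the watermarked, last_synced_with_source and license_url backfills described above. It makes several assumptions: it runs against an in-memory SQLite table as a stand-in for the catalog's PostgreSQL image table, so meta_data is stored as JSON text rather than JSONB and false is stored as 0. The column list is a hypothetical subset, only rows where the value is NULL or absent are touched, and the license slugs (by, by-sa, cc0, pdm, ...) are assumed to match the path segments of Creative Commons URLs. Jurisdiction-specific license variants are not handled.

```python
import json
import sqlite3

# Assumption: these slugs map to Creative Commons public-domain tools;
# every other slug is treated as a licenses/<slug> path segment.
PUBLIC_DOMAIN_PATHS = {"cc0": "publicdomain/zero", "pdm": "publicdomain/mark"}


def build_license_url(license_slug, license_version):
    """Compute a Creative Commons URL from license and license_version."""
    slug = license_slug.lower()
    path = PUBLIC_DOMAIN_PATHS.get(slug, f"licenses/{slug}")
    return f"https://creativecommons.org/{path}/{license_version}/"


def backfill_from_existing_data(conn):
    """Fill NULL values using only columns that are already in the table."""
    # Watermarked: NULL becomes false (0 in SQLite).
    conn.execute("UPDATE image SET watermarked = 0 WHERE watermarked IS NULL")
    # Last synced with source: prefer updated_on, fall back to created_on.
    conn.execute(
        "UPDATE image "
        "SET last_synced_with_source = COALESCE(updated_on, created_on) "
        "WHERE last_synced_with_source IS NULL"
    )
    # License URL: add the computed URL to meta_data when it is absent.
    rows = conn.execute(
        "SELECT identifier, license, license_version, meta_data FROM image"
    ).fetchall()
    for identifier, license_slug, version, raw_meta in rows:
        meta = json.loads(raw_meta) if raw_meta else {}
        if "license_url" not in meta:
            meta["license_url"] = build_license_url(license_slug, version)
            conn.execute(
                "UPDATE image SET meta_data = ? WHERE identifier = ?",
                (json.dumps(meta), identifier),
            )
    conn.commit()


def demo():
    """Run the backfill against two hypothetical rows and return them."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE image (identifier TEXT, license TEXT, "
        "license_version TEXT, meta_data TEXT, watermarked INTEGER, "
        "created_on TEXT, updated_on TEXT, last_synced_with_source TEXT)"
    )
    conn.executemany(
        "INSERT INTO image VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        [
            ("a", "by", "4.0", None, None, "2020-01-01", "2021-06-01", None),
            ("b", "cc0", "1.0", '{"foo": 1}', 1, "2019-05-05", None, None),
        ],
    )
    backfill_from_existing_data(conn)
    return conn.execute(
        "SELECT identifier, watermarked, last_synced_with_source, meta_data "
        "FROM image ORDER BY identifier"
    ).fetchall()
```

Calling demo() returns both rows with watermarked, last_synced_with_source and meta_data filled in. Against PostgreSQL, the license_url step could probably be a single UPDATE that merges a small JSON object into meta_data with the JSONB `||` operator, which would avoid the row-by-row loop; that variant is not shown here.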
Updates that will require additional fetching from providers:
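For the width, height and filesize backfills, which need the image file itself, the sketch below shows how all three values could be derived in one pass once the bytes have been downloaded. It is only an illustration under narrow assumptions: it handles PNG files only, by reading the IHDR header, and the function name inspect_png is hypothetical. A real process would likely use an imaging library to cover other formats and would need its own download and error handling, neither of which is shown.

```python
import struct

# The fixed 8-byte signature that starts every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def inspect_png(data):
    """Return filesize, width and height for PNG bytes already downloaded."""
    # Bytes 8-11 hold the IHDR chunk length, bytes 12-15 the chunk type,
    # and bytes 16-23 the width and height as big-endian unsigned integers.
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    width, height = struct.unpack(">II", data[16:24])
    return {"filesize": len(data), "width": width, "height": height}
```

Because filesize is just the length of the downloaded bytes, computing it alongside the dimensions costs nothing extra, which fits the suggestion above to run these updates in the same process.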
Message from @AetherUnbound with details from the database (from the public Slack discussion): Here is a count of NULL values for all fields that don't have a NOT NULL constraint. Unfortunately, this doesn't give us information on license_url, if that's supposed to come from the meta_data field.
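A count like the one mentioned above can be produced with a single aggregate query per table. The following is a minimal sketch, not the query that was actually run: it uses an in-memory SQLite table with a hypothetical subset of columns, and it finds nullable columns with SQLite's PRAGMA table_info. In PostgreSQL the nullable columns would instead come from information_schema.columns (the is_nullable column).

```python
import sqlite3


def nullable_columns(conn, table):
    """List columns without a NOT NULL constraint (SQLite-specific)."""
    info = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    return [name for _, name, _, notnull, _, _ in info if not notnull]


def count_nulls(conn, table):
    """Return {column: number of NULL rows} for every nullable column."""
    columns = nullable_columns(conn, table)
    if not columns:
        return {}
    # Table and column names are interpolated directly, so this must only
    # be called with trusted identifiers.
    sums = ", ".join(
        f"SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END)" for column in columns
    )
    row = conn.execute(f"SELECT {sums} FROM {table}").fetchone()
    # SUM returns NULL for an empty table, so fall back to 0.
    return {column: count or 0 for column, count in zip(columns, row)}


def demo():
    """Count NULLs in three hypothetical image rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE image (identifier TEXT NOT NULL, filetype TEXT, width INTEGER)"
    )
    conn.executemany(
        "INSERT INTO image VALUES (?, ?, ?)",
        [("a", None, 100), ("b", "jpg", None), ("c", None, None)],
    )
    return count_nulls(conn, "image")
```

As the message notes, a per-column count of this kind cannot see values nested inside meta_data, so license_url would need a separate check on that field.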