Normalize data models

Problem

This is a meta issue to track the data model normalization work across all the repositories. The open issues that belong to this meta issue are listed below. You can also track the progress using the GitHub Project view. Some of the data in the database was ingested a long time ago, when we had a different set of required fields. This makes the data difficult to consume, because fields that are marked as required can be missing from the database. We need to make sure that we have up-to-date data models across the stack, and that the data in the database conforms to them.

Description

To establish trust in our data, we need to make sure that we clearly describe what data we have, and to check that the database actually has all the data outlined. Also, we should remove the duplication of data classification/data cleaning between the Catalog and the API layers.

Here are the specific fields we should normalize:

All media

These fields are common to all media; however, some of them have NULL values only in images, not in audio.

[ ] WordPress/openverse#1410

URL

[ ] WordPress/openverse#1409

License URL

[ ] https://github.com/WordPress/openverse/issues/1565 Add license_url to meta_data JSONB field in the database in the catalog. This field can be computed based on the license and license_version fields. We can run a SQL query or a one-off Python script to backfill it.
[ ] https://github.com/WordPress/openverse/issues/703 Remove any code in the API that computes license_url after the data has been backfilled in the Catalog database.
[ ] https://github.com/WordPress/openverse/issues/552 Set license_url as a required field in the frontend types.
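The backfill described in the first task above could look roughly like the following. This is a minimal sketch, not the catalog's actual implementation: the function names, the `cc0` and `pdm` slugs, and the assumption that the `license` column stores lowercase slugs such as `by` or `by-sa` are all illustrative. The URL layout follows the public Creative Commons site.

```python
from typing import Optional

# Base URL of the Creative Commons site; the path layout below is an assumption
# about how license slugs and versions map to license pages.
CC_BASE = "https://creativecommons.org"


def build_license_url(license_slug: str, license_version: Optional[str]) -> Optional[str]:
    """Compute a license URL from the license and license_version values."""
    slug = license_slug.strip().lower()
    # Public domain tools live under /publicdomain/ and have a single version.
    if slug == "cc0":
        return f"{CC_BASE}/publicdomain/zero/1.0/"
    if slug == "pdm":
        return f"{CC_BASE}/publicdomain/mark/1.0/"
    if not license_version:
        # Without a version we cannot build a reliable URL.
        return None
    return f"{CC_BASE}/licenses/{slug}/{license_version}/"


def backfill_meta_data(row: dict) -> dict:
    """Return a copy of the row's meta_data with license_url filled in if missing."""
    # meta_data can be NULL in the database, so fall back to an empty dict.
    meta = dict(row.get("meta_data") or {})
    if "license_url" not in meta:
        url = build_license_url(row["license"], row.get("license_version"))
        if url is not None:
            meta["license_url"] = url
    return meta
```

Rows that already carry a `license_url` in `meta_data` are left untouched, so the script can be re-run safely.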

Watermarked

[ ] https://github.com/WordPress/openverse/issues/1563 Set the watermarked property to false in all images where it's NULL (in images only)

Last synced with source

[ ] https://github.com/WordPress/openverse/issues/1562 Set last_synced_with_source to the value of updated_on, if available, or to created_on (in images only)
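The fallback order in this task can be sketched as a small function. The function name is hypothetical; the logic mirrors what a SQL `COALESCE(last_synced_with_source, updated_on, created_on)` would do.

```python
from datetime import datetime
from typing import Optional


def resolve_last_synced(
    last_synced_with_source: Optional[datetime],
    updated_on: Optional[datetime],
    created_on: Optional[datetime],
) -> Optional[datetime]:
    """Pick the first non-NULL timestamp, in the order described in the task."""
    if last_synced_with_source is not None:
        # Existing values are kept as they are.
        return last_synced_with_source
    if updated_on is not None:
        return updated_on
    return created_on
```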

Mature (new column)

[ ] WordPress/openverse#1754 Save mature info from the origin

Description (new column)

[ ] WordPress/openverse#1656

Image

Thumbnail

[x] https://github.com/WordPress/openverse/issues/1561 Remove the image thumbnail field from the catalog and from the provider scripts because we do not use provider thumbnails (we use the imaginary proxy server for image thumbnails instead).

Filetype

563 004 660 images have a NULL filetype.

[ ] https://github.com/WordPress/openverse/issues/1560 Find a way of backfilling the image filetype values. It might be possible to compute them from the filename or the URL extension.
[ ] https://github.com/WordPress/openverse/issues/702 Remove filetype computation from API and from ES index creation.
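Computing the filetype from the URL extension, as suggested above, might look like this. It is a sketch under stated assumptions: the set of accepted extensions and the alias table are illustrative, not the list the catalog uses, and URLs without a recognizable extension are left as `None` for a later pass.

```python
import posixpath
from typing import Optional
from urllib.parse import urlparse

# Illustrative values only; the real set of accepted filetypes is an assumption.
KNOWN_IMAGE_TYPES = {"jpg", "png", "gif", "svg", "tif", "webp", "bmp"}
ALIASES = {"jpeg": "jpg", "tiff": "tif"}


def guess_filetype(url: str) -> Optional[str]:
    """Guess an image filetype from the extension in the URL path."""
    # Use only the path, so query strings and fragments are ignored.
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lstrip(".").lower()
    extension = ALIASES.get(extension, extension)
    return extension if extension in KNOWN_IMAGE_TYPES else None
```

For example, `guess_filetype("https://example.com/photos/cat.JPEG?size=large")` returns `"jpg"`, while a URL such as `https://example.com/photos/12345` yields `None` and would still need the filetype from the provider.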

Width & height

12 571 694 images have NULL width and height.

[ ] WordPress/openverse#1551 Prefer the original or largest size available
[ ] https://github.com/WordPress/openverse/issues/1559 Check which provider scripts do not set width and height and update them. Backfill width and height values for images that don't have them (probably in the same process as the filesize update)
[ ] https://github.com/WordPress/openverse/issues/701 Re-index all images to make sure that the width and height values are returned. This will also improve the size and aspect ratio filters.
[ ] https://github.com/WordPress/openverse/issues/551 Remove width/height computation code from the frontend.

Filesize

561 894 897 images have a NULL filesize.

[x] https://github.com/WordPress/openverse-catalog/issues/522 Check which provider scripts do not set filesize and update them. Backfill filesize values for images that don't have them (probably in the same process as the width and height update)

Improvements

[ ] WordPress/openverse#1663
[ ] WordPress/openverse#1546

More investigation needed

[ ] WordPress/openverse#1836
[ ] WordPress/openverse#1451
[ ] https://github.com/WordPress/openverse/issues/1556 There are 1 096 025 images that don't have a title. We should try to understand why those images don't have titles and add the titles if it is possible.
[ ] WordPress/openverse-catalog#782

Additional context

Updates that can be done with existing data:

tags
license_url
watermarked
last_synced_with_source

Updates that will require additional fetching from providers:

filetype
filesize
width
height

Message from @AetherUnbound with details from the database (from the public Slack discussion): Here is a count of NULL values for all fields that don't have a NOT NULL constraint. Unfortunately, this doesn't tell us anything about license_url, if that's supposed to come from the meta_data field.

deploy@localhost:openledger> SELECT
 COUNT(*) as total,
 COUNT(*) FILTER (WHERE ingestion_type IS NULL) as ingestion_type,
 COUNT(*) FILTER (WHERE provider IS NULL) as provider,
 COUNT(*) FILTER (WHERE source IS NULL) as source,
 COUNT(*) FILTER (WHERE foreign_identifier IS NULL) as foreign_identifier,
 COUNT(*) FILTER (WHERE foreign_landing_url IS NULL) as foreign_landing_url,
 COUNT(*) FILTER (WHERE thumbnail IS NULL) as thumbnail,
 COUNT(*) FILTER (WHERE filetype IS NULL) as filetype,
 COUNT(*) FILTER (WHERE duration IS NULL) as duration,
 COUNT(*) FILTER (WHERE bit_rate IS NULL) as bit_rate,
 COUNT(*) FILTER (WHERE sample_rate IS NULL) as sample_rate,
 COUNT(*) FILTER (WHERE category IS NULL) as category,
 COUNT(*) FILTER (WHERE genres IS NULL) as genres,
 COUNT(*) FILTER (WHERE audio_set IS NULL) as audio_set,
 COUNT(*) FILTER (WHERE set_position IS NULL) as set_position,
 COUNT(*) FILTER (WHERE alt_files IS NULL) as alt_files,
 COUNT(*) FILTER (WHERE filesize IS NULL) as filesize,
 COUNT(*) FILTER (WHERE license_version IS NULL) as license_version,
 COUNT(*) FILTER (WHERE creator IS NULL) as creator,
 COUNT(*) FILTER (WHERE creator_url IS NULL) as creator_url,
 COUNT(*) FILTER (WHERE title IS NULL) as title,
 COUNT(*) FILTER (WHERE meta_data IS NULL) as meta_data,
 COUNT(*) FILTER (WHERE tags IS NULL) as tags,
 COUNT(*) FILTER (WHERE watermarked IS NULL) as watermarked,
 COUNT(*) FILTER (WHERE last_synced_with_source IS NULL) as last_synced_with_source
 FROM audio;
-[ RECORD 1 ]-------------------------
total                   | 175858
ingestion_type          | 0
provider                | 0
source                  | 0
foreign_identifier      | 0
foreign_landing_url     | 0
thumbnail               | 86720
filetype                | 0
duration                | 0
bit_rate                | 89223
sample_rate             | 149241
category                | 13844
genres                  | 86720
audio_set               | 34914
set_position            | 86720
alt_files               | 115789
filesize                | 89138
license_version         | 0
creator                 | 10
creator_url             | 118
title                   | 0
meta_data               | 0
tags                    | 30092
watermarked             | 0
last_synced_with_source | 0
SELECT 1
Time: 0.149s

deploy@localhost:openledger> SELECT
 COUNT(*) as total,
 COUNT(*) FILTER (WHERE ingestion_type IS NULL) as ingestion_type,
 COUNT(*) FILTER (WHERE provider IS NULL) as provider,
 COUNT(*) FILTER (WHERE source IS NULL) as source,
 COUNT(*) FILTER (WHERE foreign_identifier IS NULL) as foreign_identifier,
 COUNT(*) FILTER (WHERE foreign_landing_url IS NULL) as foreign_landing_url,
 COUNT(*) FILTER (WHERE thumbnail IS NULL) as thumbnail,
 COUNT(*) FILTER (WHERE width IS NULL) as width,
 COUNT(*) FILTER (WHERE height IS NULL) as height,
 COUNT(*) FILTER (WHERE filesize IS NULL) as filesize,
 COUNT(*) FILTER (WHERE license_version IS NULL) as license_version,
 COUNT(*) FILTER (WHERE creator IS NULL) as creator,
 COUNT(*) FILTER (WHERE creator_url IS NULL) as creator_url,
 COUNT(*) FILTER (WHERE title IS NULL) as title,
 COUNT(*) FILTER (WHERE meta_data IS NULL) as meta_data,
 COUNT(*) FILTER (WHERE tags IS NULL) as tags,
 COUNT(*) FILTER (WHERE watermarked IS NULL) as watermarked,
 COUNT(*) FILTER (WHERE last_synced_with_source IS NULL) as last_synced_with_source,
 COUNT(*) FILTER (WHERE filetype IS NULL) as filetype,
 COUNT(*) FILTER (WHERE category IS NULL) as category
 FROM image;
-[ RECORD 1 ]-------------------------
total                   | 563667181
ingestion_type          | 0
provider                | 0
source                  | 0
foreign_identifier      | 0
foreign_landing_url     | 1
thumbnail               | 57584529
width                   | 12571694
height                  | 12571694
filesize                | 561894897
license_version         | 0
creator                 | 4459805
creator_url             | 22751618
title                   | 1096025
meta_data               | 366974
tags                    | 243751835
watermarked             | 1105608
last_synced_with_source | 554237
filetype                | 563004660
category                | 563622992
SELECT 1
Time: 2480.183s (41 minutes 20 seconds), executed in: 2480.182s (41 minutes 20 seconds)
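To put the image counts above in proportion, the NULL counts can be turned into shares of the total row count. The snippet below is only a convenience sketch: the numbers are copied from the query output above for a handful of fields, and the names `IMAGE_TOTAL`, `IMAGE_NULL_COUNTS`, and `null_share` are made up for this example.

```python
# Values copied from the image query output above.
IMAGE_TOTAL = 563_667_181
IMAGE_NULL_COUNTS = {
    "filetype": 563_004_660,
    "filesize": 561_894_897,
    "tags": 243_751_835,
    "width": 12_571_694,
    "height": 12_571_694,
    "title": 1_096_025,
    "watermarked": 1_105_608,
    "last_synced_with_source": 554_237,
}


def null_share(counts: dict, total: int) -> dict:
    """Return the fraction of rows that are NULL for each field."""
    return {field: count / total for field, count in counts.items()}


shares = null_share(IMAGE_NULL_COUNTS, IMAGE_TOTAL)

# Print the fields from most to least affected.
for field, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{field:<24} {share:7.2%}")
```

Nearly every image row lacks filetype and filesize, while width and height are missing for only about 2% of rows.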

WordPress / openverse

Normalize data models #244