WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License
240 stars 194 forks source link

Normalize data models #244

Open obulat opened 2 years ago

obulat commented 2 years ago

Problem

This is a meta issue to track all the data model normalization work across all the repositories. All open issues from this meta issue. You can also track the progress using the GitHub Project view. Some data we have in the database was ingested a long time ago when we had a different set of required fields. This makes consuming the data difficult because the pieces that are set as required can be unavailable in the database. We need to make sure that we have up-to-date data models across the stack, and that our data in the database confirms to them.

Description

To establish trust in our data, we need to make sure that we clearly describe what data we have, and to check that the database actually has all the data outlined. Also, we should remove the duplication of data classification/data cleaning between the Catalog and the API layers.

Here are the specific fields we should normalize:

All media

These fields are common for all media, however some fields only have NULL values in images, not in audio.

URL

License URL

Watermarked

Last synced with source

Mature (new column)

Description (new column)

Image

Thumbnail

Filetype

563 004 660 images

Category

563 622 992 images

Width & height

12 571 694 images

Filesize

561 894 897 images

Tags


Improvements


More investigation needed

Additional context

Updates that can be done with existing data:

Updates that will require additional fetching from providers:

Message from @AetherUnbound with details from the database (from the Public Slack discussion): Here are a count of NULL values for all fields that don't have a NOT NULL constraint. Unfortunately this doesn't give us information on license_url, if that's supposed to come from the meta_data field

deploy@localhost:openledger> SELECT
 COUNT(*) as total,
 COUNT(*) FILTER (WHERE ingestion_type IS NULL) as ingestion_type,
 COUNT(*) FILTER (WHERE provider IS NULL) as provider,
 COUNT(*) FILTER (WHERE source IS NULL) as source,
 COUNT(*) FILTER (WHERE foreign_identifier IS NULL) as foreign_identifier,
 COUNT(*) FILTER (WHERE foreign_landing_url IS NULL) as foreign_landing_url,
 COUNT(*) FILTER (WHERE thumbnail IS NULL) as thumbnail,
 COUNT(*) FILTER (WHERE filetype IS NULL) as filetype,
 COUNT(*) FILTER (WHERE duration IS NULL) as duration,
 COUNT(*) FILTER (WHERE bit_rate IS NULL) as bit_rate,
 COUNT(*) FILTER (WHERE sample_rate IS NULL) as sample_rate,
 COUNT(*) FILTER (WHERE category IS NULL) as category,
 COUNT(*) FILTER (WHERE genres IS NULL) as genres,
 COUNT(*) FILTER (WHERE audio_set IS NULL) as audio_set,
 COUNT(*) FILTER (WHERE set_position IS NULL) as set_position,
 COUNT(*) FILTER (WHERE alt_files IS NULL) as alt_files,
 COUNT(*) FILTER (WHERE filesize IS NULL) as filesize,
 COUNT(*) FILTER (WHERE license_version IS NULL) as license_version,
 COUNT(*) FILTER (WHERE creator IS NULL) as creator,
 COUNT(*) FILTER (WHERE creator_url IS NULL) as creator_url,
 COUNT(*) FILTER (WHERE title IS NULL) as title,
 COUNT(*) FILTER (WHERE meta_data IS NULL) as meta_data,
 COUNT(*) FILTER (WHERE tags IS NULL) as tags,
 COUNT(*) FILTER (WHERE watermarked IS NULL) as watermarked,
 COUNT(*) FILTER (WHERE last_synced_with_source IS NULL) as last_synced_with_source
 FROM audio;
-[ RECORD 1 ]-------------------------
total                   | 175858
ingestion_type          | 0
provider                | 0
source                  | 0
foreign_identifier      | 0
foreign_landing_url     | 0
thumbnail               | 86720
filetype                | 0
duration                | 0
bit_rate                | 89223
sample_rate             | 149241
category                | 13844
genres                  | 86720
audio_set               | 34914
set_position            | 86720
alt_files               | 115789
filesize                | 89138
license_version         | 0
creator                 | 10
creator_url             | 118
title                   | 0
meta_data               | 0
tags                    | 30092
watermarked             | 0
last_synced_with_source | 0
SELECT 1
Time: 0.149s

deploy@localhost:openledger> SELECT
 COUNT(*) as total,
 COUNT(*) FILTER (WHERE ingestion_type IS NULL) as ingestion_type,
 COUNT(*) FILTER (WHERE provider IS NULL) as provider,
 COUNT(*) FILTER (WHERE source IS NULL) as source,
 COUNT(*) FILTER (WHERE foreign_identifier IS NULL) as foreign_identifier,
 COUNT(*) FILTER (WHERE foreign_landing_url IS NULL) as foreign_landing_url,
 COUNT(*) FILTER (WHERE thumbnail IS NULL) as thumbnail,
 COUNT(*) FILTER (WHERE width IS NULL) as width,
 COUNT(*) FILTER (WHERE height IS NULL) as height,
 COUNT(*) FILTER (WHERE filesize IS NULL) as filesize,
 COUNT(*) FILTER (WHERE license_version IS NULL) as license_version,
 COUNT(*) FILTER (WHERE creator IS NULL) as creator,
 COUNT(*) FILTER (WHERE creator_url IS NULL) as creator_url,
 COUNT(*) FILTER (WHERE title IS NULL) as title,
 COUNT(*) FILTER (WHERE meta_data IS NULL) as meta_data,
 COUNT(*) FILTER (WHERE tags IS NULL) as tags,
 COUNT(*) FILTER (WHERE watermarked IS NULL) as watermarked,
 COUNT(*) FILTER (WHERE last_synced_with_source IS NULL) as last_synced_with_source,
 COUNT(*) FILTER (WHERE filetype IS NULL) as filetype,
 COUNT(*) FILTER (WHERE category IS NULL) as category
 FROM image;
-[ RECORD 1 ]-------------------------
total                   | 563667181
ingestion_type          | 0
provider                | 0
source                  | 0
foreign_identifier      | 0
foreign_landing_url     | 1
thumbnail               | 57584529
width                   | 12571694
height                  | 12571694
filesize                | 561894897
license_version         | 0
creator                 | 4459805
creator_url             | 22751618
title                   | 1096025
meta_data               | 366974
tags                    | 243751835
watermarked             | 1105608
last_synced_with_source | 554237
filetype                | 563004660
category                | 563622992
SELECT 1
Time: 2480.183s (41 minutes 20 seconds), executed in: 2480.182s (41 minutes 20 seconds)
AetherUnbound commented 2 years ago

A note on data refreshes & normalization that @obulat brought up: We should continue performing full data refreshes in dev until we are confident in our data normalization. Until we get everything normalized, we may continue to find issues in production that can't be replicated in staging unless we refresh the catalog in its entirety.

I've also made https://github.com/WordPress/openverse-infrastructure/issues/120 to track this