WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License
248 stars 201 forks source link

Implementation Plan: Initial data dump creation #2669

Closed AetherUnbound closed 1 year ago

AetherUnbound commented 1 year ago

Description

This issue is for tracking the drafting of the first implementation plan for #2545

This implementation plan is intended to describe how to produce and provide access to the raw data dumps which will be used to create the Openverse dataset(s). Additionally, this plan should address the marketing and documentation of the initial data dump. Essentially, all facets of the project relating to the initial release.

AetherUnbound commented 1 year ago

After some cursory research, I've come across a few possible approaches:

@WordPress/openverse-maintainers, @apolinario, and @Skylion007: given our API database is a roughly 3TB Postgres RDS instance, do you have any other methods you might suggest for generating the initial dump that I should explore while drafting this?

AetherUnbound commented 1 year ago

Some interesting data from the snapshot-export-to-s3 I just ran:

Table info output from export summary ```json { "perTableStatus": [ { "tableStatistics": { "extractionStartTime": "Jul 19, 2023, 11:21:00 PM", "extractionEndTime": "Jul 19, 2023, 11:21:24 PM", "partitioningInfo": { "numberOfPartitions": 122, "numberOfCompletedPartitions": 122, "constraintType": "CTID", "partitionColumns": [ { "columnType": "TID", "columnName": "ctid", "min": "(0,1)", "max": "(3107,1)" } ] } }, "schemaMetadata": { "originalTypeMappings": [ { "columnName": "created_on", "originalType": "timestamp with time zone", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 6 }, { "columnName": "updated_on", "originalType": "timestamp with time zone", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 6 }, { "columnName": "audio_identifier", "originalType": "uuid", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "waveform_peaks", "originalType": "ARRAY", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 0 } ] }, "status": "COMPLETE", "sizeGB": 3.035544365644455, "target": "openledger.public.api_audioaddon" }, { "tableStatistics": { "extractionStartTime": "Jul 19, 2023, 11:21:00 PM", "extractionEndTime": "Jul 19, 2023, 11:21:17 PM", "partitioningInfo": { "numberOfPartitions": 1, "numberOfCompletedPartitions": 1 } }, "schemaMetadata": { "originalTypeMappings": [ { "columnName": "created_on", "originalType": "timestamp with time zone", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 6 }, { "columnName": "updated_on", "originalType": "timestamp with time zone", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 6 }, { "columnName": "identifier", "originalType": "uuid", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 0 } ] }, "status": "COMPLETE", "sizeGB": 0, "target": "openledger.public.api_deletedaudio" }, { "tableStatistics": { "extractionStartTime": "Jul 19, 2023, 11:21:00 PM", "extractionEndTime": "Jul 19, 2023, 11:21:20 PM", "partitioningInfo": { "numberOfPartitions": 1, "numberOfCompletedPartitions": 1 } }, "schemaMetadata": { "originalTypeMappings": [ { "columnName": "created_on", "originalType": "timestamp with time zone", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 6 }, { "columnName": "updated_on", "originalType": "timestamp with time zone", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 6 }, { "columnName": "identifier", "originalType": "uuid", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 0 } ] }, "status": "COMPLETE", "sizeGB": 8.493661880493164e-7, "target": "openledger.public.api_deletedimage" }, { "tableStatistics": { "extractionStartTime": "Jul 19, 2023, 11:21:00 PM", "extractionEndTime": "Jul 19, 2023, 11:21:17 PM", "partitioningInfo": { "numberOfPartitions": 1, "numberOfCompletedPartitions": 1 } }, "schemaMetadata": { "originalTypeMappings": [ { "columnName": "created_on", "originalType": "timestamp with time zone", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 6 }, { "columnName": "identifier", "originalType": "uuid", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 0 } ] }, "status": "COMPLETE", "sizeGB": 0.0001862943172454834, "target": "openledger.public.api_matureaudio" }, { "tableStatistics": { "extractionStartTime": "Jul 19, 2023, 11:21:00 PM", "extractionEndTime": "Jul 19, 2023, 11:21:20 PM", "partitioningInfo": { "numberOfPartitions": 1, "numberOfCompletedPartitions": 1 } }, "schemaMetadata": { "originalTypeMappings": [ { "columnName": "identifier", "originalType": "uuid", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "created_on", "originalType": "timestamp with time zone", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 6 } ] }, "status": "COMPLETE", "sizeGB": 0.00002559274435043335, "target": "openledger.public.api_matureimage" }, { "tableStatistics": { "extractionStartTime": "Jul 19, 2023, 11:21:00 PM", "extractionEndTime": "Jul 19, 2023, 11:21:30 PM", "partitioningInfo": { "numberOfPartitions": 139, "numberOfCompletedPartitions": 139, "constraintType": "PK", "partitionColumns": [ { "columnType": "INTEGER", "columnName": "id", "min": "1", "max": "2172271" } ], "constraintName": "audio_pkey" } }, "schemaMetadata": { "originalTypeMappings": [ { "columnName": "id", "originalType": "integer", "expectedExportedType": "int32", "originalCharMaxLength": 0, "originalNumPrecision": 32, "originalDateTimePrecision": 0 }, { "columnName": "created_on", "originalType": "timestamp with time zone", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 6 }, { "columnName": "updated_on", "originalType": "timestamp with time zone", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 6 }, { "columnName": "identifier", "originalType": "uuid", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "foreign_identifier", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 1000, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "title", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 2000, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "foreign_landing_url", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 1000, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "creator", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 2000, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "creator_url", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 2000, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "url", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 1000, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "filesize", "originalType": "integer", "expectedExportedType": "int32", "originalCharMaxLength": 0, "originalNumPrecision": 32, "originalDateTimePrecision": 0 }, { "columnName": "watermarked", "originalType": "boolean", "expectedExportedType": "boolean", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "license", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 50, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "license_version", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 25, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "provider", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 80, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "source", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 80, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "last_synced_with_source", "originalType": "timestamp with time zone", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 6 }, { "columnName": "removed_from_source", "originalType": "boolean", "expectedExportedType": "boolean", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "view_count", "originalType": "integer", "expectedExportedType": "int32", "originalCharMaxLength": 0, "originalNumPrecision": 32, "originalDateTimePrecision": 0 }, { "columnName": "tags", "originalType": "jsonb", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "meta_data", "originalType": "jsonb", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "audio_set_position", "originalType": "integer", "expectedExportedType": "int32", "originalCharMaxLength": 0, "originalNumPrecision": 32, "originalDateTimePrecision": 0 }, { "columnName": "genres", "originalType": "ARRAY", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "category", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 80, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "duration", "originalType": "integer", "expectedExportedType": "int32", "originalCharMaxLength": 0, "originalNumPrecision": 32, "originalDateTimePrecision": 0 }, { "columnName": "bit_rate", "originalType": "integer", "expectedExportedType": "int32", "originalCharMaxLength": 0, "originalNumPrecision": 32, "originalDateTimePrecision": 0 }, { "columnName": "sample_rate", "originalType": "integer", "expectedExportedType": "int32", "originalCharMaxLength": 0, "originalNumPrecision": 32, "originalDateTimePrecision": 0 }, { "columnName": "alt_files", "originalType": "jsonb", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "thumbnail", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 1000, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "filetype", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 80, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "audio_set_foreign_identifier", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 1000, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "standardized_popularity", "originalType": "double precision", "expectedExportedType": "double", "originalCharMaxLength": 0, "originalNumPrecision": 53, "originalDateTimePrecision": 0 } ] }, "status": "COMPLETE", "sizeGB": 7.258630909025669, "target": "openledger.public.audio" }, { "tableStatistics": { "extractionStartTime": "Jul 19, 2023, 11:21:00 PM", "extractionEndTime": "Jul 19, 2023, 11:21:22 PM", "partitioningInfo": { "numberOfPartitions": 1, "numberOfCompletedPartitions": 1 } }, "schemaMetadata": { "originalTypeMappings": [ { "columnName": "id", "originalType": "integer", "expectedExportedType": "int32", "originalCharMaxLength": 0, "originalNumPrecision": 32, "originalDateTimePrecision": 0 }, { "columnName": "created_on", "originalType": "timestamp with time zone", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 6 }, { "columnName": "updated_on", "originalType": "timestamp with time zone", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 6 }, { "columnName": "foreign_identifier", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 1000, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "title", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 2000, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "foreign_landing_url", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 1000, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "creator", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 2000, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "creator_url", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 2000, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "url", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 1000, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "filesize", "originalType": "integer", "expectedExportedType": "int32", "originalCharMaxLength": 0, "originalNumPrecision": 32, "originalDateTimePrecision": 0 }, { "columnName": "filetype", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 80, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "thumbnail", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 1000, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "provider", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 80, "originalNumPrecision": 0, "originalDateTimePrecision": 0 } ] }, "status": "COMPLETE", "sizeGB": 0.11108912527561188, "target": "openledger.public.audioset" }, { "tableStatistics": { "extractionStartTime": "Jul 19, 2023, 11:21:00 PM", "extractionEndTime": "Jul 19, 2023, 11:31:10 PM", "partitioningInfo": { "numberOfPartitions": 41831, "numberOfCompletedPartitions": 41831, "constraintType": "PK", "partitionColumns": [ { "columnType": "INTEGER", "columnName": "id", "min": "1", "max": "773110308" } ], "constraintName": "image_pkey" } }, "schemaMetadata": { "originalTypeMappings": [ { "columnName": "id", "originalType": "integer", "expectedExportedType": "int32", "originalCharMaxLength": 0, "originalNumPrecision": 32, "originalDateTimePrecision": 0 }, { "columnName": "created_on", "originalType": "timestamp with time zone", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 6 }, { "columnName": "updated_on", "originalType": "timestamp with time zone", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 6 }, { "columnName": "identifier", "originalType": "uuid", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "provider", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 80, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "source", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 80, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "foreign_identifier", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 1000, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "foreign_landing_url", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 1000, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "url", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 1000, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "thumbnail", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 1000, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "width", "originalType": "integer", "expectedExportedType": "int32", "originalCharMaxLength": 0, "originalNumPrecision": 32, "originalDateTimePrecision": 0 }, { "columnName": "height", "originalType": "integer", "expectedExportedType": "int32", "originalCharMaxLength": 0, "originalNumPrecision": 32, "originalDateTimePrecision": 0 }, { "columnName": "filesize", "originalType": "integer", "expectedExportedType": "int32", "originalCharMaxLength": 0, "originalNumPrecision": 32, "originalDateTimePrecision": 0 }, { "columnName": "license", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 50, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "license_version", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 25, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "creator", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 2000, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "creator_url", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 2000, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "title", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 5000, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "last_synced_with_source", "originalType": "timestamp with time zone", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 6 }, { "columnName": "removed_from_source", "originalType": "boolean", "expectedExportedType": "boolean", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "meta_data", "originalType": "jsonb", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "view_count", "originalType": "integer", "expectedExportedType": "int32", "originalCharMaxLength": 0, "originalNumPrecision": 32, "originalDateTimePrecision": 0 }, { "columnName": "tags", "originalType": "jsonb", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "watermarked", "originalType": "boolean", "expectedExportedType": "boolean", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "filetype", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 80, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "standardized_popularity", "originalType": "double precision", "expectedExportedType": "double", "originalCharMaxLength": 0, "originalNumPrecision": 53, "originalDateTimePrecision": 0 }, { "columnName": "category", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 80, "originalNumPrecision": 0, "originalDateTimePrecision": 0 } ] }, "status": "COMPLETE", "sizeGB": 2153.0414859727025, "target": "openledger.public.image" }, { "tableStatistics": { "extractionStartTime": "Jul 19, 2023, 11:21:00 PM", "extractionEndTime": "Jul 19, 2023, 11:21:27 PM", "partitioningInfo": { "numberOfPartitions": 1, "numberOfCompletedPartitions": 1 } }, "schemaMetadata": { "originalTypeMappings": [ { "columnName": "id", "originalType": "integer", "expectedExportedType": "int32", "originalCharMaxLength": 0, "originalNumPrecision": 32, "originalDateTimePrecision": 0 }, { "columnName": "identifier", "originalType": "uuid", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "reason", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 20, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "description", "originalType": "text", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "status", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 20, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "created_at", "originalType": "timestamp with time zone", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 6 } ] }, "status": "COMPLETE", "sizeGB": 0.00040696561336517334, "target": "openledger.public.nsfw_reports" }, { "tableStatistics": { "extractionStartTime": "Jul 19, 2023, 11:21:00 PM", "extractionEndTime": "Jul 19, 2023, 11:21:20 PM", "partitioningInfo": { "numberOfPartitions": 1, "numberOfCompletedPartitions": 1 } }, "schemaMetadata": { "originalTypeMappings": [ { "columnName": "id", "originalType": "integer", "expectedExportedType": "int32", "originalCharMaxLength": 0, "originalNumPrecision": 32, "originalDateTimePrecision": 0 }, { "columnName": "created_at", "originalType": "timestamp with time zone", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 6 }, { "columnName": "identifier", "originalType": "uuid", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "reason", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 20, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "description", "originalType": "text", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 0, "originalNumPrecision": 0, "originalDateTimePrecision": 0 }, { "columnName": "status", "originalType": "character varying", "expectedExportedType": "binary (UTF8)", "originalCharMaxLength": 20, "originalNumPrecision": 0, "originalDateTimePrecision": 0 } ] }, "status": "COMPLETE", "sizeGB": 0.000027455389499664307, "target": "openledger.public.nsfw_reports_audio" } ] } ```

Namely:

$ jq '.perTableStatus[] | "\(.target): \(.sizeGB) (\(.tableStatistics.partitioningInfo.numberOfPartitions) partitions)"' snapshot-export-table-info.json 
"openledger.public.api_audioaddon: 3.035544365644455 (122 partitions)"
"openledger.public.api_deletedaudio: 0 (1 partitions)"
"openledger.public.api_deletedimage: 8.493661880493164e-07 (1 partitions)"
"openledger.public.api_matureaudio: 0.0001862943172454834 (1 partitions)"
"openledger.public.api_matureimage: 2.559274435043335e-05 (1 partitions)"
"openledger.public.audio: 7.258630909025669 (139 partitions)"
"openledger.public.audioset: 0.11108912527561188 (1 partitions)"
"openledger.public.image: 2153.0414859727025 (41831 partitions)"
"openledger.public.nsfw_reports: 0.00040696561336517334 (1 partitions)"
"openledger.public.nsfw_reports_audio: 2.7455389499664307e-05 (1 partitions)"

Note that the partitions are roughly 2MB in size each

AetherUnbound commented 1 year ago

AWS also has some options for this: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/postgresql-s3-export.html (although this does not appear to support parquet)

sarayourfriend commented 1 year ago

FWIW 1 TB of sc1 EBS is only 15 dollars a month. It is a "low throughput" option, but could be just fine for this application. 1 TB of st1 EBS ("high-throughput" HDD) would be 45 dollars a month. These are the two cheapest options for storage in AWS and I don't think we need SSD performance for this application (please correct me if that is wrong).

Basically: I wouldn't worry about provisioning storage space. We should compare costs for the EC2 instance(s?) and make reservations for it rather than rely on on-demand if it's more cost-effective (depending on cadence, maybe?). But overall, 3 TB is relatively tiny, especially in HDD terms and considering our reliability needs are close to nothing.

I think the second option (EC2 specifically, ECS is unnecessarily expensive for this, there aren't any ECS features we'd need for this kind of thing) sounds perfect.

This will only let us export individual tables, if we wanted to combine them with, say, our deleted/mature tables we would have to do that after the fact.

I'd vote for the first pass not to concern itself with our mature tables, but deindexed records should be excluded from the dataset if they were removed as part of a copyright violation. It's probably something to discuss in the IP, but I think other deindexation reasons aren't relevant for this. That being said, the best way to do that might be to back-port the deindexation into the catalogue database and generate the dump from there. We'd probably still want to only "soft delete" from the catalogue though so that we can prevent re-adding those works? Not sure what the effort implications are when comparing backporting soft-deindexation to the catalogue vs generating the parquet file from the API database and cross-comparing to the deleted media table.

If at that point folding in the mature data isn't substantially more effort, then including that in the dataset under something similar to our sensitivity API response to disambiguate whether it's provider supplied sensitivity (in the future) or Openverse user reported (or something else like machine tags) so that dataset users can filter more directly. Models probably shouldn't train mature content detection based on mature labelling derived from another model (IIRC that leads to model degeneration over time), so good for people to be able to exclude based on the reason for the sensitivity. To sufficiently covey the maturity information might require a lot more effort than is appropriate for the first pass.

AetherUnbound commented 1 year ago

Thanks for the input around EC2 instances. I was hoping to avoid the need for those but it seems whatever solution we approach will need both compute and storage space. Based on the cadence of this, I do think it might be possible to have the instance and other infrastructure resources associated with it entirely managed by Airflow. I could see a DAG where Airflow provisions the instance, which kicks off the process, waits for some completion criteria (or failure), then deprovisions the instance afterwards. That way, for the quarterly/yearly cadence for this project which was mentioned previously, we aren't reserving more compute/storage than is necessary.

Having this be the first iteration of the dump is also an important thing to note. It may be possible (using the CLI tools mentioned) to issue a simple query which filters out the sensitive/deleted objects. That might be the easiest approach for this first time. The back-propagation (albeit from the API to the catalog, and we're just pulling from the API here) is its own project which we plan on tackling down the line, so I want to try and avoid it being a blocker for this first attempt.

Given the above, I think the approach generally I'm landing on for drafting this is:

  1. Spin up an EC2 instance with plenty of storage space
  2. Use one of the CLI tools mentioned to stream the tables to Parquet on-disk (for now either performing a simple SELECT * FROM <media-type> or a simple outer join on the deleted/mature tables)
  3. Upload the resulting files to S3
  4. Deprovision the EC2 instance
sarayourfriend commented 1 year ago

I could see a DAG where Airflow provisions the instance, which kicks off the process, waits for some completion criteria (or failure), then deprovisions the instance afterwards.

Sounds great! I like that plan a lot. Curious to see how you approach the technical implementation of that interaction between Airflow and the EC2 instance.

Thanks for the input around EC2 instances. I was hoping to avoid the need for those

What was the reason you were hoping to avoid EC2 instances for? If it has to do with our existing Terraform EC2 instance management, keep in mind that our legacy approach to EC2 instance management in Terraform is not ideal. To anticipate the worries there, using something like Airflow, perhaps aided by a tool like Pulumi or Terraform CDK would allow us to get closer to, e.g., the Ansible workflow. There are even ways we could use an auto-scaling group that responds to SNS messages generated by Airflow to scale from 0 to x instances and back down once the process is finished.

Basically: don't let the legacy terraform modules colour your view of EC2 instance management.

The overall approach you've described sounds great and from the EC2 instance runtime side of things, is pleasantly (relatively!) simple. Getting the infrastructure management interaction right between Airflow and AWS (and potentially Terraform), is going to be the tricky bit. But I'm confident we can do it and happy to pair-research the issue with you. I've learned a lot of stuff reading about how different folks manage Elasticsearch clusters with EC2 and there are a few different approaches that I think could work without being a headache like our existing EC2-related infrastructure is.

EC2 + EBS is going to be the most cost-efficient AWS solution, unless there's a reasonable way to avoid needing any new infrastructure at all.

sarayourfriend commented 1 year ago

Oh also wanted to add that as far as which EBS and EC2 instances we choose to start with, there is going to be some calculus of choosing fast stuff that we use fewer hours of vs slower but cheaper stuff. At some point those lines intersect and the more expensive stuff is cheaper if it completes the job so much faster. But, it's easy to scale the infrastructure as far as the EC2 instance type and what kind of EBS we attach, so I wouldn't worry too much about getting the particulars right for those, just erring on the cheaper side so that we can more easily identify obvious bottlenecks like memory or CPU when we try the process out.

Skylion007 commented 1 year ago

After some cursory research, I've come across a few possible approaches:

* Use the AWS RDS snapshots we have (already in parquet format), export the relevant tables to S3, and aggregate them into fewer files (by the looks of it Amazon spits out hundreds even for smaller tables).

  * This will only let us export individual tables, if we wanted to combine them with, say, our deleted/mature tables we would have to do that after the fact.
  * The parquet files from the snapshots are already generated, so we don't have to take the time/effort to do that ourselves.

* Spin up an EC2 instance or ECS task and use a tool like [odbc2parquet](https://github.com/pacman82/odbc2parquet) or [pg2parquet](https://lib.rs/crates/pg2parquet) to stream the tables to parquet.

  * We would need to have ample space on the instance/task in order to save the parquet to disk, or have a strategy for partitioning the tables into separate parquet files

@WordPress/openverse-maintainers, @apolinario, and @Skylion007: given our API database is a roughly 3TB Postgres RDS instance, do you have any other methods you might suggest for generating the initial dump that I should explore while drafting this?

I would personally just do a Python script to dump the parquet files out with Dask + Pandas dataframes. You could do a SQL query to generate the table you want to save and then call the method to save the parquet files out from it. You can even create and Pandas dataframe from a SQL query. The Dask wrapper around Pandas ensures that the dataframe is chunked into smaller dataframes and output files to allow higher concurrency, reduce memory usage, and make the whole process pretty easy.

Alternatively, I would just run a SQL query to save the DB out to TSV files, but that would be pretty space inefficient (although running it through gzip could help).

AetherUnbound commented 1 year ago

Sounds great! I like that plan a lot. Curious to see how you approach the technical implementation of that interaction between Airflow and the EC2 instance.

Awesome! Yea I wanted to avoid it only to see if we could avoid managing an EC2 instance (through any means). Seems like that's an inevitability so I'm working with it 🙂

Oh also wanted to add that as far as which EBS and EC2 instances we choose to start with, there is going to be some calculus of choosing fast stuff that we use fewer hours of vs slower but cheaper stuff.

For sure, I actually have some comparisons on this in my draft already! Thanks for mentioning it!

I would personally just do a Python script to dump the parquet files out with Dask + Pandas dataframes.

The bigger question for me was mostly how to manage the infrastructure, but I think I've got an approach for that in mind now. The actual "to parquet" piece can be pretty easily handled by odbc2parquet; looking into it, it seems like there are a ton of config options for us in that regard! Thanks for your input 😄

Skylion007 commented 1 year ago

Sounds great! I like that plan a lot. Curious to see how you approach the technical implementation of that interaction between Airflow and the EC2 instance.

Awesome! Yea I wanted to avoid it only to see if we could avoid managing an EC2 instance (through any means). Seems like that's an inevitability so I'm working with it 🙂

Oh also wanted to add that as far as which EBS and EC2 instances we choose to start with, there is going to be some calculus of choosing fast stuff that we use fewer hours of vs slower but cheaper stuff.

For sure, I actually have some comparisons on this in my draft already! Thanks for mentioning it!

I would personally just do a Python script to dump the parquet files out with Dask + Pandas dataframes.

The bigger question for me was mostly how to manage the infrastructure, but I think I've got an approach for that in mind now. The actual "to parquet" piece can be pretty easily handled by odbc2parquet; looking into it, it seems like there are a ton of config options for us in that regard! Thanks for your input 😄

In #2702 I linked how the database dump could be entirely within Airflow using dask (or even raw Pandas if you want to write some additional code to handle the chunking yourself): https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html that should prevent the need for the EC2 instance.

AetherUnbound commented 1 year ago

Closing for the time being (see #2545).