WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License
240 stars 194 forks source link

Capture upsert strategy information in media properties doc #4521

Open AetherUnbound opened 3 months ago

AetherUnbound commented 3 months ago

Description

When updating certain JSON fields in the catalog, we use a "merge" upsert strategy:

https://github.com/WordPress/openverse/blob/ca82f843d5e8097fb2310cdaaa1c8fdf3550fdab/catalog/dags/common/storage/columns.py#L70-L78

This came up in a recent discussion about tags: https://github.com/WordPress/openverse/pull/4475#issuecomment-2177205074

The catalog media properties expansion (#4366) added DB Column Type to the description of each field for easier referencing. We also add upsert_strategy to the table that gets generated at the top of the doc:

https://github.com/WordPress/openverse/blob/5041280d482bb73bd15d6641e138ec95b7487fd0/catalog/utilities/media_props_gen/helpers/column_parser.py#L52-L58

We should add an Upsert Strategy auto-generated field for each column description (similar to DB Column Type) which carries this information down to each field. It would also be helpful to have a section in the preamble/postamble which describes each update strategy and links to the code for each strategy in the columns.py file. The auto-generated field documentation can also then link to this description.

Alternatives

This was prompted by a recent mention about only new tags being added (and tags never being removed during ingestion). We could add a note explicitly to the tags column documentation instead, but it seems more prudent to do this programmatically not only because we can but because it would be more resilient going forward.

obulat commented 3 months ago

The upsert strategy is in the Python Column column of the table, because that is where we save this information:

Screenshot 2024-06-25 at 4 56 23 PM

Do you think it would be clearer to move the upsert strategy somewhere else? I do agree with the idea of adding explanations for the strategies in a postamble.

AetherUnbound commented 3 months ago

Right! I mean also surfacing it in the individual descriptions, just like media types and DB column type:

image