WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License
255 stars 204 forks source link

Remove and de-duplicate tags with leading/trailing whitespace #4199

Closed AetherUnbound closed 5 months ago

AetherUnbound commented 7 months ago

Description

We have some records in our data where there are duplicate tags, only the duplicate tag has leading or trailing whitespace. Here's an example: https://api.openverse.engineering/v1/images/2d454032-0cc1-48a5-8f40-e9235f1a4f12/

"tags": [
        {
            "name": "abigfave",
            "accuracy": null
        },
        {
            "name": " abigfave",
            "accuracy": null
        },
        {
            "name": "artisanal",
            "accuracy": null
        },
        {
            "name": " artisanal",
            "accuracy": null
        },
        {
            "name": "color",
            "accuracy": null
        },
        {
            "name": " color",
            "accuracy": null
        },
        ...
    ],

This might need to be tackled in two steps, or a least an operation which covers both cases:

  1. Remove leading/trailing whitespace from existing tags
  2. Deduplicate tags which might previously have been separate ones

We will also want to check, similar to #1566, that any new tags added always have extra whitespace stripped.

Additional context

Related to #430

sarayourfriend commented 6 months ago

@WordPress/openverse-catalog I'd like to take a shot at this issue. Am I correct to assume this should use the batched update DAG? And if so, I think I'd like to try it in two steps, as suggested, basically doing something like this:

  1. Batch update to strip leading/trailing whitespace
  2. Batch update using something like select distinct for each set of tags?

Is such a thing possible with the batched update DAG? Are there any potentially helpful examples of how we've used that recently I could work off of?

krysal commented 6 months ago

@sarayourfriend You're correct. It's possible to do it with the batched_update DAG. The deletion of duplicates was resolved in https://github.com/WordPress/openverse/issues/1566#issuecomment-2038338095 and Postgres has string functions for trimming.

AetherUnbound commented 6 months ago

If possible, it'd be best to combine both of those steps into a single batched update, that way we don't have to do two passes on the data! Might make for a tricky query, but then we only have to run it once 😄

sarayourfriend commented 6 months ago

I guess select distinct trimming_function(tag.name), or something along those lines would work?

Thanks for the input, y'all.

krysal commented 5 months ago

Reopening for the pending execution of the trim_and_deduplicate_tags DAG.

krysal commented 5 months ago

Solved in #4557.