Closed AetherUnbound closed 5 months ago
@WordPress/openverse-catalog I'd like to take a shot at this issue. Am I correct to assume this should use the batched update DAG? And if so, I think I'd like to try it in two steps, as suggested, basically doing something like this:
select distinct for each set of tags?
Is such a thing possible with the batched update DAG? Are there any potentially helpful examples of how we've used that recently I could work off of?
@sarayourfriend You're correct. It's possible to do it with the batched_update DAG. The deletion of duplicates was resolved in https://github.com/WordPress/openverse/issues/1566#issuecomment-2038338095 and Postgres has string functions for trimming.
If possible, it'd be best to combine both of those steps into a single batched update, that way we don't have to do two passes on the data! Might make for a tricky query, but then we only have to run it once 😄
I guess select distinct trimming_function(tag.name), or something along those lines would work?
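To make the intended result of a combined trim-and-deduplicate pass concrete, here is a minimal Python sketch of the per-record logic. It is not the batched update query itself. It assumes each tag is an object with a `name` and an optional `provider`, and that two tags count as duplicates when both match after trimming; the function name is hypothetical.

```python
def trim_and_deduplicate(tags):
    """Strip whitespace from each tag name, then drop duplicates.

    Assumption: a tag is a dict with a "name" key and an optional
    "provider" key, and duplicates are tags whose trimmed name and
    provider are both equal. The first occurrence is kept.
    """
    seen = set()
    result = []
    for tag in tags:
        # Copy the tag so the input list is not mutated.
        cleaned = {**tag, "name": tag["name"].strip()}
        key = (cleaned["name"], cleaned.get("provider"))
        if key in seen:
            continue
        seen.add(key)
        result.append(cleaned)
    return result


tags = [
    {"name": "cat", "provider": "flickr"},
    {"name": " cat ", "provider": "flickr"},
    {"name": "dog", "provider": "flickr"},
]
print(trim_and_deduplicate(tags))
```

One thing to watch when translating this to SQL: Python's `str.strip()` removes all leading and trailing whitespace, while Postgres `trim()` removes only spaces by default, so the two may not agree on tags padded with tabs or newlines.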
Thanks for the input, y'all.
Reopening for the pending execution of the trim_and_deduplicate_tags DAG.
Solved in #4557.
Description
We have some records in our data where there are duplicate tags, only the duplicate tag has leading or trailing whitespace. Here's an example: https://api.openverse.engineering/v1/images/2d454032-0cc1-48a5-8f40-e9235f1a4f12/
This might need to be tackled in two steps, or at least an operation which covers both cases:
We will also want to check, similar to #1566, that any new tags added always have extra whitespace stripped.
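A check along these lines could look something like the sketch below, which reports any tag names that still carry leading or trailing whitespace. The function name and the tag shape (a dict with a `name` key) are assumptions for illustration, not existing catalog code.

```python
def untrimmed_tag_names(tags):
    """Return the names of tags that have leading or trailing whitespace.

    Assumption: each tag is a dict with a "name" key.
    """
    return [tag["name"] for tag in tags if tag["name"] != tag["name"].strip()]


incoming = [{"name": "cat"}, {"name": " dog"}, {"name": "bird "}]
print(untrimmed_tag_names(incoming))
```

An empty list means every tag in the record is already clean.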
Additional context
Related to #430