WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License

Decode and deduplicate tags in the catalog with the `batched_update` DAG #4452

Open krysal opened 4 weeks ago

krysal commented 4 weeks ago

Problem

In #4143, @obulat proposed adding new cleaning steps to fix the tags in the Catalog, but the option of including them in the Ingestion Server was declined in favor of using the `batched_update` DAG.

It is important to fix the encoding quickly because it can cause a gray Nuxt error screen on pages that contain tags with character sequences that cannot be URI-encoded (such as `udadd` in "ciudaddelassiencias").
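To make the failure mode concrete, here is a minimal Python sketch. It rests on an assumption that is not stated above: that a naive decoder treats any `u` followed by four hex digits as an escape, so `udadd` becomes the lone surrogate U+DADD. The names `naive_decode` and `can_percent_encode` are hypothetical, and Python's `urllib.parse.quote` only stands in for the browser-side URI encoding.

```python
import re
from urllib.parse import quote

# Hypothetical naive pattern: every "u" followed by four hex digits
# is treated as a Unicode escape, even inside an ordinary word.
ESCAPE_LIKE = re.compile(r"u([0-9a-fA-F]{4})")


def naive_decode(tag: str) -> str:
    """Replace every escape-like sequence with the code point it names."""
    return ESCAPE_LIKE.sub(lambda m: chr(int(m.group(1), 16)), tag)


def can_percent_encode(text: str) -> bool:
    """Return True if the text can be percent-encoded as UTF-8."""
    try:
        quote(text)
        return True
    except UnicodeEncodeError:
        # Lone surrogates cannot be encoded as UTF-8.
        return False


decoded = naive_decode("ciudaddelassiencias")
print(can_percent_encode("ciudaddelassiencias"))
print(can_percent_encode(decoded))
```

Under that assumption, the original tag encodes without trouble while the naively decoded one does not, which is why the cleanup needs to skip such sequences rather than decode them.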

Description

We want to take the functions that were planned for inclusion in that PR and translate them into parameters for this DAG. Given the complexity of the decoding transformation, it might require some advanced PostgreSQL features, such as a combination of pattern matching and PL/Python functions.
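As a rough illustration of the pattern-matching part, the sketch below decodes literal `\uXXXX` escapes in a tag name. It is not the code from #4143: the function name `decode_tag_name`, the regular expression, and the assumption that broken tags contain one or more literal backslashes before `uXXXX` are all hypothetical. The same body could in principle be wrapped in a PL/Python function where that extension is available.

```python
import re

# Assumption: encoded tags contain one or more literal backslashes
# followed by "u" and four hex digits, e.g. the six characters \u00e9.
UNICODE_ESCAPE = re.compile(r"\\+u([0-9a-fA-F]{4})")


def decode_tag_name(name: str) -> str:
    """Decode literal \\uXXXX escapes, leaving surrogate code points alone."""

    def replace(match: re.Match) -> str:
        code_point = int(match.group(1), 16)
        # A surrogate on its own is not valid text and cannot be
        # URI-encoded, so keep the original characters in that case.
        if 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)
        return chr(code_point)

    return UNICODE_ESCAPE.sub(replace, name)


print(decode_tag_name("caf\\u00e9"))
print(decode_tag_name("ciudaddelassiencias"))
```

Requiring the backslash means ordinary words that merely contain `u` plus hex-looking letters are left untouched. Whether the real data always keeps the backslash is something to verify against the catalog before relying on this.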

Duplicated tags were previously removed in #1566, so we will apply the same solution here, given that the decoding may produce new duplicates.
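For reference, the intended result of the deduplication step can be sketched in Python. This is not the solution used in #1566, only an illustration. It assumes each tag is an object with `name` and `provider` keys and that two tags are duplicates when both values match; `deduplicate_tags` is a hypothetical name.

```python
def deduplicate_tags(tags: list[dict]) -> list[dict]:
    """Drop repeated tags, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for tag in tags:
        # Assumption: a tag is identified by its name and provider together.
        key = (tag.get("name"), tag.get("provider"))
        if key in seen:
            continue
        seen.add(key)
        unique.append(tag)
    return unique


tags = [
    {"name": "café", "provider": "flickr"},
    {"name": "cafe", "provider": "flickr"},
    {"name": "café", "provider": "flickr"},  # appears after decoding
]
print(deduplicate_tags(tags))
```

Running the deduplication after the decoding step matters, because two tags that differed only in their encoding become identical once decoded.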

Additional context

Related to #4125 and #4199 (similar issue).

sarayourfriend commented 2 days ago

I tried to run the DAG again today, and it turns out RDS does not support the PL/Pythonu extension :disappointed:

RDS does support plperl, plrust, plpgsql, and plv8. Of those, plv8 (V8 being the same JS engine that Chrome uses) might be the closest fit for this use case, but I'll see. It might be that reingesting these records is the easiest, most reliable way to do it.