WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License
254 stars 203 forks source link

Rename `TAG_BLACKLIST` to `TAG_EXCLUDELIST` and add more tags #1412

Open krysal opened 2 years ago

krysal commented 2 years ago

Problem

Many tags from this featured image do not represent useful information.

1sthalf7thcbce
2ndquarter7thcbce
u03b1u03b3u03b3u03b5u03afu03b1
u03bau03b5u03c1u03b1u03bcu03b9u03bau03ae
...

Description

Add these tags to the TAG_BLACKLIST and the reingestion process should clean them from the existing images the next time it runs.

https://github.com/WordPress/openverse-catalog/blob/c1b970b11dbf63a45a8c2d6a33d6058a662b7ac9/openverse_catalog/dags/common/storage/media.py#L19-L34

Alternatives

Consider adding a UI and a database table to add tags in the future without modifying the code.

Implementation

AetherUnbound commented 2 years ago

I appreciate that we have an existing mechanism for this! We should probably also have a database table within the catalog that these could be added to, so we don't need to make a code change in order to add new tags that we find which we consider worth denylisting. (Also we should consider changing this from "blacklist" to "denylist" or a similar alternative).

krysal commented 2 years ago

We should probably also have a database table within the catalog that these could be added to, so we don't need to make a code change in order to add new tags that we find which we consider worth denylisting.

Great minds think alike! 😄 That was my alternative suggestion, do you know if this can be done somehow through Airflow UI? I thought of Django but it's not connected to the catalog DB (I think so , though maybe it can be ❔)

(Also we should consider changing this from "blacklist" to "denylist" or a similar alternative).

Totally 💯 "blacklist" was a bit of a weird name the first time I saw it. Changed the title to reflect that.

AetherUnbound commented 2 years ago

Great minds think alike! smile That was my alternative suggestion, do you know if this can be done somehow through Airflow UI? I thought of Django but it's not connected to the catalog DB (I think so , though maybe it can be grey_question)

Oops, got so excited to respond I didn't read through the entire issue, sorry! 😂 🤦🏼‍♀️

This would be something we could manage through the Airflow UI, but if that were the case it'd probably be a Variable unless we wanted to add some custom frontend pieces. The trouble with using a Variable is that it's all stored in 1 database field so if it was only there that record could become quite unwieldy.

What we could do though is us a Variable to allow for updating the table 😮 we could have it be a list, and then trigger an unscheduled DAG which would take the list within that Variable, add new records to a table which the ingestion process can use, then clear out the Variable for the next run! What do you think @stacimc?

obulat commented 7 months ago

The renaming to TAG_DENYLIST was completed in #4014, so the name of the issue is a bit confusing now.

@krysal, do you think it's best to update this issue's title to reflect that the issue now refers to creating a mechanism for updating the denylist as described in the quote below? Or is it better to close this issue as partly-completed, and open a new one for what's left?

This would be something we could manage through the Airflow UI, but if that were the case it'd probably be a Variable unless we wanted to add some custom frontend pieces. The trouble with using a Variable is that it's all stored in 1 database field so if it was only there that record could become quite unwieldy.

What we could do though is us a Variable to allow for updating the table 😮 we could have it be a list, and then trigger an unscheduled DAG which would take the list within that Variable, add new records to a table which the ingestion process can use, then clear out the Variable for the next run! What do you think @stacimc?