Determine which labels to exclude from Rekognition’s label set

AetherUnbound commented 2 months ago

Description

This will involve a manual process of looking through each of the available labels for Rekognition and seeing if they match any of the criteria to be filtered. This process should be completed by two maintainers, and their list of exclusions discussed & combined. The excluded labels should then be saved in an accessible location, either on S3 or within the sensitive terms repository as a new file. Consent & approval should be sought from two other maintainers on the accuracy of the exclusion list prior to publishing.

Additional context

See this section of the IP.

Final exclusions

This is the list of exclusions we've determined, based on the discussion below.

Exclusions

- Adult
- Alien
- Angry
- Arguing
- Baby
- Baby Crawling
- Baby Laughing
- Ballerina
- Beard
- Bishop
- Blonde
- Blue Hair
- Boy
- Bridal Veil
- Bride
- Bridegroom
- Bridesmaid
- Brown Hair
- Buddha
- Building Flooding
- Bun (Hairstyle)
- Car Back - Damaged
- Car Dent
- Car Front - Damaged
- Car Mirror - Broken
- Car Scratch
- Car Window - Broken
- Child
- Childbirth
- Corrosion
- Crucifix
- Crying
- Curly Hair
- Family
- Female
- Fireman
- Frown
- Girl
- Green Hair
- Happy
- Hippie
- Hoe
- Home Damage
- Lady
- Laughing
- Male
- Man
- Mohawk Hairstyle
- Mold
- Mold Damage
- Mustache
- Newborn
- Pain
- Pink Hair
- Pope
- Prayer
- Prayer Beads
- Priest
- Red Hair
- Romantic
- Roof Damage
- Rust
- Sad
- Senior Citizen
- Shouting
- Shrine
- Smile
- Surprised
- Teen
- Temple
- Termite Damage
- Tribe
- Triumphant
- Window - Broken
- Woman
- Yawning

Corrections

For various reasons (removing gender, capitalization correction, etc.) we plan on mapping the following terms to the corrected values.

Corrections

- Atm -> ATM
- Atv -> ATV
- Ballerina -> Ballet Dancer
- Bmx -> BMX
- Cpu -> CPU
- Dj -> DJ
- Dvd -> DVD
- Fireman -> Firefighter
- Ipod -> iPod
- RAM Memory -> RAM
- Pc -> PC
- Rv -> RV
- Suv -> SUV

AetherUnbound commented 2 months ago

I have gone through the 3k labels (🥲) and pulled out the labels that I think we should exclude or consider to be excluded. Many of the "questionable" ones I think are fine, but we said (per #4662) that we would review each one so I think they're worth considering. Personally, I think they're okay to keep but open for a blocking objection there.

Exclude

- Adult
- Baby
- Baby Crawling
- Baby Laughing
- Ballerina
- Barbie
- Beard
- Bishop
- Blonde
- Blue Hair
- Boy
- Bridal Veil
- Bride
- Bridegroom
- Bridesmaid
- Brown Hair
- Child
- Childbirth
- Curly Hair
- Exchange Of Vows
- Family
- Female
- Fireman
- Girl
- Green Hair
- Lady
- Male
- Man
- Newborn
- Pink Hair
- Red Hair
- Senior Citizen
- Teen
- Woman

Questionable

- Altar
- Astronaut
- Athlete
- Attorney
- Ballplayer
- Bartender
- Boxer
- Buddha
- Bullfighter
- Captain
- Carpenter
- Chapel
- Cheerleading
- Chef
- Chinese New Year
- Christ the Redeemer
- Church
- Crucifix
- Dentist
- Dj
- Doctor
- Executive
- Gardener
- Gymnast
- Hairdresser
- Hanukkah Menorah
- Hippie
- Judge
- Monk
- Mosque
- Musician
- Nurse
- Officer
- Performer
- Photographer
- Pianist
- Police Officer
- Pope
- Prayer
- Prayer Beads
- Priest
- Soldier
- Teacher
- Veterinarian
- Waiter

zackkrida commented 2 months ago

@AetherUnbound I agree with both of your lists but am also willing to contend with anyone else's blocking objections on some of the questionable terms.

krysal commented 2 months ago

On the excluded list, why are there labels like Curly Hair or Beard? I read the IP section, but the related criteria are not clear to me.

AetherUnbound commented 2 months ago

@sarayourfriend Here's the issue!

sarayourfriend commented 2 months ago

Here are my suggestions. For context (because we discussed this process in the IP), I have not looked at Madison's recommendations, though I did read Krystle's comment which mentioned two specific terms in Madison's recommendations.

Questionable

- Adult, child, teen (can be problematic when it comes to determining who is/is not an adult, cultural differences in expectations, legal definitions, etc)
- Buddha: misidentification could be particularly offensive?
- Temple and shrine: ditto Buddha
- Doctor/nurse, mostly because of the gendered associations where doctors are assumed to be men, nurses assumed to be women; misidentification can easily perpetuate this prejudice. May be able to include it in the future if we do a manual review of works with this tag
- Surgeon too?
- Hippie? May have political connotations?

Exclude

- Hoe (this is on our sensitive terms list; we can add this back if we refine the sensitive term detection to ignore machine generated tags?)
- Woman/Man
- Male/Female
- Boy/Girl
- Lady (neither Lord nor Gentleman are present, so there isn't a male-gendered equivalent to exclude for this one)
- Bride/Groom
- Bridegroom
- Bridesmaid
- Fireman (weirdly gendered when firefighter exists?)
- Monk (gendered in most contexts):
  - From Wikipedia: 'The Greek word for "monk" may be applied to men or women. In English, however, "monk" is applied mainly to men, while nun is typically used for female monastics.'
  - Also "nun" is not in the list?
- Tribe
  - In an Australian context, tribe can be inappropriate or offensive, mob is preferred. Reference: https://www.ipswich.qld.gov.au/__data/assets/pdf_file/0008/10043/appropriate_indigenous_terminoloy.pdf; see the "No more classifying cultures" section.
  - In a North American context, there are similar complexities, although tribe is sometimes appropriate (comes from self-description). Reference: https://americanindian.si.edu/nk360/informational/impact-words-tips

Notes and observations

- Back, from Person Description: 'back' on its own is ambiguous, I wonder how this comes across in presentation
- In damage detection, many tags exist like "Car Window - Broken", that will be filtered out by our other tag filtering (we exclude tags with `-` in this case)
- Could gendered terms that have reliable gender neutral alternatives be manually mapped in the future? For example:
  - Bride and groom -> Betrothed (popular alternative, but there isn't a totally straightforward and widely used one); another alternative (which I don't like much) could be to join them "Bride/Groom" to avoid specifying _one_ gender; we could also use it as a clue towards a neutral but thematically relevant tag like "marriage", although "wedding" is present, so marriage may be duplicative, even though "wedding" is the _event_ and "marriage" is more of a conceptual theme.
  - Fireman -> Firefighter
  - Doctor and nurse -> Healthcare provider, though this isn't exactly one-to-one, a doctor is a different thing than a nurse, but the fact of whether an image depicts one or the other is probably in the upstream metadata if it's important.
  - Monk/Nun -> ???
- GPS and LED are capitalised, but not ATM (it is "Atm") or PC ("Pc"), weird
- Iphone, Ipod are incorrectly stylised, even though iPod Shuffle is correctly stylised (with lowercase i, capital P).
- "RAM Memory" is pretty annoying, the duplication of "memory"!

sarayourfriend commented 2 months ago

@AetherUnbound Can you include explanations for the terms you've suggested to exclude and ones you felt were questionable? There is some overlap between our questionable/exclude lists, and it looks like we definitely covered the unambiguously gendered terms in our exclude lists between the two of us (for which I also didn't add an explanation because they are clearly gendered). I'm very curious about some of the others, you recommended far more than I did, in particular those related to babies, childbirth, and hair style/colour.

Regarding age: I had adult, teen, and child in the questionable category, but didn't include baby or senior citizen. Baby is, I think, rather unambiguous, and I couldn't think of any culturally sensitive reasons to exclude it like I could for adult, teen, and child. What was your reasoning? My thought process was similar for senior citizen.

For "Ballerina", I didn't include it in my lists because I thought it was gender neutral, but it looks like it's a good candidate like "fireman" for us to transpose to a gender neutral term. Fireman can easily be turned into firefighter, and ballerina could easily be "ballet dancer". I think these are unambiguous gender neutral alternatives to those words. I noted some other possible terms like this but none were as clear as firefighter and ballet dancer.

Regarding "barbie": what is the reason to exclude it? It comes from the toys category of labels, I'd expect it to only refer to the barbie style toy. Were you operating from a different assumption?

I missed "Exchange Of Vows" in my review, but likewise cannot find a reason on my own to exclude it, so would appreciate an explanation of your thought process there.

AetherUnbound commented 1 month ago

Thanks for providing your own list and notes @sarayourfriend! To answer you and @krysal - my inclusion of the other hair style/color labels were because they seemed to apply to the broader notion of "demographics". I don't feel particularly strongly about them (I was merely surprised to see them as part of the labels), so I'd be comfortable including them and removing them from my exclusions list.

As for the age related ones, "Age" was one of the categories we discussed excluding explicitly in the IP. Even though I agree that baby is fairly unambiguous, there are still some cases that might be questionable or mislabeled (would an adult in baby clothes or a diaper be labeled "baby"? I could imagine cases where the model may mislabel those). On the other side, there may be conditions that might mistakenly label a child or young adult as a senior citizen (progeria, for instance). Given the nuance there, I do think it's best to avoid age terms entirely.

"Barbie" also seemed like a gendered term to me but if we consider it as referring to the toy itself then perhaps it's not something we need to consider.

I hadn't thought about mapping terms though! That gives us an interesting opportunity to capture some of these using a more appropriate term; I'm all for it and I like your suggestions.

But your notes also bring forth an assumption I didn't realize I was making: I thought we might make all of the labels lower case before inserting them into the catalog. There indeed appears to be some information encapsulated in the capitalization though, so perhaps that may not always work. The reason I was thinking of doing this is that our tag collection endpoint is case-sensitive, and the more generic nature of these labels would be better represented without casing. Additionally, the Clarifai tags are all lower-case. What do you think? Perhaps we can lower case them if they are Caps Case, and leave as-is all other cases? (e.g. GPS or iPod)

sarayourfriend commented 1 month ago

I feel strongly that we should not lowercase everything. Aside from brand names, there are proper nouns in these tags that shouldn't be generically lower cased like "Buddha", "Christ the Redeemer" (referring to the monument in Rio de Janeiro).

The tags endpoint is case sensitive, and whether it should or shouldn't be is not a question relevant to the metadata in the catalogue, as far as I'm concerned, it's a question about how we index and search that metadata. The catalogue only provides that metadata, it does not dictate the indexed format of it with respect to search, and such a separation would be inherently and needlessly limiting to the flexibility of search. We don't need to make a decision like "lowercase all incoming tags" in the catalogue (there's no technical reason the catalogue of openly licensed works needs to do that) and whether search does that or not doesn't have anything to do with the catalogue. In fact, the catalogue should enable whatever option deemed necessary for search, whether search treats tags as case sensitive or not. Doing a blanket transformation like lowercasing the tags in the catalogue removes that flexibility from search.

I don't think we should do anything other than correct what we deem to be orthographic errors, whether those are errors like "Ipod" where the capitalisation is plainly incorrect or "GPS" and "Atm", where the error is in stylistic consistency across the incoming dataset, rather than in the single instance. I'd only say that about enrichment metadata, not provider tags, for what it's worth. We do have full control here, so while I think we shouldn't do things like lowercase all the tags in the catalogue, correcting orthographic errors or inconsistencies does seem worthwhile.

Whatever the clarifai tags are doing or did doesn't hold much sway here, I think. We didn't make decisions about that and don't know the context of them, and there's no direct reason to choose to be consistent with something we cannot explain or even know the provenance of.

On the other hand, I don't think that applies to how we do the inclusion check. For the purposes of checking whether a label should be included, I think it's fine to use a separate list of normalised labels so that, e.g., using a Python `set` of the labels makes the check O(1) for each label. To clarify, I'm only talking about label storage in the catalogue, not saying we need to at all times treat the labels in their original cases. In fact, this would remove the need to make sure that our comparison of labels for the purposes of inclusion/exclusion does not affect how we fix orthographic errors. We should make those fixes after the check (so that our orthographic fixes aren't limited by needing to keep characters in the same location in the string to use the upstream lists), and at the very least this reduces the cognitive overhead of wondering whether those orthographic differences could affect the include/exclude logic.
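
As a rough sketch of the check described above: comparison happens against a normalised `set`, while the label that gets stored keeps its original casing. The names `normalise`, `EXCLUDED_LABELS`, and `is_excluded` are hypothetical and only for illustration; they are not existing Openverse code, and the set below is a small subset of the labels discussed here.

```python
# Hypothetical sketch: membership checks use normalised labels, while the
# label stored in the catalogue is left exactly as Rekognition provided it.

def normalise(label: str) -> str:
    """Return a comparison-only form of a label (never stored)."""
    return label.strip().casefold()


# Illustrative subset of the reviewed exclusions, not the full list.
EXCLUDED_LABELS = frozenset(normalise(name) for name in ("Hoe", "Tribe", "Lady"))


def is_excluded(label: str) -> bool:
    # Set membership is O(1) on average, regardless of list size.
    return normalise(label) in EXCLUDED_LABELS
```

Because normalisation is applied only at comparison time, any orthographic fixes can be applied afterwards without affecting the include/exclude decision.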

> if we consider [Barbie] as referring to the toy itself

I don't think it's whether we consider it to be that. Rekognition says that is the case, the label is in the toys category.

Regarding the age labels, I agree, let's filter all of them. For the hair, I might agree with hair style but I'm not sure about hair colour. On the other hand, I also don't think it's important enough to argue for including them (I'd just as well exclude all of these tags in favour of other ways of enriching our metadata that don't rely on machine vision) so let's filter them as well.

Can you explain your rationale behind excluding "Exchange of Vows"?

The exclude list would be then, your list, less Barbie, plus the addition of tribe, and hoe from my list? And then whether to exclude exchange of vows depends on understanding the rationale for it?

Also: reading the Rekognition docs, it looks like the label lists we were looking at might be different from the labels used by Rekognition at the time that they processed the dataset in the grant. That's based on this section of their documentation on aliases:

> In previous versions, Amazon Rekognition Image returned aliases like 'Cell Phone' in the same list of primary label names that contained 'Mobile Phone'. Amazon Rekognition Image now returns 'Cell Phone' in a field called "aliases" and 'Mobile Phone' in the list of primary label names. If your appliction [sic] relies on the structures returned by a previous version of Rekognition, you may need to transform the current response returned by the image or video label detection operations into the previous response structure, where all labels and aliases are returned as primary labels.

(Emphasis mine)

I think this includes an assumption that labels are being added and removed, and that they might not be in the current list of tags which we've reviewed in this issue. Two things come from this:

1. I'd assumed based on the way we are talking about this (especially because we will create a list of excluded labels rather than included labels) that we will basically do something like a "if label not in excluded_labels" check to see if we should include them. We need to reverse this and instead create an explicit inclusion list, so that if there are any differences in the labels Rekognition was using at the time of the grant and the list that we reviewed, we won't be caught off guard. Only the labels we reviewed and said we want to include will be there.
2. On top of that, we should do an additional check, "if label not in included_labels and label not in all_reviewed_labels" to flag and store that in a reviewable manner (where all_reviewed_labels would be the list of labels we've reviewed in this discussion). This will make it possible to make absolutely certain that labels added to our catalogue are absolutely reviewed. We could do something like add an unreviewed note on these tags so they can go into the catalogue but be easily filtered out for the purposes of search, also allowing us to avoid needing to rerun the Rekognition ingestion process if for some reason that turns out to be arduous.
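
The two checks above could be sketched roughly as follows. This is only an illustration under stated assumptions: `Decision`, `classify`, `build_tag`, the `unreviewed` key, and the tag dictionary shape are all hypothetical, and `included_labels` / `all_reviewed_labels` are the sets named in the text.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    INCLUDE = "include"        # reviewed and explicitly allowed
    EXCLUDE = "exclude"        # reviewed and not allowed
    UNREVIEWED = "unreviewed"  # never seen during the review


def classify(label: str, included_labels: set, all_reviewed_labels: set) -> Decision:
    if label in included_labels:
        return Decision.INCLUDE
    if label in all_reviewed_labels:
        return Decision.EXCLUDE
    # Not in either set: the label was not part of the reviewed list.
    return Decision.UNREVIEWED


def build_tag(label: str, included_labels: set, all_reviewed_labels: set) -> Optional[dict]:
    """Return a tag dict to store, or None if the label is excluded."""
    decision = classify(label, included_labels, all_reviewed_labels)
    if decision is Decision.EXCLUDE:
        return None
    tag = {"name": label, "provider": "rekognition"}  # hypothetical tag shape
    if decision is Decision.UNREVIEWED:
        # Stored, but marked so it can be filtered out of search and reviewed.
        tag["unreviewed"] = True
    return tag
```

With this shape, a label that Rekognition used at the time of the grant but that is missing from the reviewed list ends up stored with the marker rather than silently included or dropped.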

krysal commented 1 month ago

> my inclusion of the other hair style/color labels were because they seemed to apply to the broader notion of "demographics". I don't feel particularly strongly about them (I was merely surprised to see them as part of the labels), so I'd be comfortable including them and removing them from my exclusions list.

@AetherUnbound Thank you for clarifying. Regarding Curly Hair at least, I see it as quite neutral; it can apply to a person regardless of gender or age, so I find it a bit strange that it is excluded. But if we continue along this line with hairstyles and broad demographics, I found a few more terms worth considering for exclusion:

Exclude

- Bun (Hairstyle)
- Mohawk Hairstyle
- Mustache
- Alien
- Hippie

I reviewed the Person Description, Profession, Symbols, and Flags categories, as these seemed the most likely to contain something related to the indicated criteria. If labels in Religion are all considered potentially sensitive (there are only 11), it might be better to exclude them altogether as well.

sarayourfriend commented 1 month ago

FWIW I'm in favour of a broader list of exclusions, with the knowledge that we can easily reverse any of those decisions or change how we decide if a label is in/out using more context in the future. We won't make search worse by excluding "too many" labels or anything like that, and I suspect there may be greater value from using the labels in context with the existing metadata for many works than in isolation anyway.

AetherUnbound commented 1 month ago

That's some really solid logic around keeping capitalization (especially given our thoughts around how we're treating the catalog), thanks for expressing that! IIRC, our search is case insensitive which is certainly what matters the most, so I agree that we don't need to alter the casing (except in the "errors" you've pointed out).

> Can you explain your rationale behind excluding "Exchange of Vows"?

Again to me, this felt like it had the propensity to be gendered. But I concede that it is more neutral, and I'm fine with leaving it out of the excluded labels.

And thanks too for your notes about the actual inclusion vs exclusion logic. I like the idea of having an unreviewed list as well! I am realizing we don't have an issue for the actual implementation of the exclusion/inclusion list, so I'll make that and link to this discussion.

> The exclude list would be then, your list, less Barbie, plus the addition of tribe, and hoe from my list? And then whether to exclude exchange of vows depends on understanding the rationale for it?

> If labels in Religion are all considered potentially sensitive (there are only 11), it might be better to exclude them altogether as well.

I hadn't considered using the categories as a way of doing more blanketed exclusions! That's a great idea Krystle, and I'm all for it. So to summarize, the final list is:

- My exclusions (minus Barbie and Exchange of Vows)
- Tribe and Hoe from Sara's list
- All labels with the Religion category
- The hairstyle additions from Krystle

Which would mean this is the full exclusion list:

Final Exclusions

- Adult
- Alien
- Baby
- Baby Crawling
- Baby Laughing
- Ballerina
- Beard
- Bishop
- Blonde
- Blue Hair
- Boy
- Bridal Veil
- Bride
- Bridegroom
- Bridesmaid
- Brown Hair
- Buddha
- Bun (Hairstyle)
- Child
- Childbirth
- Crucifix
- Curly Hair
- Family
- Female
- Girl
- Green Hair
- Hippie
- Hoe
- Lady
- Male
- Man
- Mohawk Hairstyle
- Mustache
- Newborn
- Pink Hair
- Pope
- Prayer
- Prayer Beads
- Priest
- Red Hair
- Senior Citizen
- Shrine
- Teen
- Temple
- Tribe
- Woman

And the orthographic/gendered corrections would be:

Corrections made during insertion

- Atm -> ATM
- Pc -> PC
- Iphone -> iPhone
- Ipod -> iPod
- Fireman -> Firefighter

@krysal @sarayourfriend @zackkrida, does the above look right?

Another thing that this has me thinking...I think I may have been assuming going into this that we were going to exclude the labels as they were being added to the catalog so they never even made it in[^1], but given the approach we've been taking with the catalog as a data warehouse, I'm not sure that's the best move anymore. What do you think? (CC @stacimc as well for just this particular paragraph in case you don't want to load in context for the rest of the convo!)

[^1]: From the IP:

> This section describes the criteria used for determining which machine-generated tags we should exclude when adding any new tags to the database, and what the minimum accuracy cutoff for those tags should be.

sarayourfriend commented 1 month ago

> I think I may have been assuming going into this that we were going to exclude the labels as they were being added to the catalog so they never even made it in

I definitely missed this detail from the IP and at the time would have pushed to include them in the database and filter them at the filter data step of the data refresh: indeed, to stay consistent with the data warehouse approach. However, the stakes are far lower if we don't plan to use the excluded tags for any of the ideas we've discussed about them in the near-term and are making sure to keep the dataset in S3 (as recently clarified in the project thread); we can re-load the tags with new parameters and logic at any time, in that case. It would seem more consistent to have a single place where we filter data though :slightly_smiling_face:

Also: exclusion list looks good to me! I would need to go back and look at the tags again to see if there were other capitalisation or orthographic changes, I don't remember them all off the top of my head and I stopped writing them down after the first few examples.

The other note I brought up was about tags with hyphens in them, present in the tags from the "damage detection" category (presumably intended for insurance use cases?). They'll get excluded in their current form by the filter data step. Mostly wanted to just make sure that was documented so that we could address it in some way in the future if we wanted (maybe explicitly ignore tags from that category as well? or make appropriate orthographic transformations? or re-evaluate the hyphen-in-string exclusion logic?).

> That's some really solid logic around keeping capitalization (especially given our thoughts around how we're treating the catalog), thanks for expressing that! IIRC, our search is case insensitive which is certainly what matters the most, so I agree that we don't need to alter the casing (except in the "errors" you've pointed out).

If the framing I used about the tag capitalisation helps clarify the role of the catalogue data compared to the data as searched, it would be a good thing for us to pull out into documentation about the architecture of our data and probably make sure the whole Openverse maintainers team are aware of it. It, like the catalogue being a data warehouse, is an important conceptual division between the cataloguing and retrieval aspects of our search domain. Our ongoing architectural discussion already bearing good fruit :blush:

AetherUnbound commented 1 month ago

I'll go ahead and modify the IP one more time to include an explicit note that we'll be inserting all available Rekognition data, but filter it during the data refresh process. Then I'll make an issue based on that to ensure that work is captured, with clarity on using an inclusion-based filter while taking note of the "unreviewed" labels that the filter encounters.

I'll also go through the Rekognition list one more time to see if there are any other capitalization/orthographic errors we'll need to mitigate.

> The other note I brought up was about tags with hyphens in them...They'll get excluded in their current form by the filter data step.

I'm actually not sure this is the case; we don't have any code in the current filter step that removes tags that includes hyphens as far as I'm aware, nor in the enrich tags portion of the MediaStore. I'm not sure we'd want to remove tags with hyphens in them anyway. We could add the "damage detection" category as a whole to the list of exclusions though, that might make the most sense.
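
A minimal sketch of what a category-level exclusion could look like, assuming each label is available together with the names of the Rekognition categories it belongs to. The helper name, the input shape, and the exact category strings are assumptions for illustration and are not taken from existing Openverse code.

```python
# Hypothetical sketch: exclude every label belonging to a reviewed-out category.
# Category strings are compared case-insensitively because their exact casing
# in the upstream data is an assumption here.
EXCLUDED_CATEGORIES = frozenset(
    name.casefold() for name in ("Damage Detection", "Religion")
)


def in_excluded_category(categories: list[str]) -> bool:
    """Return True if any of the label's categories is excluded wholesale."""
    return any(category.casefold() in EXCLUDED_CATEGORIES for category in categories)
```

Excluding by category means labels such as "Car Window - Broken" would be dropped without any rule about hyphens in the label text.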

AetherUnbound commented 1 month ago

I've gone through and identified a few other corrections:

Atm -> ATM
Atv -> ATV
Bmx -> BMX
Cpu -> CPU
Dj -> DJ
Dvd -> DVD
Ipod -> iPod
RAM Memory -> RAM
Pc -> PC
Rv -> RV
Suv -> SUV
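
For illustration, the corrections above could be expressed as an exact-match lookup table. The names `LABEL_CORRECTIONS` and `correct_label` are hypothetical; only the mappings themselves come from the list above.

```python
# Hypothetical sketch: orthographic corrections as an exact-match mapping.
LABEL_CORRECTIONS = {
    "Atm": "ATM",
    "Atv": "ATV",
    "Bmx": "BMX",
    "Cpu": "CPU",
    "Dj": "DJ",
    "Dvd": "DVD",
    "Ipod": "iPod",
    "RAM Memory": "RAM",
    "Pc": "PC",
    "Rv": "RV",
    "Suv": "SUV",
}


def correct_label(label: str) -> str:
    """Return the corrected label, or the label unchanged if no fix applies."""
    # Whole-label lookup, so "Atmosphere" is not touched by the "Atm" entry.
    return LABEL_CORRECTIONS.get(label, label)
```

A whole-label lookup, rather than substring replacement, keeps the corrections from leaking into unrelated labels that merely start with the same letters.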

sarayourfriend commented 1 month ago

> I'll go ahead and modify the IP one more time to include an explicit note that we'll be inserting all available Rekognition data, but filter it during the data refresh process. Then I'll make an issue based on that to ensure that work is captured, with clarity on using an inclusion-based filter while taking note of the "unreviewed" labels that the filter encounters.

Perfect, thanks for looking for those orthographic errors too.

> we don't have any code in the current filter step that removes tags that includes hyphens as far as I'm aware

I misremembered the `:` exclusion as a hyphen exclusion, apologies!

https://github.com/WordPress/openverse/pull/4684/files#diff-ca2fd64dcf026d3a3b82504d64d590d63ed8857e67e2c0e3fb6f189b7ba8ffc5R42

> We could add the "damage detection" category as a whole to the list of exclusions though, that might make the most sense.

Fine by me.

I mentioned this privately and forgot to share it here: shall we also exclude the expressions ("Expressions and Emotions" category), based on Lisa Feldman Barrett's research on the accuracy of human emotional perception based on facial characteristics (video of Barrett discussing the topic)? I'd particularly wonder about the negative emotions, due to the significance of stigma and gendered aspects of those judgements, but we can keep it simple at the start by excluding them altogether?

To give some context for anyone reading who is worried about the expansiveness of the exclusion list: there are 3082 labels in Rekognition's current data set, and we're talking about excluding around 80 labels, so roughly 2.6% of the possible labels. In other words, a minuscule amount, and one which says nothing about the actual effective number of labels we're excluding from the real dataset of labelled images, which may not encompass all 3082 available labels. Besides that, I believe we have justified these exclusions under the conditions discussed in the implementation plan.

AetherUnbound commented 1 month ago

That's a fair point about the expressions & emotions too - I'm also fine excluding those. I've added the final list of exclusions & corrections to the issue description at the top. I feel like we've reached a good consensus on this, which is exciting! 😊

I think it might make sense to have that list codified in the project planning documents, I'll have a PR to do that which closes this issue.
